Entropy as Information

How is information quantified?

This is a big question that I had when I first heard about information theory. When you or I think of words with a lot of information, we might imagine a juicy bit of gossip spreading around the office, or maybe a very interesting news article describing the latest political upset. But this kind of information is really subjective, meaning that what is a lot of information for one person is old news to another. How do you quantify something as ambiguous as information?

First I’ll say that in information theory, information is not about the meaning behind the words. In fact, the words could be literally meaningless. They don’t even have to be words! Sure, we work with things that do ultimately have meanings behind them, but information theory doesn’t care about that. What does information theory care about then? In a nutshell, “surprise.” Information theory tries to quantify how surprised you are by a certain outcome. If you think about it, information and surprise are related: if you are surprised by something, that something probably contains a good piece of information. Again, though, “surprise” is subjective. How can you quantify these ideas?

Information theory relies on the probability of something (or multiple things) happening in order to quantify surprise. If something is unlikely to happen, then you would be surprised if it actually happens; but since it is unlikely, you will rarely be surprised on average. On the other hand, if something is very likely, then you won’t be surprised on average either. This notion is captured by entropy, which is a measure of the uncertainty of a system. In this blog post, I talk about how I like to think of entropy and how it relates to information theory. Later, I’ll discuss how these ideas can be applied to autonomous exploration.

A crystal ball that predicts the weather

Imagine your friend asks you for a crystal ball for Christmas; specifically, this crystal ball should predict whether or not it will rain today. This is an example of a binary predictor, where there are two mutually exclusive options to choose from. When your friend queries the crystal ball, it gives them a prediction of rain or no rain. For now, we’ll just look at the probability that the crystal ball is correct, $P(correct)$.

You don’t really like your friend, so you want to get them a crystal ball that produces the worst results possible. What would that be? Well, the best result would be a crystal ball that is always right, or $P(correct)=1$. You might reason that the worst result would then be a crystal ball that always gives the wrong answer, or $P(correct)=0$. Here is a table of what the completely wrong crystal ball might predict over the course of five days:

| Day | Crystal ball's prediction | Actual weather |
| --- | ------------------------- | -------------- |
| 1   | No rain                   | Rain           |
| 2   | Rain                      | No rain        |
| 3   | Rain                      | No rain        |
| 4   | No rain                   | Rain           |
| 5   | No rain                   | Rain           |

At first, your friend might be really peeved that they keep getting caught in the rain without an umbrella, but after they dry off they might realize that they can still use the crystal ball: if it is always wrong, then they just assume the opposite of whatever prediction it gives to get the right answer. If it says it will rain, then your friend knows it’s a sunny day, and if it says there will be no rain that day, they had better bring a raincoat!

Imagine taking a multiple choice test where each question has only two options. If you’re looking to always be wrong on each question, you have to know every right answer in one way or another. From the perspective of information theory, always being wrong is equivalent to always being right, since you need the same information to do either. But if this is the case, what makes the worst crystal ball?

The worst gift would be a crystal ball that is sometimes right, sometimes wrong. In fact, your intuition might be telling you that the worst case scenario is a crystal ball that is correct only $50\%$ of the time, and you would be correct. Think about it: if the crystal ball was, say, $90\%$ correct, then you know that on average you could trust what it tells you. But if the crystal ball was right only half the time, then you wouldn’t be able to tell at all if it will rain that day, not even on average. In other words, the crystal ball that has $P(correct)=P(incorrect)=0.5$ cannot reliably give you any information, and you will always be surprised by the weather when you actually walk outside. If instead $P(correct)=1$ or $P(incorrect)=1$, then you will never be surprised.

Information Content and Shannon Entropy

How can we further strengthen these notions of surprise when dealing with random variables like the weather? For a single outcome, we just saw that the amount of surprise for that outcome is related to its probability. In information theory, surprise is quantified by something called the information content $I$, defined for a random variable $X$ and a particular outcome $O$ of that variable with probability $p_O$:

\begin{equation} I_X(O) \equiv -\ln(p_O) \end{equation}

There is no strict derivation for this quantity; it has no physical meaning, and in a sense it is completely made up by humans. But it does have nice properties that make it ideal to work with. We won’t discuss them all here, but notice how if $p_O=1$, then $I_X(O)=0$. In English: if a particular outcome is certain to occur, then we are not at all surprised when we see it. On the other hand, if $p_O\rightarrow 0$, then $I_X(O)\rightarrow\infty$. This means that as an event becomes more and more unlikely to occur, we become more and more surprised if we actually see that outcome.
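To see these limits concretely, here is a minimal Python sketch (my own illustration, not from any particular library) evaluating the information content at a few probabilities:

```python
import math

def information_content(p_outcome):
    """Surprise of seeing an outcome that occurs with probability p_outcome."""
    return -math.log(p_outcome)  # natural log, so the result is in nats

# Certain outcomes carry no surprise; rare outcomes carry a lot.
for p in [1.0, 0.5, 0.1, 0.001]:
    print(f"p = {p}: I = {information_content(p):.2f} nats")
```

Note that `math.log(0.0)` raises a `ValueError`, which mirrors the fact that the surprise of an impossible event diverges.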

In our crystal ball example, we can track its information content by considering whether its weather prediction is correct, with probability $p_c$, or incorrect, with probability $p_{\neg c}=1-p_c$. If $p_c=1$, then we are not at all surprised if the crystal ball correctly predicts the weather ($I(c)=0$), but we would be infinitely surprised if it were incorrect ($I(\neg c)=\infty$). Since the information content is tied to the outcome of a random variable, we cannot talk about the information content of the crystal ball itself; we can only state the information content of the crystal ball when it gives us a particular outcome. How can we talk about the crystal ball without referencing a certain outcome?

The answer comes from averaging over all possible outcomes to find the average information content provided by the crystal ball. This average information content is called the entropy $H(X)$ of the random variable $X$:

\begin{equation} H(X) = -\sum_Op_O\ln(p_O) \end{equation}

The information entropy quantifies how much information can be gleaned from a random variable $X$ on average. For a binary random variable like our crystal ball, where the two outcomes are mutually exclusive and exhaustive ($p_{\neg c}=1-p_c$), the entropy is given by a sum of two terms:

\begin{equation} H = -p_c\ln(p_c) - (1-p_c)\ln(1-p_c) \end{equation}

Here is a graph of the entropy as a function of $p_c$:
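The curve itself is easy to generate; here is a minimal sketch, assuming NumPy and matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)  # stay away from log(0) at the endpoints
H = -p * np.log(p) - (1 - p) * np.log(1 - p)  # binary entropy in nats

plt.plot(p, H)
plt.xlabel(r"$p_c$")
plt.ylabel(r"Entropy $H$ (nats)")
plt.title("Entropy of the crystal ball's predictions")
plt.show()
```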

The entropy of the crystal ball approaches zero when $p_c=1$ and when $p_c=0$. In these scenarios, we are never surprised by the weather when we walk outside, because the crystal ball’s prediction tells us exactly what will happen (at $p_c=0$ we simply assume the opposite of what it says). If $p_c=0.5$, we are maximally surprised by the weather because the crystal ball is maximally uninformative.

Using Entropy for Autonomous Exploration

This idea of entropy is used a lot in autonomous vehicle motion planning. In particular, it is used when a robot doesn’t know about its surroundings and must explore to create a map that it can use. Suppose that a robot is tasked with observing the occupancy of room A and room B. Somehow, it already knows the occupancy of room B (either it was given this information, or it had already observed the occupancy of that room). The robot must choose which room to go to:

The obvious choice is room A, since observing room B wouldn’t result in any net increase in information. In the context of information theory, we would say that the probability of occupancy of room B is either $p_B=1$ or $p_B=0$, depending on whether the room is occupied or not (in the above figure, room B is occupied by a pupper). In either case, the entropy of room B is zero. The probability of occupancy of room A, on the other hand, would be $p_A=0.5$, since we are completely uncertain about its occupancy, and this is exactly the probability at which entropy is maximized (this is a case of binary classification: occupied or unoccupied). The robot can recognize that if room A is observed, its probability of occupancy will go from $p_A=0.5$ to either $p_A=1.0$ or $p_A=0.0$, reducing the total entropy of both rooms down to zero. Using information theory for a two-room map might seem like overkill, but consider a map made of more rooms:

As it explores, the robot has to decide which rooms it will visit, and in what order. In the above picture, a human might reason to explore rooms A-D first, simply because they are rooms with unknown occupancy that are clustered together. With information theory, a robot will do the same, but with a more analytical reason for doing so: by observing rooms A-D first, it will reduce the total entropy of the map more than by visiting any other four consecutive rooms (e.g. rooms D-G contain only two rooms of unknown occupancy).
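Here is a minimal Python sketch of this reasoning (the function, room probabilities, and route names are my own, and travel cost is ignored, so a route is scored only by the total entropy it would remove):

```python
import math

def binary_entropy(p):
    """Entropy (nats) of a binary outcome (occupied/unoccupied) with probability p."""
    if p == 0.0 or p == 1.0:
        return 0.0  # a known room contributes no uncertainty
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# The two-room example: room B is known to be occupied (the pupper).
print(binary_entropy(0.5))  # room A: ~0.69 nats, worth observing
print(binary_entropy(1.0))  # room B: 0.0 nats, nothing to learn

# Larger map: score candidate routes by the entropy they would remove.
# 0.5 = unknown occupancy; 0.0 or 1.0 = already known.
routes = {
    "A-D": [0.5, 0.5, 0.5, 0.5],  # four unknown rooms clustered together
    "D-G": [1.0, 0.5, 0.0, 0.5],  # only two rooms of unknown occupancy
}
best = max(routes, key=lambda r: sum(binary_entropy(p) for p in routes[r]))
print(best)  # "A-D": removes ~2.77 nats vs ~1.39 for "D-G"
```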

In realistic applications, the robot tries to create an occupancy map by observing the space around it with grid cells. Each grid cell represents a small, discretized section of space that can be either occupied or unoccupied, just like the rooms in the example above. Here is a simple cartoon showing this idea:

In order to explore its surroundings, a robot often seeks to reduce the overall entropy of the entire map. By reducing entropy, it reduces the overall “surprise” of where obstacles are in the environment. This is, after all, the ultimate goal of mapping: to not be surprised by what the robot observes!
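As a rough sketch of what this looks like in code (assuming independent grid cells stored as occupancy probabilities in a NumPy array; the function is my own framing, not any particular planner’s API):

```python
import numpy as np

def map_entropy(grid):
    """Total entropy (nats) of an occupancy grid.

    Each cell of `grid` holds P(occupied): 0.5 means unknown,
    0.0 or 1.0 means the cell has been observed.
    """
    p = np.clip(grid, 1e-9, 1.0 - 1e-9)  # guard against log(0)
    return float(np.sum(-p * np.log(p) - (1.0 - p) * np.log(1.0 - p)))

grid = np.full((10, 10), 0.5)  # a fully unexplored 10x10 map
print(map_entropy(grid))       # ~69.3 nats of uncertainty

grid[:5, :] = 1.0              # the robot observes the top half as occupied
print(map_entropy(grid))       # ~34.7 nats: half the "surprise" is gone
```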

Conclusion

The main purpose of this blog post was to give an intuition for entropy as a measure of information. Specifically, I showed how entropy can be linked to the “surprise,” or information, available in a random variable. These ideas were then linked to autonomous navigation when exploring an unknown environment. Although a detailed algorithm was not provided (maybe the subject of a future blog post), the general idea is that a robot can create a better map by reducing the overall entropy of the probabilistic occupancy grid that tracks the known locations of obstacles in the environment.