# Reinforcement Learning 1: Policy Iteration, Value Iteration and the Frozen Lake

** Published:**

### First Steps in Reinforcement Learning

Reinforcement learning as a whole is concerned with learning how to behave to get the best outcome given a situation. Although there are many areas of application, the most well-known is video games. Given where you are in the virtual world and the position of the enemies around you, what’s the best action to take? Should you walk forward or jump on the platform above? These questions are split-second decisions for humans, but are non-trivial for a computer to figure out.

In practice, reinforcement learning operates a lot like how you are I might learn to play a video game. We give the computer goals to achieve and things to avoid, and it takes many attempts (called “episodes”) to figure out how to play the game; if there’s an enemy ahead, jump, else move forward. Codifying these ideas into a mathematical framework is the major idea behind reinforcement learning.

In this post, I’ll review the basic ideas behind reinforcement learning and discuss two basic algorithms - policy iteration and value iteration. I’ll also explain how to use OpenAI Gym, a popular python package used for testing different RL algorithms. With it, I’ll use policy iteration and value iteration to teach a computer how to walk on a frozen lake.

The next section is a very brief overview of basic concepts in RL. Although they aren’t difficult, there are many nuances and side discussions that are worth having but aren’t included for brevity. The reader is directed to Sutton and Barro’s Introduction to Reinforcement Learning, a text that deservedly holds a place as the first text almost everyone encounters when first learning about RL. The notation I use in this post is borrowed from that book.

# The Best Action to Take: The Foundation of RL

So, how does a computer know what to do in a given situation? In order to answer that question, we need to set the stage and define the terminology of reinforcement learning.

A state $s_t$ defines the what the player and/or environment looks like at time $t$. For example, $s_t$ might describe the position the player is at, or where the enemies are located. In this state, there is a set of actions that the player can take. Go left? Go right? Jump? Each action that a player can take is denoted as $a$.

If we are in state $s$ and decide to take action $a$, we are taken to some state $s’$ (which could, generally speaking, be the same state $s$). The mathematical description of what takes us from state $s$ to state $s’$ under action $a$ called the *model*. In some sense, one can think of the model as the equations of motion. If I am sitting in a car at rest (state $s$) and put the pedal to the metal (action $a$), then my car move forward and increase its speed (new state $s’$).

In general, there is some probability of entering state $s’$ from state $s$ under action $a$. This probability can be described as:

\begin{align} \label{eq:transition_noreward} p(s’|s,a) \end{align}

For those that might be unfamiliar with this notation, the above notation says: given some current state $s$ and action $a$ (terms to the right of the bar), what is the probability to transitioning to state $s’$ (terms to the let of the bar)? These transition probabilities describe a Markov Decision Process (MDP). An MDP is much like a Markov chain, accept the probability are also determined by an action that we (or a computer that plays for us) must choose. From a state $s$, choosing an action $a$ determines the probabilities and states $s’$ to which our system may transition.

Usually, we choose an action based on our current state $s$. And, just like humans, RL algorithms keep track of a strategy, or *policy*, that it uses when it encounters a state. This policy is denoted by $\pi(a|s)$, which is the probability of choosing action $a$ given that we are in the current state $s$.

Aside: one question you might have is why we bother to define probabilities state-action $(s,a)$ to new state $s’$, $p(s’|s,a)$, and probabilities from state to action, $\pi(a|s)$ separately; why not combine them as $\sum_a p(s’|s,a)\pi(a|s) = p(s’|s)$, summing over all possible actions from state $s$ to state $s’$ and eliminating this action variable? Well for one, usually only a single action $a$ can take you from $s$ to $s’$, so that sum reduces to a single term. But the bigger idea is that these two probabilities describe two very different things. The transition probabilities $p(s’|s,a)$ describe how the world works, and is largely out of our control. What actions we take, however, are in our control. In RL, finding the best policy $\pi(a|s)$ is the goal, or in other words, RL seeks to find the best action to take for any state $s$ we might find ourselves in.

The last piece of the puzzle is telling the computer when it does something good, and when it does something bad. This is done through the use of *rewards*. Rewards are typically define by certain states that the system can achieve. For Mario, his goal is to reach the end of the level, so when he reaches the end, we might give him a positive reward. If Mario hits a goomba or falls in a hole, we might give him a negative reward. Mario would then associate the actions that led him to his most recent outcome with a positive reward (if he completed the level) or a negative reward (if he died). This would help him update his policy for achieve a higher reward or help him avoid a negative reward.

Rewards are incorporated into the transition probabilities define above, so Eq. \ref{eq:trans_noreward} is altered only slightly:

\begin{align} \label{eq:trans_prob} p(s’,r|s,a) \end{align}

Eq. \ref{eq:trans_prob} is read as the probability of transitioning to state $s’$ and receiving reward $r$, starting in state $s$ and applying action $a$. As mentioned before, rewards are usually associated with states $s’$. Positive rewards are associated with desired states (e.g. goals in a game) and negative rewards are associated with undesired states (e.g. hitting a goomba). Other than that general guideline, rewards can be defined as anything we might want. In general, the policy that the RL algorithm ultimately learns is dependent on how we define the rewards, and there common cases where poorly-defined rewards cause undesired behavior. But that is something we can explore later.

With this whole system of rewards and transitions in place, what exactly do we want the RL algorithm to do? Every time the policy chooses an action and the system transitions from one state to another, the computer gets some reward $r$. This reward is cumulative, meaning each step in time gains some reward $R_{t+1} = R_t + r$, where we usually initialize $R_0=0$. The total reward is then dependent on the starting state and the subsequent actions that we took.

Suppose we are at time $t$ in the current game. If we are trying to decide what action to take next, one logical thing to do would be to try to look into the future and *maximize our future rewards*. If we currently have cumulative reward $R_t$, then the expected return $G_t$ is simply the sum of all future rewards:

\begin{align} \label{eq:return_nodiscout} G_t = R_{t+1} + R_{t+2} + … + R_{T} \end{align}

Here, $T$ is the maximum time of playing. Because each of these $R_{t+i}$ future rewards are generally stochastic variables, we would like to *maximize the expected return*. If there is no maximum time of playing $T$, the sum in infinite and there is sometimes a risk of a reward accumulating to infinity. In this case, the maximum expected return would also be infinite. Since infinites are usually difficult to work with mathematically, all future rewards from current time $t$ are “discounted” by a factor $\gamma$. The discounted return is defined by:

\begin{align} \label{eq:discounted_return} G_t = R_{t+1} + \gamma R_{t+2} + … = \sum_{k=0}^\infty\gamma^kR_{t+k+1} \end{align}

In order for this infinite sum to remain finite, the discount factor $\gamma\in[0,1]$. One can think of $\gamma$ as a factor that reduces the importance of rewards from the current state. If $\gamma=0$, then our strategy to maximize $G_t$ would be to take actions that immediately give us positive reward $r$. If $\gamma=1$, then our strategy would look at long-term rewards and perhaps would take a negative reward now for the chance of a bigger positive reward later.

Lastly, let’s discuss value functions and action-value functions. A value function $v_\pi(s)$ is simply the expected value for the return, given a policy $\pi(a|s)$, starting in state $s$:

\begin{align} v_\pi(s) = E_\pi [ G_t | S_t=s ] = E_\pi \left[ \sum _{k=0}^\infty \gamma^k R _{t+k+1}|S_t=s \right] \end{align}

Here, $E[ \cdot ]$ is the expectation value of a stochastic variable. For anyone unfamiliar, the expectation value is exactly the same as the average value: if start in state $s$ a million times and use policy $\pi$ each time, what will be the average reward that we get? As you might guess, getting this value (or at least a good estimate of this value) involves the state transition probabilities.

The action-value function is defined in a similar way, except it describes the expected return if we start in state $s$, take action $a$, and then afterwards always follow the same policy $\pi$:

\begin{align} q_\pi(s,a) = E_\pi [G_t | S_t=s, A_t=a ] = E_\pi\left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t=s, A_t=a\right] \end{align}

Note that $v_\pi(s) = q_\pi(s,\pi(s))$.

The value function and the action-value function are essentially relative measures of the policy $\pi$. If $\pi$ is a terrible policy, then the value function will be small (probably zero or even negative) values for all states $s\in S$. There is at least one policy that is better that all the rest, producing the maximum value function and action value function. In RL, the goal then is to find the policy that maximizes the value function for all state values:

\begin{align} v_*(s) = \text{max}_{\pi} v_pi(s) \end{align}

Optimal policies also share the same optimal action-value function:

\begin{align} q_*(s,a) = \text{max}_{\pi}q(s,a) \end{align}

Our goal for this post is to find the optimal policy, given transition probabilities $p(s’,r|s,a)$ and reward values $r$. The two standard ways of doing so is with value iteration and policy iteration.

# Value Iteration

We’ll first start with value iteration, as I believe it is the easier to understand conceptually. I’ll show the algorithm, then step through the first several iterations:

- Start with model $p(s’,r|s,a)$
- Initialize value function $v(s)=0$ for all states $s$
- Initialize new value function $v’(s)=0$ to be used in do-while loop
- do:
- for $s\in S$:
- $v(s) = v’(s)$
- $v’(s) = \text{max}_a \sum _{s’\in S} P(s’,r|s,a)\left( r + \gamma v(s’) \right)$
- while $|v’(s) - v(s)|_1 < \text{small number}$
- return v(s)

In order to show what this algorithm does, lets look at a simple example. Below is a 1D world comprised of seven squares. Our goal is to reach the rightmost square, and avoid the leftmost square. If we reach either, the game is over. In accordance with this goal, we set the following reward of +10 if we reach the right square, and a reward of -1 if we reach the left square. All other state-action pairs result in a reward of zero. For our small world, let $\gamma=1$.

Inside each square is the value for $v(s)$, which we initialize to zero everywhere:

Our action space is simple: we can move left, move right or stand still. Every time we take an action, we are certain to complete that action. This means that, for our model, our probabilities are all ones or zeros. For example, $p(s’=4,r=0|s=3,a=\text{right})=1$, and $p(s’=7,r=10|s=2,a=\text{right})=0$.

Now let’s go through the algorithm. We already initialized our value function to zero for all states (step 2), so now we enter the do-while loop. For all states, look at each action and the associated reward $r$ and discounted value $\gamma v(s’)$. For the first iteration of this loop, this simply returns the rewards associated with the two end-game states:

Line 6 of the algorithm simply says look at the biggest change in the value function. If the biggest change is bigger than some small number, then continue the loop. In our case, square 7 saw a change of +10. So we continue the loop.

Two quick notes:

- Rewards for states 1 and 7 are given only when you
*leave*a state. So no matter what action is taken in those states, you get the same reward and the episode is ended.

Because states 1 and 7 are end-game states, their values cannot change - we can’t get any more reward, because the game has ended! So we’ll focus on the other states and see how they change.

Below is a list of how the value of each state is updated on the second iteration:

- State 2: maximum action = stay, $v’(s=2) = r + \gamma v(2) = 0$
- State 3: maximum action = stay, $v’(s=3) = r + \gamma v(3) = 0$
- State 4: maximum action = stay, $v’(s=4) = r + \gamma v(4) = 0$
- state 5: maximum action = stay, $v’(s=5) = r + \gamma v(5) = 0$
- state 6: maximum action =
**right**, $v’(s=6) = r + \gamma v(7) = 10$

Now the value function looks like so:

Let’s go ahead and do the third iteration:

- State 2: maximum action = stay, $v’(s=2) = r + \gamma v(2) = 0$
- State 3: maximum action = stay, $v’(s=3) = r + \gamma v(3) = 0$
- State 4: maximum action = stay, $v’(s=4) = r + \gamma v(4) = 0$
- state 5: maximum action =
**right**, $v’(s=5) = r + \gamma v(6) = 10$ - state 6: maximum action =
**right**, $v’(s=6) = r + \gamma v(7) = 10$

The value function is now:

Notice how information of state seven’s reward of +10 propagates backwards to the other states. For this reason, line 7 in the above algorithm is sometimes called the Bellman Backup Operation. Every iteration, the value function we’re computing gets closer to the optimal solution, and it does so by ``backing up’’ the information of rewards to all other states.

The program terminates on the optimal value function:

The optimal value function above shows that every state can achieve a high reward. Given this value function, what is the optimal policy we should follow? That is simple: choose the action that maximizes $r + \gamma v_*(s’)$. In this example, though, you may notice a problem: there are states where what action to take is ambiguous. For example, both state three and state five have a value of +10, so that a policy trying to learn about state four has to deal with this ambiguity.

One simple way to address this problem is by setting the discount factor $0 \le \gamma\le 1$. Suppose we perform the same algorithm outlined above, but instead set $\gamma = 0.9$. The resulting optimal value function would look like:

Now it is clear where to go no matter what state we’re in - simply follow the path of increasing reward.

There is a second iterative to find the optimal policy, called policy iteration.

# Policy Iteration

As the name suggests, policy iteration iterates through policies until it converges on the optimal policy. Below is the algorithm:

- Start with model $p(s’,r|s,a)$
- Initialize policy $\pi(a|s)$ randomly for all states $s$
- Initialize current value function $v(s)=0$ for all states
- do:
- for $s\in S$:
- $v(s) = v(s’)$
- $v’(s) = \sum _{s’\in S, a \in A} P(s’,r|s,a)\pi(a|s)\left( r + \gamma v(s’) \right)$
- while $|v’(s) - v(s)|_1 < \text{small number}$
- Initialize new policy $\pi’(a|s)$
- for $s\in S$:
- $\pi’(a|s) = \text{max}_a \sum _{s’} p(s’,r|s,a)\left( r + \gamma v(s’) \right)$
- if $\pi’ \neq \pi$:
- go to line 4
- return $\pi(a|s)$

The idea is this: first, initialize a random policy. Then, find the value function for that policy (lines 4-8). With the value function, construct a new policy $\pi’(a|s)$ that maximizes $r + \gamma v(s’)$, i.e. immediate reward (by taking some action) plus discounted future rewards (by following old policy $\pi$ thereafter). If $\pi’ = \pi$, then you have found the optimal policy. If not, then return to line 4 with this new policy to relearn its value function.

This process has an extra step that value iteration, so it might be a little more confusing, but it isn’t too bad. To illustrate how this works, let’s go back to the 1D world, but instead let’s find the optimal policy using policy iteration instead. First, we start with an initial policy. In practice, this is usually randomized, but for our example let’s suppose we start with:

\begin{equation} \pi(a=\text{left}|s)_0 = 1 \quad \forall s \end{equation}

In english, this means that no matter what state we are in, we choose to move to the left. Lines 4-8 of the policy iteration algorithm simply converges to the value function for this policy, which is:

Unless we are already in state 7, we expect to get a reward of -1 for all states since we’re always moving to the left. This is terrible! But lines 9-10 iterate through each state and asks if there is a better immediate action that can be taken before following this policy. It will find:

- State 2: maximum action = left, $r + \gamma v(2) = -1$
- State 3: maximum action = left, $r + \gamma v(3) = -1$
- State 4: maximum action = left, $r + \gamma v(4) = -1$
- state 5: maximum action = left, $r + \gamma v(6) = -1$
- state 6: maximum action =
**right**, $r + \gamma v(7) = 9$

When the algorithm investigates state 6, it finds that going right is a better action to take than the current policy of going left. So now our policy is:

\begin{equation*} \pi(a=\text{left}|s)_1 = 1, \quad s = 1-5,7 \end{equation*}

\begin{equation*} \pi(a=\text{right}|s)_1 = 1, \quad s = 6 \end{equation*}

The policy is now to move right in state 6, and move left in all other states. Since our policy changed, line 11 says to go back and find the value function of this new policy. Unsurprisingly, this new value function will look like:

Hopefully you see where this is going. Again, information about the goal at state 7 is propagated back to all the other states, the optimal policy is to always move right, and the optimal value function will look like:

# Setting Up OpenAI Gym

Now that we have talked about the basics of RL and two algorithms that we can use, let’s use these techniques on a slightly more complex example.

OpenAI Gym is a pretty cool python package that provides ready-to-use environments to test RL algorithms on. As a python package, it is pretty easy to install:

```
pip install gym
```

They have all sorts of environments to play around in, and I encourage you to see all that it has to offer. But for this post, we’re going to use the frozen lake environment. This is a 2D grid a squares, and the goal is start in the upper left corner and reach the lower right corner while avoid some squares (‘‘holes’’ in the lake). The rest of this post will show code that uses value iteration and policy iteration to find the optimal policy to get to the goal while avoiding the holes.

Here is a link that contains starter code (for anyone who wishes to try this themselves), as well as completed code. The starter code was taken from Stanford’s RL Class.

So, let’s get coding.

# Frozen Lake: Policy Iteration

Inside the starter code folder, take a look at the vi_and_pi.py file; this is where you’ll add the code to make everything run. The main function contains a line that accesses the OpenAI library and loads the frozen lake environment:

```
env = gym.make("Deterministic-4x4-FrozenLake-v0")
```

In this environment, states and actions are denoted using integers. The 4x4 frozen lake defined above references its states as numbers 0 - 15, and all possible actions as 0 - 3 (up, down, left, right).

The env variables contains all the information needed for RL. Specifically, some of its fields include:

- env.P: a nested dictionary that describes the model of the frozen lake, where P[state][action] returns a list of tuples. Each tuple has the same form: probability, next state, reward, terminal. These describe:
- Probability: probability of transition to next state
- Next state: The possible next state described by the tuple
- Reward: The reward gained from this state-action pair
- Terminal: True when a terminal state is reached (i.e. hole or goal), False otherwise

- env.nS: the number of total states
- env.nA: the number of total actions

The file contains two functions called policy_iteration and value_iteration. These functions take in a frozen lake environment and perform policy iteration or value iteration until they converge to the optimal policy/value function, or the maximum number of iterations is reached.

Let us first look at policy iteration. To aid the coding process, the starter code also provides empty functions policy_evaluation and policy_improvement that are to be used when performing policy iteration. The policy_evaluation function returns the value function of the current policy (lines 4-8 of the policy iteration algorithm) and policy_improvement returns the improved policy using this value function (lines 10-12). Using these functions, here’s one way to fill policy_iteration:

```
def policy_iteration(P, nS, nA, gamma=0.9, tol=10e-3):
value_function = np.zeros(nS)
policy = 0*np.ones(nS, dtype=int)
############################
# YOUR IMPLEMENTATION HERE #
flag = True
i = 0
while flag and i < 100:
value_function = policy_evaluation(P, nS, nA, policy, gamma, tol)
new_policy = policy_improvement(P, nS, nA, value_function, policy, gamma)
diff_policy = new_policy-policy
if np.linalg.norm(diff_policy)==0:
flag = False
policy = new_policy
i+=1
if(i==100):
print("Policy iteraction never converged. Exiting code.")
exit()
############################
return value_function, policy
```

The above code first initializes the policy as all zeros for each state (i.e. always move to the left). It then finds the value function for this policy and finds improvement to this policy. If the improved policy is the same as the previous policy, then we have found an optimal policy and we exit the function; else, repeat the process.

Below are the completed policy_evaluation and policy_improvement functions.

```
def policy_evaluation(P, nS, nA, policy, gamma=0.9, tol=1e-3):
value_function = np.zeros(nS)
############################
# YOUR IMPLEMENTATION HERE #
error = 1
i = 0
# While error in value function is greater than 1
while error > tol and i < 100:
new_value_function = np.zeros(nS)
# For each state
for i in range(nS):
# Get policy for that state
a = policy[i]
# With this policy, get next state
# probability, nextState, reward, terminal = P[i][a]
# value_function
# Find all possible transitions, rewards, etc.
transitions = P[i][a]
for transition in transitions:
prob, nextS, reward, term = transition
# Calculated update value function
new_value_function[i] += prob*(reward + gamma*value_function[nextS])
error = np.max(np.abs(new_value_function - value_function)) # Find greatest difference in new and old value function
# print(new_value_function)
# print("error: {}".format(error))
value_function = new_value_function
i+=1
if i >= 100:
print("Policy evaluation never converged. Exiting code.")
exit()
############################
return value_function
```

```
def policy_improvement(P, nS, nA, value_from_policy, policy, gamma=0.9):
new_policy = np.zeros(nS, dtype='int')
############################
# YOUR IMPLEMENTATION HERE #
# For each state
for state in range(nS):
# Get optimal action
# If ties for optimal exist, choose random
Qs = np.zeros(nA)
# For each action
for a in range(nA):
# All possible next states from this state-action pair
transitions = P[state][a]
for transition in transitions:
prob, nextS, reward, term = transition
Qs[a] += prob*(reward + gamma*value_from_policy[nextS])
# For this state
# get maximum Q
max_as = np.where(Qs==Qs.max())
max_as = max_as[0]
# Set new policy to this action that maximizes Q
new_policy[state] = max_as[0]
############################
return new_policy
```

When you run this code, each iteration will produce a policy and value function that converges to the optimum. Below is a visualization of how the value function evolves with each iteration:

Again, information about the goal (lower-right corner) propagates back to all the other states with each iteration.

The optimal policy is one that essentially follows increasing rewards. In the starter code you always begin at the top-left corner of the frozen, so from there the optimal policy corresponds to always moving towards light-colored squares. After policy iteration converges, you should see a sample run of your program that successfully navigates from the start to finish.

# Frozen Lake: Value Iteration

The code for value iteration is similar to policy iteration. First, here is a look inside a completed value_iteration function:

```
def value_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
value_function = np.zeros(nS)
policy = np.zeros(nS, dtype=int)
############################
# YOUR IMPLEMENTATION HERE #
# Value iteration is like policy iteration above, except estimation of the value function is done by maximizing over actions
# After the value function converges, one step is done that find the action that maximizes reward
error = 1
# Iterate value function, find optimal
while error > tol:
new_value_function = np.zeros(nS)
for s in range(nS):
Qs = np.zeros(nA)
for a in range(nA):
transitions = P[s][a]
for transition in transitions:
prob, nextS, reward, term = transition
Qs[a] += prob*(reward + gamma*value_function[nextS])
new_value_function[s] = max(Qs)
diff_vf = new_value_function-value_function
value_function = new_value_function
error = np.linalg.norm(diff_vf)
# Get policy from value function
for s in range(nS):
Qs = np.zeros(nA)
for a in range(nA):
transitions = P[s][a]
for transition in transitions:
prob, nextS, reward, term = transition
Qs[a] += prob*(reward + gamma*value_function[nextS])
max_as = np.where(Qs==Qs.max())
max_as = max_as[0]
policy[s] = max_as[0]
############################
return value_function, policy
```

Value iteration first finds the optimal value function, and then uses this to find optimal policy. Below shows the evolution of value function for this value iteration algorithm, showing similar behavior to policy iteration above:

Notice how this value function is the same as found through policy iteration, showcasing how the optimal value is *unique*.

# Discussion and Conclusion

In this post, we talked about the basics of reinforcement learning, discussed policy iteration and value iteration as fundamental algorithms in reinforcement learning, and showed a concrete example of reinforcement learning using the popular OpenAI python package.

When I look at the above gifs, I am reminded of the firegrass path-planning algorithm. This is a graph-based approach that seeks to find an optimal path from any point on the graph to an end point, all while avoid certain nodes (e.g. “holes” in a frozen lake) on the graph. In grassfire, information about the goal is propagated to adjacent nodes in an iterative fashion, just like RL. Why not use something like grassfire to find the best path through the frozen lake?

One big difference is that RL can handle stochastic environments. Although this blog post has largely focused on deterministic systems, everything we’ve talked about can easily be generalized where uncertainty plays a big part. For example, the start code also allows you to train a policy on a stochastic process. Frozen lakes are slippery, right? In the stochastic model, taking an action up, down, left or right will only result in that action occurring with *some probability*. For example, if you choose to go down, there is a 0.33 probability that you actually go down, but also a 0.33 probability that you move left and a 0.33 probability that you move right. For a 4x4 slippery lake, value iteration finds the following value function:

One thing I wish to discuss is where reinforcement learning goes from here. Grid-worlds are nice for learning, but how is RL applied to real-world examples? The first you might encounter is the fact that we don’t have a good model $p(s’|s,a)$ for real-world systems. Instead of spending time pain-stakingly finding this model for all new systems you encounter, how can a RL algorithm find the optimal policy and value function without a model? Obviously, the above algorithms can no longer be used. Instead, a lot of research has been done on model-free RL.

Another problem that has been researched is the tendency for state-space dimensions to be **huge**. For example, suppose you wish to train an RL algorithm on image input. A small image is 480x640 pixels in size, and each pixel can take up to to 255x255x255 values in color. Thus there are a possible 480x640x255x255x255 = 5,093,798,400,000 states! Tackling these huge state-space dimensions are another research direction, and so far a lot of success has been found with neural networks.