Jekyll feed, generated 2021-07-08T18:58:12-07:00, https://jacobhiggins.github.io/feed.xml. Jacob Higgins, Motion Planning/Control Engineer for Autonomous Systems, jdh4je@virginia.edu. Dropping a pencil on the floor, 2021-06-30, https://jacobhiggins.github.io/posts/2021/06/30/blog-post<h3 id="estimating-a-fundamental-constant">Estimating a fundamental constant</h3> <p>One of the</p>Indirect Optimization Techniques, 2021-04-30, https://jacobhiggins.github.io/posts/2021/04/30/blog-post<h3 id="numerically-solving-control-problems">Numerically Solving Control Problems</h3> <p>Optimal control theory gives a framework for finding control policies that optimize some sort of numerical objective function. This framework consists of rules that define differential equations and boundary conditions that the optimal trajectory must obey. Finding these rules can be challenging for complicated systems, but that is only half the battle – actually finding solutions to these differential equations can be even more challenging. Analytical solutions are usually possible only for basic examples. For real-world problems, numerical solvers must be used to find these control policies.</p> <p>This blog post details my Matlab implementation of several numerical optimization techniques. Most of these techniques involve the same steps:</p> <ol> <li>Make an initial guess of the optimal controls/trajectory</li> <li>Find out how “wrong” our guess is</li> <li>Make corrections to the optimal control/trajectory, making a new guess</li> <li>Repeat steps 2 and 3 until our guess converges to the correct solution</li> </ol> <p>There are several strategies when approaching optimal control problems. In this blog post, I will talk about <em>indirect</em> optimization techniques. This name is meant to distinguish them from <em>direct</em> optimization techniques, but what is the difference?
Although this will probably be a future blog post itself, the quick answer is what choices the engineer makes in terms of decision variables and necessary conditions for optimality. Indirect optimization chooses a more analytical technique, where the optimality conditions are found using calculus of variations. Direct optimization can be thought of as a more brute-force approach, where the optimal trajectory is broken up into discrete steps, and the objective function is “directly” optimized using these discrete controls. Indirect methods are the more historical approach to control problems, since even without a computer, you might still be able to discover some fundamental properties of the solution. Thus, it is still a good exercise to understand these methods.</p> <h2 id="minimizing-cost-over-trajectories">Minimizing Cost Over Trajectories</h2> <p>The foundation for indirect methods is the calculus of variations, where the problem is often to find the function that optimizes a functional (i.e., a function of a function). For example, a common functional is the integral of a function, $J(f) = \int f(t) dt$. Calculus of variations finds the function of $t$ that minimizes this integral $J$, which is a functional of $f$. For example, if $J(x) = \int (x(t))^2 dt$, then the function that minimizes this integral is $x(t)=0$, since any non-zero value for the function would result in a positive value for the integral.</p> <p>The calculus of variations approach to solving this problem analytically is surprisingly close to regular calculus: if trajectory $x(t)$ results in a functional value of $J(x)$, then perturbing this trajectory by a little bit, $x(t)+\delta x(t)$, generally results in a perturbed functional value, $J(x) + \delta J(x)$. The goal is to find a function $x(t)$ such that any small perturbation $\delta x(t)$ results in $\delta J(x) = 0$.
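<p>For the quadratic example above, the perturbation can be written out explicitly:</p>

```latex
% Perturbing J(x) = \int (x(t))^2 dt by x \to x + \delta x:
\delta J = J(x + \delta x) - J(x)
         = \int \left( 2\,x(t)\,\delta x(t) + (\delta x(t))^2 \right) dt
         \approx \int 2\,x(t)\,\delta x(t)\, dt
```

<p>To first order, this vanishes for <em>every</em> admissible perturbation $\delta x(t)$ only if $x(t)=0$, recovering the minimizer stated above.</p>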
This is similar to optimization using calculus, where the goal is to find the value where the slope is equal to zero.</p> <p>Suppose a system is governed by $\dot{x}=a(x(t),u(t))$, where $x(t)$ is the state of the system over time and $u(t)$ is the control variable, i.e., what we are trying to solve for. Assuming initial condition $x(0)=x_0$ and time interval $t\in [0,T]$, a common control cost function to minimize usually has the form of $J(x) = h(x) + \int_0^Tg(x,u)dt$. It is often helpful to define an auxiliary function, $H$, called the Hamiltonian of the system:</p> <p>\begin{equation} H = g(x,u) + p(t)a(x,u) \end{equation}</p> <p>The function $p(t)$ is called a <em>costate</em>, and is directly connected to the state $x(t)$. The reason for its presence in our problem is connected to another common optimization procedure, <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a>. Lagrange multipliers are needed when the optimization must be constrained by a given function (or set of functions). In our case, the constraining function is $\dot{x}=a(x(t),u(t))$, which constrains the velocity of $x(t)$ to be consistent with the current values of $x(t)$ and $u(t)$.</p> <p>Calculus of variations gives the following conditions as necessary for optimal control inputs $u(t)$, assuming $x(T)$ is free to be any value:</p> <p>\begin{equation} \label{eq:xEOM} \dot{x} = \frac{\partial H}{\partial p} = a(x,u) \end{equation}</p> <p>\begin{equation} \label{eq:pEOM} \dot{p} = -\frac{\partial H}{\partial x} = -\frac{\partial a}{\partial x}p - \frac{\partial g}{\partial x} \end{equation}</p> <p>\begin{equation} \label{eq:minU} 0 = \frac{\partial H}{\partial u} \end{equation}</p> <p>\begin{equation} \label{eq:x0} x(0) = x_0 \end{equation}</p> <p>\begin{equation} \label{eq:final} p(T) = \frac{\partial h}{\partial x}(x(T)) \end{equation}</p> <p>Notice how Eq.
\ref{eq:xEOM}-\ref{eq:minU} are all found by taking the partial derivative of the Hamiltonian with respect to the different variables involved, $x$, $p$ and $u$. If the controls are non-differentiable, e.g., if admissible controls are constrained by upper and lower values, then Eq. \ref{eq:minU} becomes $H(x,p,u^*)\le H(x,p,u)$, also known as Pontryagin’s minimum principle. Eq. \ref{eq:x0} is an initial condition, and Eq. \ref{eq:final} is a boundary condition at the end of the trajectory (when $t=T$).</p> <p>Together, these equations provide necessary (but not sufficient) conditions for an optimal trajectory. The problem is to find solutions for $x(t)$, $p(t)$ and $u(t)$ that satisfy all the above conditions simultaneously. For very simple problems, this is tractable, but for more complex problems a numerical approach is needed to find these solutions.</p> <h2 id="numerically-solving-odes">Numerically solving ODE’s</h2> <p>The above necessary conditions mean we need to solve a system of ODE’s. This is made tricky for two reasons:</p> <ol> <li>In general, the equations of motion $\dot{x}=\frac{\partial H}{\partial p}$ and $\dot{p}=-\frac{\partial H}{\partial x}$ are nonlinear functions.</li> <li>Eq. \ref{eq:x0} and \ref{eq:final} present split boundary conditions, meaning we have partial information about the trajectory at $t=0$ and at $t=T$, but not complete information at either boundary.</li> </ol> <p>If the equations of motion were linear, or if the boundary conditions weren’t split, then finding the solutions to this problem would actually be quite easy. Unfortunately, this isn’t the case in general, so more complicated schemes must be developed to find the solution.
These usually involve finding a trajectory where only some of the ODE’s or boundary conditions are satisfied, then iteratively changing the trajectory to find a solution where all conditions are satisfied.</p> <h3 id="approach-1-gradient-descent">Approach 1: Gradient descent</h3> <p>In this approach, we initially find trajectories that satisfy Eq. \ref{eq:xEOM}, \ref{eq:pEOM}, \ref{eq:x0}, and \ref{eq:final}. This leaves \ref{eq:minU}, or $\frac{\partial H}{\partial u}=0$, to be satisfied. It can be shown that if the control $u(t)$ is varied by $\delta u$ with the other equations satisfied, then the variation in the cost $\delta J$ is:</p> <p>\begin{equation} \delta J = \int_{t_0}^{T} \left(\frac{\partial H}{\partial u} \delta u \right) dt \end{equation}</p> <p>If we choose $\delta u = -\tau \frac{\partial H}{\partial u}$, where $\tau$ is some positive constant, then the variation becomes:</p> <p>\begin{equation} \delta J = - \tau \int_{t_0}^{T} \left(\frac{\partial H}{\partial u}\right)^2 dt \end{equation}</p> <p>In other words, this particular choice of $\delta u$ <em>guarantees that the cost decreases over each iteration</em>, provided that $\tau$ is small enough. This is similar to many machine learning algorithms, where some positive definite cost (e.g., the average sum of square errors) is reduced by gradient descent. In this analogy, the constant $\tau$ would be considered the learning rate.</p> <p>The workflow for this algorithm is:</p> <ol> <li>With initial guess $x_0$, $p_0$ and $u_0$, first integrate the state $x(t)$ forward from $x_0$ to $x(T)$.</li> <li>Using the boundary condition of Eq. \ref{eq:final}, determine $p(T)$.</li> <li>Integrate backwards in time from $p(T)$ to $p_0$ (overwriting the initial guess of $p_0$) to find the trajectory $p(t)$. Now we have trajectories for $x(t)$ and $p(t)$ that satisfy \ref{eq:xEOM}, \ref{eq:pEOM}, \ref{eq:x0}, and \ref{eq:final}, but not Eq.
\ref{eq:minU}.</li> <li>Evaluate $\frac{\partial H}{\partial u}$ at each discrete point in time, indexed by $k$. Update the applied control at this point by $u(k) := u(k) - \tau\frac{\partial H}{\partial u}$.</li> <li>If $\int_{t_0}^{T} \left(\frac{\partial H}{\partial u}\right)^2 dt$ is below some positive threshold, then terminate the algorithm. If not, repeat steps 1 through 4 with the updated control.</li> </ol> <p>This algorithm was tested on a simple double-integrator system with $\ddot{x}=u$. The system started at $x_0=0.0$ and was tasked to track a constant reference value $x_r$. In order to track the reference, the cost function was defined with $g(x,u) = Q_0(x-x_r)^2 + Q_1(\dot{x})^2 + Ru^2$ and $h(x)=Q_h(x-x_r)^2$. The constants $Q_0$, $Q_1$, $Q_h$, and $R$ were tuned to get reasonable results.</p> <p>Below, you can see how the gradient descent iterations converge to the optimal solution:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2021/IndirectOptimization/graddescent.gif" /> </p> <h3 id="approach-2-variation-of-extremals">Approach 2: Variation of Extremals</h3> <p>This approach first assumes that Eq. \ref{eq:minU} can be rearranged to solve for $u^*$ in terms of $x$ and $p$. This means that the dynamics of $x$ and $p$ can be expressed entirely in terms of these two variables:</p> <p>\begin{equation} \label{eq:xEOM_implicit} \dot{x} = \frac{\partial H}{\partial p} = a(x,p) \end{equation}</p> <p>\begin{equation} \label{eq:pEOM_implicit} \dot{p} = -\frac{\partial H}{\partial x} = d(x,p) \end{equation}</p> <p>Thus, the problem now is to find $x(t)$ and $p(t)$ that satisfy these two differential equations, as well as the two end conditions Eq. \ref{eq:x0} and Eq. \ref{eq:final}. The variation of extremals method then seeks to make a guess at $p_0$ such that the boundary condition Eq. \ref{eq:final} is satisfied. At first, this initial guess will almost certainly be incorrect. Suppose that after integrating Eq. \ref{eq:xEOM_implicit} and Eq.
\ref{eq:pEOM_implicit} to find $x(t)$ and $p(t)$ with $x(0)=x_0$ and $p(0)=p_0$, we find that $p(T)\neq \frac{\partial h}{\partial x}_{t=T}$ (i.e., Eq. \ref{eq:final} is not satisfied). The variation of extremals asks: how should we change the value of $p_0$ such that $p(T) = \frac{\partial h}{\partial x}_{t=T}$? There is a complicated, nonlinear relationship between $p_0$ and $p(T)$, so finding an exact answer to this question is almost impossible. But we can ask a slightly more general question: if we increase/decrease $p_0$ by a small amount, will the difference $\left| p(T) - \frac{\partial h}{\partial x}_{t=T} \right|$ become bigger or smaller?</p> <p>In a certain sense, there is a functional relationship between $p(T)$ and $p_0$, where each value of $p_0$ must map to a single value of $p(T)$. Suppose this function is denoted by $F$, so that $p(T) = F(p_0)$. We want an approach where iterated guesses for $p_0$ converge to a value such that $F(p_0) = \frac{\partial h}{\partial x}_{t=T}$, or, put another way, $F(p_0) - \frac{\partial h}{\partial x}_{t=T} = 0$.</p> <p>There is such an approach, called <a href="https://en.wikipedia.org/wiki/Newton%27s_method">Newton’s Method</a>, that does exactly this. Given an initial guess $p_0^0$ of a function’s root, the function evaluated at this point, $F(p_0^0)$, and the derivative evaluated at the same point, $F'(p_0^0)$, Newton’s method gives you a better approximation of the function’s root, $p_0^1$. The relationship is straightforward:</p> <p>\begin{equation} \label{eq:NewtonMethod} p_0^1 = p_0^0 - \frac{F(p_0^0)}{F'(p_0^0)} \end{equation}</p> <p>Eq. \ref{eq:NewtonMethod} can be repeated $k$ times to find $p_0^k$.
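<p>As an aside, the Newton update itself is only a few lines of code. The implementation in this post is in Matlab, but here is a generic scalar sketch in Python; the function used in the example is just a stand-in, not the costate map discussed above:</p>

```python
def newton(F, dF, p0, tol=1e-10, max_iter=50):
    """Repeatedly apply p <- p - F(p)/F'(p) to approximate a root of F."""
    p = p0
    for _ in range(max_iter):
        p = p - F(p) / dF(p)
        if abs(F(p)) < tol:
            break
    return p

# Stand-in example: the root of F(p) = p^2 - 2 is sqrt(2).
root = newton(lambda p: p**2 - 2.0, lambda p: 2.0 * p, p0=1.0)
```

<p>In the variation of extremals, $F(p_0)$ would be the boundary-condition mismatch obtained by integrating the equations of motion forward from the guessed initial costate, and $F'(p_0)$ its sensitivity to that guess.</p>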
The larger $k$ is, the better the approximation.</p> <p>Finding both $F(p_0)$ and $F'(p_0)$ is fairly simple in theory: use $p_0$ as the initial costate value, integrate the equations of motion forward in time to find $p(T)$, and calculate $p(T) - \frac{\partial h}{\partial x}_{t=T}$. With some overloading of notation, let’s define this as $F(p_0)$. Finding $F'(p_0)$ would then entail using $p_0 \pm \epsilon$ as the initial costate values and integrating forward to get $F(p_0\pm\epsilon)$. These two values can be used to estimate the derivative: $F'(p_0)\approx \left( F(p_0+\epsilon)-F(p_0-\epsilon) \right)/(2\epsilon)$.</p> <p>It turns out the above procedure is okay if $p_0$ has small dimensionality, but very inefficient if $p_0$ has larger dimensionality. Think about it: if the costate has a dimension of 5, then we must figure out how each of the 5 initial values in $p_0$ affects the 5 final values in $p(T)$, for a total of 25 derivatives that must be calculated each iteration. The naive approach for approximating $F'(p_0)$ will be very slow. Luckily, there is a more efficient way of computing these derivatives all at once, instead of one-by-one. I will not detail the approach here, but it can be found in Chapter 6.3 of <a href="https://www.amazon.com/Optimal-Control-Theory-Introduction-Engineering/dp/0486434842">Kirk’s Optimal Control Theory</a> (a free pdf of the book is also one google search away).</p> <p>Here is the workflow for this numerical optimal control algorithm:</p> <ol> <li>Use Eq. \ref{eq:minU} to find the optimal control $u(x,p)$, and plug it into the equations of motion to find Eq. \ref{eq:xEOM_implicit} and Eq. \ref{eq:pEOM_implicit}.</li> <li>Initialize $x_0$ and $p_0$.</li> <li>Integrate forward in time to find the final value of the costate, $p(T)$. Also find how the value of $p(T)$ changes due to small changes in the initial value of the costate, $p_0$.
This can be used to find the derivative of $p(T)$ with respect to $p_0$, needed for Newton’s method.</li> <li>Use Eq. \ref{eq:NewtonMethod} to update $p_0$ so that $p(T) - \frac{\partial h}{\partial x}_{t=T} \rightarrow 0$ after each iteration.</li> <li>If $p(T) - \frac{\partial h}{\partial x}_{t=T}$ is sufficiently close to zero, terminate the algorithm. Otherwise, return to step 3.</li> </ol> <p>Below, I apply the variation of extremals method to the same double-integrator as before, with the same initialization for both state and costate trajectory.</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2021/IndirectOptimization/extremals.gif" /> </p> <h3 id="about-the-two-approaches">About the Two Approaches</h3> <p>Some small discussion is warranted about the pros and cons of the two approaches detailed above.</p> <p>The method of gradient descent is guaranteed to work no matter how the trajectories are initialized. This makes it quite robust to bad initial guesses, which is helpful when the control system is complex enough that a good initial guess is impossible without a lot of work beforehand. All you need to do is make sure the learning rate $\tau$ is small enough so the gradient steps don’t “jump” over the minimum. The downside, however, is that convergence to the absolute minimum can be very slow. The gradient is typically steepest when you first start the optimization process, but after each iteration, the gradient reduces in magnitude. Another way to put it is diminishing returns: bringing the cost down from 10 to 1 might take as much work as bringing it down from 100 to 10. This kind of problem is also present in machine learning, where the regression variables or neural network weights are changed using gradient descent.
Various methods, such as <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam">Adam optimization</a>, are used to combat the slow nature of gradient descent as it converges to the optimum.</p> <p>On the other hand, the variation of extremals can converge quite quickly to the optimum, as it is capable of taking quite large steps without missing the optimum, unlike gradient descent. The only issue is that convergence is highly dependent on the initial guess for $p_0$. If this guess is off by some amount, then the variation of extremals will likely <em>diverge</em>. What makes it even more tricky is that it is difficult to tell how “close” you need to be in order to get fast convergence. Indeed, in the above gif where I’m not remotely close to the optimal trajectory initially, convergence is achieved because Newton’s method (and hence the variation of extremals) has the nice property that it always converges for linear systems, e.g., the double integrator.</p> <p>When solving problems numerically, you can also mix-and-match approaches in an attempt to leverage the benefits of both. For example, you could use gradient descent initially to find a good approximate solution, then use variation of extremals to converge quickly on the optimal trajectory.</p> <h2 id="practical-example-cart-pole">Practical Example: Cart-pole</h2> <p>I want to end this blog post with an application of these numerical techniques to a well known system: the cart-pole. This system is well known in control because it is relatively simple to understand qualitatively, but exhibits highly nonlinear behavior in the equations of motion. The goal for this application is to use numerical techniques to solve for a control that will push the cart-pole from its stable equilibrium (the pole pointing down) to its unstable equilibrium (the pole pointing up).</p> <p>For this approach, I use gradient descent to initially find the optimal controls.
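<p>As a reminder of what that gradient-descent stage looks like in code, here is a rough, self-contained Python sketch of the earlier five-step workflow applied to the double-integrator tracking example. The post’s own implementation is in Matlab; the horizon, weights, and learning rate below are illustrative placeholders, not the tuned values:</p>

```python
# Double integrator x'' = u, i.e. x' = v, v' = u, tracking reference xr.
# Running cost g = Q0*(x - xr)^2 + Q1*v^2 + R*u^2, terminal h = Qh*(x(T) - xr)^2.
# Hamiltonian H = g + p1*v + p2*u, so dH/du = 2*R*u + p2.
N, dt = 200, 0.02                      # horizon T = 4 s, Euler discretization
Q0, Q1, R, Qh = 1.0, 0.1, 0.01, 5.0    # illustrative weights
xr, tau = 1.0, 0.005                   # reference and learning rate

u = [0.0] * N                          # step 1: initial guess for the control
costs = []
for _ in range(1000):
    # integrate the state forward from x(0) = v(0) = 0
    x, v = [0.0] * (N + 1), [0.0] * (N + 1)
    for k in range(N):
        x[k + 1] = x[k] + v[k] * dt
        v[k + 1] = v[k] + u[k] * dt
    costs.append(sum((Q0 * (x[k] - xr) ** 2 + Q1 * v[k] ** 2
                      + R * u[k] ** 2) * dt for k in range(N))
                 + Qh * (x[N] - xr) ** 2)
    # step 2: terminal costate from p(T) = dh/dx
    p1, p2 = [0.0] * (N + 1), [0.0] * (N + 1)
    p1[N] = 2 * Qh * (x[N] - xr)
    # step 3: integrate the costate backward in time,
    # p1' = -dH/dx = -2*Q0*(x - xr), p2' = -dH/dv = -(2*Q1*v + p1)
    for k in range(N, 0, -1):
        p1[k - 1] = p1[k] + 2 * Q0 * (x[k] - xr) * dt
        p2[k - 1] = p2[k] + (2 * Q1 * v[k] + p1[k]) * dt
    # step 4: gradient step on the control
    dHdu = [2 * R * u[k] + p2[k] for k in range(N)]
    u = [u[k] - tau * dHdu[k] for k in range(N)]
    # step 5: terminate once the gradient norm is small
    if sum(g * g for g in dHdu) * dt < 1e-8:
        break
```

<p>With a small enough $\tau$, each pass through this loop is guaranteed to lower the cost, which is the descent property discussed earlier.</p>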
Then, I use variation of extremals to help achieve convergence faster. I actually derived the state equations of motion in another <a href="https://jacobhiggins.github.io/posts/2020/08/blog-post-1/">blog post</a>, if you want to check that out.</p> <p>Below is a gif of the cart-pole in action:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2021/IndirectOptimization/inverted_pend.gif" /> </p> <p>As you can see, the cart-pole doesn’t quite settle onto the unstable equilibrium, but it does a pretty good job of getting the pole up there. Since this is my first time coding something like this, I consider it more of a success than a failure.</p> <p>Some things I discovered while applying these numerical techniques to solving controls for the cart-pole system:</p> <ul> <li>Tuning the parameters proved to be quite tricky. For example, my cost function had the square error between the current angle of the cart-pole and the desired angle (which was straight up, i.e., $\theta=\pi/2$). If I found that the trajectory had trouble getting exactly to this reference value for the angle, one sensible thing to do is increase the weight inside the cost function associated with the angular reference. When I did this, though, I sometimes found that suddenly gradient descent wouldn’t converge! I eventually discovered that when I increased the weight inside the cost function, I also increased the magnitude of the gradient of that cost function. To compensate, then, I decreased the learning rate $\tau$, but this would lead to slower convergence.</li> <li>It seemed like the OCP solver would use the full amount of time I gave it in order to fully swing up. For instance, I originally had the OCP solver find trajectories over 3 seconds. I found that the pole <em>almost</em> made it to $\pi/2$, but not quite. So, one thing I tried was to increase the time of the trajectories to 3.5 seconds in length; perhaps the cart-pole simply needed more time.
Increasing the time, however, never seemed to really help. No matter the length of time I set the trajectories to last, the pole would only ever make it close to $\pi/2$ at the very end of the trajectory, whether that be 3 seconds, 4 seconds, or 5 seconds. You would think that it would try to reach $\pi/2$ at the earliest time in order to minimize cost. Perhaps gradient descent converged to a local minimum, rather than the global minimum. Or perhaps by increasing the time, the optimal solution was to have a smaller control effort over a larger time horizon.</li> </ul> <h2 id="conclusions">Conclusions</h2> <p>While it is a little disappointing that the cart-pole couldn’t stabilize itself upright, I’m still pretty happy with the overall results. One thing that happens in real-life demos of the cart-pole is that while optimal control is used to “swing up” the pole, a PID controller is used when the system is close to its unstable equilibrium. Since the PID is designed to operate around this unstable equilibrium, this two-controller approach works pretty well overall. That is an obvious next step for this project, if I wish to return to it in the future.</p> <p>Overall, I had a lot of fun learning about these indirect optimization techniques, and especially seeing them actually work on the cart-pole system.</p>EKF SLAM Demo, 2021-02-20, https://jacobhiggins.github.io/posts/2021/02/20/blog-post<h3 id="chicken-or-the-egg">Chicken or the egg?</h3> <p>In robotics, it is common to use a map of the environment as a tool for navigation and motion planning. Specifically, maps are really good at telling you where you are in the environment: this is called localization. The ability to localize, though, is completely dependent on the quality of the map. Humans could construct this map, but you can imagine how much work that would require to measure out every detail.
If the map is off by a little bit, how long will it take to correct that error and improve precision? What if something changes and you need to update the map? Since robots will use the map anyway, there is this idea of using the sensing modality of the robot to build the map: this is called mapping.</p> <p>One issue that you may see is that in a realistic situation, there seems to be a conflicting workflow: in order to create a map, the robot needs a precise estimate of its current location (how can you tell the size of a room if you aren’t confident in how far you walked from one end to the other?). But in order to get a precise estimate of the current location, you must have a map to help you localize. In this sense, <em>S</em>imultaneous <em>L</em>ocalization <em>a</em>nd <em>M</em>apping (SLAM) is often described as a chicken-and-the-egg problem, where the solution of one process requires the end result of the other. How do you solve such a problem?</p> <p>The answer is not too complicated: you must solve both problems at the same time – hence the word “simultaneous”. The idea is that separately, mapping and localization are pretty much the same process of estimating positions in the presence of errors. In localization, you estimate your current position by knowing the exact position of landmarks on the map, and when mapping, you estimate the location of landmarks given a precise estimate of your current location. SLAM leans into the mathy jungle of Bayesian estimation by recognizing that since the estimation process is the same, you can effectively lump all the positions that you want to estimate (robot and landmarks) into the same vector. By iteratively finding estimates for this vector, you solve both localization and mapping at the same time.</p> <p>This blog post will focus on the most conceptually simple type of SLAM, called EKF SLAM.
If you are not aware, the <a href="https://en.wikipedia.org/wiki/Extended_Kalman_filter">EKF (Extended Kalman Filter)</a> is a way of estimating the value of any variable that (1) follows a differential equation of motion through time, allowing you to predict the current value from previous values, and (2) is periodically observed, allowing you to correct your prediction with measurements. This post will not be an explanation of how EKF SLAM works, per se, as there are many works that already do this quite well (<a href="https://www.youtube.com/watch?v=X30sEgIws0g">here</a> is a video by Cyrill Stachniss, whose YouTube channel is a rich resource for any aspiring roboticist). Instead, I’ll explore some questions that are often left unexplored when EKF SLAM is talked about, as well as show you my personal implementation of the 2D EKF SLAM for robots with an odometry motion model.</p> <h3 id="the-1d-slam-problem">The 1D SLAM Problem</h3> <p>First, let’s look at the simple 1D case, i.e., robot movement on a straight line. It would look funny if the landmarks were on the robot’s line of motion, so without loss of generality I’ll put the landmarks and the robot at different y positions, and allow the robot to change its x position by moving left and right. Below is a picture of the setup with two landmarks:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2021/EKF_SLAM/1D_setup.png" /> </p> <p>Since the robot can only move left and right, the motion model of the robot is quite simple. Here are the discrete motion dynamics that describe how the x position of the robot, $x_r$, changes under commanded velocity $u$ over a time step of $dt$:</p> <p>\begin{equation}\label{eq:motion_model} x_r^+ = x_r + u\times dt \end{equation}</p> <p>The robot can only observe a landmark if it is within some sensing radius. Because this is a 1D simulation, I am only concerned with the x position of landmark $i$, denoted as $x_{li}$.
When a landmark is observed, the robot observes the relative distance between itself and the landmark:</p> <p>\begin{equation}\label{eq:meas_model} z_{li} = x_{li} - x_r \end{equation}</p> <p>Both the motion model \ref{eq:motion_model} and the measurement model \ref{eq:meas_model} are linear, so the EKF performs the same as the regular KF in this situation. This won’t be the case for the 2D EKF SLAM in the next section, but for now the linearity keeps the problem simple enough to see the concepts at play without too much complication.</p> <p>In this simplified SLAM problem, we wish to estimate the x positions of the robot and the two landmarks, so three variables in total. In the EKF algorithm, our belief takes the form of a multivariate Gaussian. This Gaussian is defined by its mean value $\mathbf{\mu}$, where we actually estimate the x positions to be, and the covariance matrix $\Sigma$, which defines the errors in this belief (along the diagonal) as well as the correlations between the x positions (on the off-diagonal elements). In general, these are defined as:</p> <p>\begin{equation} \mathbf{\mu} = \begin{pmatrix} \mu_{rx} \\ \mu_{1x} \\ \mu_{2x} \end{pmatrix} \end{equation}</p> <p>\begin{equation} \mathbf{\Sigma} = \begin{pmatrix} \Sigma_{rr} &amp; \Sigma_{r1} &amp; \Sigma_{r2} \\ \Sigma_{1r} &amp; \Sigma_{11} &amp; \Sigma_{12} \\ \Sigma_{2r} &amp; \Sigma_{21} &amp; \Sigma_{22} \end{pmatrix} \end{equation}</p> <p>To be clear, $\Sigma_{ri}$ is the covariance between the robot’s x position and the $i$th landmark’s x position.</p> <p>Initializing the EKF SLAM process is pretty simple. First, we provide an estimate of the locations of the robot and landmarks, $\mathbf{\mu}$. As is the case with regular EKF estimation, these starting estimates are not so critical. I chose $\mathbf{\mu} = \begin{pmatrix} 0 &amp; 0 &amp; 0 \end{pmatrix}^T$.
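<p>The 1D predict/correct cycle is small enough to sketch in full. Below is a plain-Python version; the noise variances q and r are made-up placeholders, and my actual implementation may differ in its details:</p>

```python
dt = 0.1
q, r = 0.01, 0.04   # made-up process / measurement noise variances

# Belief over [x_r, x_l1, x_l2]: mean vector and covariance matrix.
mu = [0.0, 0.0, 0.0]
Sigma = [[0.0, 0.0, 0.0],
         [0.0, 100.0, 0.0],
         [0.0, 0.0, 100.0]]   # robot pinned to the origin, landmarks unknown

def predict(mu, Sigma, u):
    # Motion model x_r+ = x_r + u*dt; its Jacobian is the identity, so the
    # covariance update just adds process noise to the robot's entry.
    mu = [mu[0] + u * dt, mu[1], mu[2]]
    Sigma = [row[:] for row in Sigma]
    Sigma[0][0] += q
    return mu, Sigma

def correct(mu, Sigma, z, i):
    # Measurement z = x_li - x_r for landmark i (i = 1 or 2); H is its Jacobian.
    H = [-1.0, 0.0, 0.0]
    H[i] = 1.0
    SHt = [sum(Sigma[a][b] * H[b] for b in range(3)) for a in range(3)]
    S = sum(H[a] * SHt[a] for a in range(3)) + r      # innovation variance
    K = [val / S for val in SHt]                      # Kalman gain
    y = z - (mu[i] - mu[0])                           # innovation
    mu = [mu[a] + K[a] * y for a in range(3)]
    Sigma = [[Sigma[a][b] - K[a] * SHt[b] for b in range(3)] for a in range(3)]
    return mu, Sigma

# Drive one step, then observe landmark 1 (true position 2.0, say).
mu, Sigma = predict(mu, Sigma, u=1.0)
mu, Sigma = correct(mu, Sigma, z=1.9, i=1)
```

<p>Each call to predict inflates only the robot’s variance, while a call to correct pulls the observed landmark’s variance down from 100 to roughly the robot’s current uncertainty plus the measurement noise, matching the behavior in the gifs.</p>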
There is a small but important point when choosing $\Sigma$: since SLAM is concerned with building a map of the environment, the origin can be defined however we choose. We can take advantage of this choice by saying that the origin is <em>precisely where the robot starts</em>. In terms of the multivariate Gaussian, $\Sigma_{rr}=0$. The landmarks, on the other hand, are initialized with a large error, say $\Sigma_{ii}=100$.</p> <p>Below is a gif showing how the estimates and uncertainties change over time in the simulation. A line is drawn between the robot and any landmark that it is close enough to observe.</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2021/EKF_SLAM/1D_EKF.gif" /> </p> <p>Here are some observations that one can make:</p> <ul> <li>When the robot moves without sensing a landmark, its uncertainty grows over time.</li> <li>Unless observed by the robot, the uncertainty of the landmarks <em>does not</em> change over time.</li> <li>When the robot sees a landmark for the first time, the uncertainty of the landmark becomes the uncertainty of the robot at that moment. At the same time, the uncertainty of the robot doesn’t really change.</li> </ul> <p>There are two “magic” things that happen in EKF SLAM. The first happens when the robot sees a landmark it has already seen before. Since the uncertainty of the landmarks doesn’t grow over time, sensing the landmark for a second time helps reduce the robot’s own uncertainty. It is unlikely that this exchange will reduce the uncertainty of the landmark, since the robot’s uncertainty only grows as it moves without observing any landmarks.</p> <p>The second bit of EKF SLAM magic is even more interesting. In the above gif, you may notice that the uncertainty of landmark 2 is fairly large when first observed. This makes sense, since the robot itself has a large uncertainty. Continue watching, and you’ll see that the uncertainty of landmark 2 decreases when landmark 1 is observed for a second time!
This is strange at first: why should the uncertainty of one landmark be reduced when another landmark is observed? Further thought reveals the intuition: seeing a landmark for a second time helps contextualize the locations of the other landmarks you observed in the past. Imagine walking around an unfamiliar neighborhood; seeing a landmark for a second time not only helps you understand where you are, but also helps place the landmarks that you recently walked past. Since you typically see landmarks again only after travelling around a complete loop, this process is called <em>closing the loop</em>.</p> <p>In the parlance of Gaussian distributions, we say that the locations of the landmarks are <em>correlated</em>. These correlations are the off-diagonal elements of $\Sigma$ (technically the off-diagonal elements are called covariances, but they are strongly related to correlation). Initially, the locations of landmarks are uncorrelated in the EKF, i.e., $\Sigma$ is a diagonal matrix. When landmarks are first observed, the off-diagonal entries of $\Sigma$ are filled in by the EKF algorithm. It’s these off-diagonal correlations that reduce the uncertainty of landmarks when other landmarks are observed.</p> <p>The correlations between robot location and landmark location aren’t shown in the above figure, since the correlation isn’t as straightforward to visualize. To help visualize how these correlations are built up, below is the simulation shown alongside the covariance matrix $\Sigma$ over time.</p> <p align="center"> <img width="750" height="400" src="/images/blog_pics/2021/EKF_SLAM/1D_EKF_wSigma.gif" /> </p> <h3 id="the-2d-slam-problem">The 2D SLAM Problem</h3> <p>A more practical example is when the robot can move in 2 dimensions. For this simulation, I’ll use the <a href="http://www.cs.columbia.edu/~allen/F17/NOTES/icckinematics.pdf">differential drive kinematic model</a>.
Basically, the robot has three state variables $(x_r,y_r,\theta_r)$ and two inputs, linear velocity $v$ and rotational velocity $\omega$. The continuous time equations of motion are:</p> <p>\begin{equation*} \dot{x_r} = v\cos(\theta_r) \end{equation*} \begin{equation*} \dot{y_r} = v\sin(\theta_r) \end{equation*} \begin{equation*} \dot{\theta_r} = \omega \end{equation*}</p> <p>This motion model is commonly used to teach EKF SLAM because (1) it is a simple model that can be applied to a lot of practical situations and (2) it is nonlinear, meaning we need to use the “Extended” in EKF. In order to set up the EKF, matrices need to be constructed that represent the locally linear kinematics around any generic operating point. Thrun’s chapter on EKF SLAM has derived these matrices for us, so we’ll use those. When the robot observes the landmarks in the 2D case, it observes the distance to the landmark $r_i$ and the angle of the landmark relative to its heading $\theta_i$. These are also nonlinear:</p> <p>\begin{equation*} r_i = \sqrt{(x_r-x_i)^2 + (y_r-y_i)^2} \end{equation*} \begin{equation*} \theta_i = \arctan{((y_i-y_r)/(x_i-x_r))} - \theta_r \end{equation*}</p> <p>As is the case with the nonlinear motion model, EKF SLAM requires matrices that linearize around some generic operating point. These are also derived in Thrun’s book.</p> <p>A run of the 2D simulation is shown below. The same observations that were made in the 1D case can be made in the 2D case as well.</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2021/EKF_SLAM/ekf_slam_2D.gif" /> </p> <p>Below is the same run with the covariance matrix $\Sigma$ alongside for comparison with the 1D case. When the loop is closed, you can see a clear checkerboard pattern emerge in $\Sigma$. This is because the x coordinates of the landmarks are all correlated with each other, and likewise for the y coordinates, but correlations between x and y don’t exist.
This makes sense, since these are orthogonal directions and knowing the x coordinate of a landmark tells me nothing about the y coordinate.</p> <p>Note: when looking at the colors in $\Sigma$, also look at the color bar on the right to get a sense of scale. Until landmark 4 is observed, the largest value in $\Sigma$ is 100, so the nonzero terms that seem to “appear” after landmark 4 is observed were always there; the changing colors are just the graph adjusting to a much smaller scale. The same goes for colors that seem to “fade”, where the scale might be growing to accommodate the growing uncertainty in the robot x/y position.</p> <p align="center"> <img width="750" height="400" src="/images/blog_pics/2021/EKF_SLAM/ekf_slam_2D_wSigma.gif" /> </p> <h3 id="conclusions">Conclusions</h3> <p>This blog post is intended to show EKF SLAM in a little more detail than standard classes typically do, giving the big ideas a chance to breathe. In a future blog post, I hope to explore what it means when we close the loop. For anyone that wishes to understand the material better, I suggest trying to implement your own EKF SLAM algorithm. There are several practical implementation points that I didn’t go over here, and they are best learned by struggling through this stuff yourself. But seeing SLAM work in the end is always worth the pain!</p>Jacob Higginsjdh4je@virginia.eduChicken or the egg?Sensor Fusion Using a Kalman Filter2020-12-31T00:00:00-08:002020-12-31T00:00:00-08:00https://jacobhiggins.github.io/posts/2020/12/31/blog-post<h3 id="how-do-you-fuse-measurements-together">How do you fuse measurements together?</h3> <p>The first time you learn about probabilistic robotics, you will probably hear about the <a href="https://en.wikipedia.org/wiki/Kalman_filter">Kalman filter</a>. The Kalman filter is a way of estimating the state of a system that has both process noise and measurement noise. Founded in probability theory, it gives an optimal estimate based on the relative size of the process and measurement noise. If the process noise is very large and the measurement noise is very small, then the Kalman filter returns an estimated state that is closer to the measurement, and vice versa.
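</p>

<p>This trade-off is easy to see in one dimension. Here is a minimal sketch (toy numbers, with a trivial “the state stays put” process model of my own choosing) of a single predict/update cycle:</p>

```python
# One step of a scalar Kalman filter, to see the noise trade-off.
# Q is the process noise variance, R the measurement noise variance.
def kalman_step(x_est, P, y, Q, R):
    # Prediction: the state estimate carries over; uncertainty grows by Q.
    P = P + Q
    # Measurement update: K -> 1 when R << P (trust y), K -> 0 when R >> P.
    K = P / (P + R)
    return x_est + K * (y - x_est), (1.0 - K) * P

# Prior estimate 0.0 with variance 0.1, then a measurement y = 1.0:
x_trust_meas, _ = kalman_step(0.0, 0.1, 1.0, Q=10.0, R=0.01)   # big Q, small R
x_trust_model, _ = kalman_step(0.0, 0.1, 1.0, Q=0.01, R=10.0)  # small Q, big R
```

<p>The first estimate lands almost on the measurement; the second barely moves off the prediction.</p> <p>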
For most students that first encounter the Kalman filter, you’re told the intuition, shown some complicated multivariate Gaussian math, and then asked to use it in a homework exercise.</p> <p>Most introductory examples of the Kalman filter have only one measurement to use. What if there is more than one, though? How do you handle two different measurements of the same exact thing with a Kalman filter? This is a form of sensor fusion, and when I first learned the Kalman filter I had only a shaky understanding of the solution. The purpose of this blog post is to show how the Kalman filter can perform sensor fusion, and hopefully clarify the machinery of the Kalman filter in the process.</p> <h3 id="the-two-steps-of-the-kalman-filter">The two steps of the Kalman filter</h3> <p>Let us define $p(\mathbf{x})$ as the belief probability distribution of the state space vector $\mathbf{x}$. A Kalman filter has two important steps when providing an estimate of the value of $\mathbf{x}$:</p> <ul> <li>Prediction update step (use input $\mathbf{u}$ to update $p(\mathbf{x})$)</li> <li>Measurement update step (use measurement $\mathbf{y}$ to update $p(\mathbf{x})$)</li> </ul> <p>Since the Kalman filter assumes multivariate Gaussian probability distributions, only two quantities are recorded at each step $k$ of the Kalman filter: an estimate vector $\hat{\mathbf{x}}_k$ (corresponding to the mean of the Gaussian), and a covariance matrix $P_k$ (corresponding to the confidence of the estimate). Together, these quantities fully define the probability distribution $p(\mathbf{x})$, so when the Kalman filter updates $p(\mathbf{x})$ in the prediction and the measurement step, it really just updates these two variables.</p> <p>What if we have two separate measurements $\mathbf{y}^a$ and $\mathbf{y}^b$, each modeled with its own Gaussian noise?
Cutting to the punchline, you perform <em>two</em> measurement steps, one with each measurement:</p> <ul> <li>Prediction update step (use input $\mathbf{u}$ to update $p(\mathbf{x})$)</li> <li>First measurement update step (use measurement $\mathbf{y}^a$ to update $p(\mathbf{x})$)</li> <li>Second measurement update step (use measurement $\mathbf{y}^b$ to update $p(\mathbf{x})$)</li> </ul> <p>This seems like a natural thing to do given two different measurements. Let’s see why this is so.</p> <h3 id="the-bayes-update">The Bayes Update</h3> <p>The Kalman filter is a specific type of filter called a Bayes filter. The Bayes filter also has two steps: one prediction step and one measurement step. In order to change the Kalman filter to incorporate more than one measurement step, we need to understand what each step means in terms of Bayesian estimation.</p> <p>Define the belief distribution at iteration $k-1$ as $p_{k-1|k-1}(\mathbf{x})$. The $k-1|k-1$ part can be read as the probability distribution at time $k-1$, given information up to time $k-1$. The prediction step requires some process model that describes the probability of reaching state $\mathbf{x}^+$ given current state $\mathbf{x}$ and input $\mathbf{u}_{k-1}$:</p> <p>\begin{equation*} p(\mathbf{x}^+|\mathbf{x},\mathbf{u}_{k-1}) \end{equation*}</p> <p>The prediction update is thus described in the Bayes filter as:</p> <p>\begin{equation} p_{k|k-1}(\mathbf{x}) = \sum_{\mathbf{x}'} p(\mathbf{x}|\mathbf{x}',\mathbf{u}_{k-1}) p_{k-1|k-1}(\mathbf{x}') \end{equation}</p> <p>At time step $k$, we then receive two measurements $\mathbf{y}^a_k$ and $\mathbf{y}^b_k$. The measurement update of the Bayes filter is then used to find the probability distribution of the state given these two measurements, or $p_{k|k}(\mathbf{x})=p(\mathbf{x}|\mathbf{y}^a_k,\mathbf{y}^b_k)$.
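</p>

<p>For a scalar state with Gaussian beliefs, the punchline above — one prediction step followed by two measurement updates — can be sketched as follows (the numbers are toy values of my own):</p>

```python
# Scalar Kalman/Bayes measurement updates with two sensors, applied in sequence.
def measurement_update(mu, P, y, R):
    K = P / (P + R)                      # Kalman gain
    return mu + K * (y - mu), (1.0 - K) * P

mu0, P0 = 0.0, 4.0   # belief after the prediction step (toy values)
ya, Ra = 1.0, 1.0    # sensor a: measurement and noise variance
yb, Rb = 2.0, 0.5    # sensor b: measurement and noise variance

# Fuse sensor a, then sensor b...
mu, P = measurement_update(mu0, P0, ya, Ra)
mu_ab, P_ab = measurement_update(mu, P, yb, Rb)

# ...and sensor b, then sensor a:
mu, P = measurement_update(mu0, P0, yb, Rb)
mu_ba, P_ba = measurement_update(mu, P, ya, Ra)
```

<p>Both orders produce the same fused mean and variance.</p> <p>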
In the case of a single measurement, Bayes’ rule is used to “switch” the random variable and the conditioned variable:</p> <p>\begin{equation*} p(\mathbf{x}|\mathbf{y}) = \eta p(\mathbf{y}|\mathbf{x})p_{k|k-1}(\mathbf{x}) \end{equation*}</p> <p>The $\eta$ term is a normalization factor, and we are usually unconcerned with it in Bayes filters. The measurement model $p(\mathbf{y}|\mathbf{x})$ includes the Gaussian measurement noise in the Kalman filter, and the probability distribution $p_{k|k-1}(\mathbf{x})$ is found in the prediction update.</p> <p>With two state measurements, this update is only slightly altered. First, perform Bayes’ rule between the state variable $\mathbf{x}$ and a single measurement, say $\mathbf{y}^b$:</p> <p>\begin{equation*} p(\mathbf{x}|\mathbf{y}^a_k,\mathbf{y}^b_k) = \eta^b p(\mathbf{y}^b_k|\mathbf{x},\mathbf{y}^a_k)p_{k|k-1}(\mathbf{x}|\mathbf{y}^a_k) \end{equation*}</p> <p>We can simplify this expression by making the reasonable assumption that the measurement model for $\mathbf{y}^b_k$ is independent of (1) the iteration step $k$ and (2) the value of the other measurement $\mathbf{y}^a$, so that $p(\mathbf{y}^b_k|\mathbf{x},\mathbf{y}^a_k) = p(\mathbf{y}^b|\mathbf{x})$. This results in:</p> <p>\begin{equation} \label{eq:second_measurement} p(\mathbf{x}|\mathbf{y}^a_k,\mathbf{y}^b_k) = \eta^b p(\mathbf{y}^b|\mathbf{x})p_{k|k-1}(\mathbf{x}|\mathbf{y}^a_k) \end{equation}</p> <p>This is great, but we still need $p_{k|k-1}(\mathbf{x}|\mathbf{y}^a_k)$, the probability distribution of $\mathbf{x}$ conditioned on the value of $\mathbf{y}^a_k$. This term can be found by performing Bayes’ rule a second time:</p> <p>\begin{equation} \label{eq:first_measurement} p(\mathbf{x}|\mathbf{y}^a_k) = \eta^a p(\mathbf{y}^a|\mathbf{x})p_{k|k-1}(\mathbf{x}) \end{equation}</p> <p>Eq. \ref{eq:first_measurement} can be thought of as the first measurement update, which finds the a posteriori probability distribution with respect to measurement $\mathbf{y}^a$.
We then take this updated distribution and perform a second measurement update with $\mathbf{y}^b$, as in Eq. \ref{eq:second_measurement}. The result is an estimate that “fuses” two different measurements from different sensors, verifying our intuition.</p> <h3 id="one-final-note">One final note</h3> <p>Incorporating two different measurements should arrive at the same answer, independent of the order in which you incorporate them. Indeed, you can see this by substituting Eq. \ref{eq:first_measurement} into Eq. \ref{eq:second_measurement}:</p> <p>\begin{equation} p(\mathbf{x}|\mathbf{y}^a_k,\mathbf{y}^b_k) = \eta^a\eta^b p(\mathbf{y}^a|\mathbf{x})p(\mathbf{y}^b|\mathbf{x})p_{k|k-1}(\mathbf{x}) \end{equation}</p> <p>The measurement update is symmetric in the labels for measurements $a$ and $b$, so it cannot depend on the order in which the measurement updates are applied.</p>Jacob Higginsjdh4je@virginia.eduHow do you fuse measurements together?Entropy as Information2020-12-23T00:00:00-08:002020-12-23T00:00:00-08:00https://jacobhiggins.github.io/posts/2020/12/blog-post<h3 id="how-is-information-quantified">How is information quantified?</h3> <p>This is a big question that I had when I first heard about <a href="https://en.wikipedia.org/wiki/Information_theory">information theory</a>. When you or I think of words with a lot of information, we might imagine a juicy bit of gossip spread around the office, or maybe a very interesting news article describing the latest political upset. But this kind of information is really subjective, meaning what is a lot of information for one person is old news to another. How do you quantify something as ambiguous as information?</p> <p>First I’ll say that in information theory, information is <em>not</em> about the meaning behind the words. In fact, the words could be literally meaningless. They don’t even have to be words!
Sure, we work with things that do ultimately have meanings behind them, but information theory doesn’t care about that. What does information theory care about then? In a nutshell, “surprise.” Information theory tries to quantify how surprised you are by a certain outcome. If you think about it, information and surprise are related: if you are surprised by something, that something probably contains a good piece of information. Again, though, “surprise” is subjective. How can you quantify these ideas?</p> <p>Information theory relies on the probability of something (or multiple things) happening in order to quantify surprise. If something is unlikely to happen, then you would be surprised if that thing happens. But since it is unlikely to happen, you will not be surprised on average. On the other hand, if something is very likely, then you won’t be surprised on average either. This notion is captured by entropy, which is a measure of the uncertainty of a system. In this blog post, I talk about how I like to think of entropy and how it relates to information theory. Later, I’ll discuss how these ideas can be applied to autonomous exploration.</p> <h3 id="a-crystal-ball-that-predicts-the-weather">A crystal ball that predicts the weather</h3> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/EntropyAsInformation/crystalball.png" /> </p> <p>Imagine your friend asks you for a crystal ball for Christmas; specifically, this crystal ball should predict whether or not it will rain today. This is an example of a binary predictor, where there are two mutually exclusive options to choose from. When your friend queries this crystal ball, it gives them a prediction of rain or no rain. For now, we’ll just look at the probability that the crystal ball is correct, $P(correct)$.</p> <p>You don’t really like your friend, so you want to get them a crystal ball that produces the worst results possible. What would that be?
Well, the <em>best</em> results would be a crystal ball that is always right, or $P(correct)=1$. You might reason that the worst result would then be a crystal ball that always gives the wrong answer, or $P(correct)=0$. Here is a table of what the completely wrong crystal ball might predict over the course of five days:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/EntropyAsInformation/wrong_predictions.png" /> </p> <p>At first, your friend might be really peeved that they keep getting caught in the rain without an umbrella, but after they dry off they might realize that they can still use the crystal ball: if it is always wrong, then they just assume the <em>opposite</em> of whatever prediction it gives to get the right answer. If it says it will rain, then your friend knows it’s a sunny day, and if it says there will be no rain that day, they better bring a coat!</p> <p>Imagine taking a multiple choice test where each question has only two options. If you’re looking to be wrong on every question, you have to know every right answer in one way or another. From the perspective of information theory, always being wrong is equivalent to always being right, since you need the same information to do either. But if this is the case, what makes the worst crystal ball?</p> <p>The worst gift would be a crystal ball that is sometimes right, sometimes wrong. In fact, your intuition might be telling you that the worst case scenario is a crystal ball that is correct only $50\%$ of the time, and you would be correct. Think about it: if the crystal ball was, say, $90\%$ correct, then you know that on average you could trust what it tells you. But if the crystal ball was right only half the time, then you wouldn’t be able to tell at all if it will rain that day, not even on average.
In other words, the crystal ball that has $P(correct)=P(incorrect)=0.5$ cannot reliably give you any information, and you will always be surprised by what the weather is when you actually walk outside. If instead $P(correct)=1$ or $P(incorrect)=1$, then you will never be surprised.</p> <h3 id="information-content-and-shannon-entropy">Information Content and Shannon Entropy</h3> <p>How can we further strengthen these notions of surprise when dealing with random variables like the weather? For a single outcome, we just saw that the amount of surprise for that outcome is related to its probability. In information theory, surprise is quantified by something called the information content $I$, defined for a random variable $X$ and a particular outcome $O$ of that variable with probability $p_O$:</p> <p>\begin{equation} I_X(O) \equiv -\ln(p_O) \end{equation}</p> <p>There is no strict derivation for this quantity, it has no physical meaning, and in a sense it is completely made up by humans. But it <em>does</em> have nice properties that make it ideal to work with. We won’t discuss them all here, but notice how if $p_O=1$, then $I_X(O)=0$. In English, if a particular outcome is certain to occur, then we are not at all surprised by the outcome of this random variable. On the other hand, if $p_O\rightarrow 0$, then $I_X(O)\rightarrow\infty$. This means that as an event becomes more and more unlikely to occur, we become more and more surprised if we actually see that outcome.</p> <p>In our crystal ball example, we can track the information content by considering whether the weather prediction is correct, with probability $p_c$, or incorrect, with probability $p_{\neg c}$. If $p_c=1$, then we are not at all surprised if the crystal ball correctly predicts the weather, $I(c)=0$, but we will be very surprised if it is incorrect, $I(\neg c)=\infty$.
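</p>

<p>This limiting behavior is easy to check directly from the definition:</p>

```python
import math

# Information content ("surprise") of an outcome with probability p: I = -ln(p).
def information_content(p):
    return -math.log(p)

sure_thing = information_content(1.0)    # certain outcome: zero surprise
coin_flip = information_content(0.5)
long_shot = information_content(0.01)    # rare outcome: large surprise
```

<p>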
Since the information content is tied to the outcome of a random variable, we cannot talk about the information content of the crystal ball; we can only state the information content of the crystal ball <em>when it gives us a particular outcome</em>. How can we talk about the crystal ball without referencing a certain outcome?</p> <p>The answer comes from averaging over all possible outcomes, to find the average information content provided by the crystal ball. The average information content is given by a quantity called the entropy $H(X)$ of the random variable $X$:</p> <p>\begin{equation} H(X) = -\sum_Op_O\ln(p_O) \end{equation}</p> <p>The information entropy quantifies how much information can be gleaned from a random variable $X$ on average. For a binary random variable with two mutually exclusive outcomes (like our crystal ball), the entropy is given by a sum of two terms:</p> <p>\begin{equation} H = -p_c\ln(p_c) - (1-p_c)\ln(1-p_c) \end{equation}</p> <p>Here is a graph of the entropy as a function of $p_c$:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/EntropyAsInformation/binary_entropy.png" /> </p> <p>The entropy of the crystal ball approaches zero as $p_c\rightarrow 1$ or $p_c\rightarrow 0$. In these scenarios, we are not surprised at the weather when we walk outside because the crystal ball told us, and the crystal ball is completely reliable (even when it is reliably wrong). If $p_c=0.5$, we are maximally surprised by the weather because the crystal ball is maximally uninformative.</p> <h3 id="using-entropy-of-autonomous-exploring">Using Entropy for Autonomous Exploration</h3> <p>This idea of entropy is used a lot in autonomous vehicle motion planning. In particular, it is used when a robot doesn’t know its surroundings and must explore to create a map that it can use. Suppose that a robot is tasked with observing the occupancy of room A and room B.
Somehow, it already knows the occupancy of room B (either it was given this information, or it had already observed the occupancy of that room). The robot must choose which room to go to:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/EntropyAsInformation/entropy_exploration.png" /> </p> <p>The obvious choice is room A, since observing room B wouldn’t result in any net increase in information. In the context of information theory, we would say that the probability of occupancy of room B is either $p_B=1$ or $p_B=0$, depending on whether the room is occupied or not (in the above figure, room B is occupied by a pupper). In either case, the entropy of room B is zero. The probability of occupancy of room A, on the other hand, would be $p_A=0.5$, since we are completely uncertain of its occupancy, meaning entropy is maximized at this probability (this is a case of binary classification, occupied or unoccupied). The robot could recognize that if room A is observed, its probability of occupancy would go from $p_A=0.5$ to either $p_A=1.0$ or $p_A=0.0$, reducing the total entropy of both rooms down to zero. Using information theory for a two-room map might seem like overkill, but consider a map made of more rooms:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/EntropyAsInformation/entropy_exploration_many.png" /> </p> <p>As it explores, the robot has to decide which rooms it will explore, and in what order. In the above picture, a human might reason to explore rooms A-D first, simply because they are rooms with unknown occupancy that are clustered together. With information theory, a robot will do the same with a more analytical reason for doing so: observing rooms A-D first reduces the total entropy of the map more than visiting any other four consecutive rooms would (e.g. D-G only has two rooms of unknown occupancy).</p> <p>In realistic applications, the robot creates an occupancy map by observing the occupancy of the space around it using grid cells. Each grid cell represents a small, discretized section of space that can be either occupied or unoccupied, just like the rooms in the example above. Here is a simple cartoon showing this idea:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/EntropyAsInformation/occupancy_grid.png" /> </p> <p>In order to explore its surroundings, a robot often seeks to reduce the overall entropy of the entire map. By reducing entropy, it reduces the overall “surprise” of where obstacles are in the environment. This is, after all, the ultimate goal of mapping: to not be surprised by what the robot observes!</p> <h3 id="conclusion">Conclusion</h3> <p>The main purpose of this blog post was to give an intuition behind entropy as a measure of information. Specifically, it showed how entropy can be linked to the “surprise”, or information, available in a random variable. These ideas were then quickly linked to autonomous navigation when exploring an unknown environment.
Although a detailed algorithm was not provided (maybe the subject of future blog posts), the general idea was proposed that a robot can create a better map by reducing the overall entropy of the probabilistic occupancy grid that details the known locations of obstacles in the environment.</p>Jacob Higginsjdh4je@virginia.eduHow is information quantified?Exploring Solutions of the Hamilton-Jacobi-Bellman Equation2020-11-22T00:00:00-08:002020-11-22T00:00:00-08:00https://jacobhiggins.github.io/posts/2020/11/blog-post<h3 id="does-the-hjb-equation-really-find-the-optimal-control-law">Does the HJB equation really find the optimal control law?</h3> <p>In optimal control theory, the Hamilton-Jacobi-Bellman equation is a PDE that gives a necessary and sufficient condition for optimal control with respect to a cost function. In other words, if we can solve the HJB equation, then we find <em>the</em> optimal control law. In most examples detailing how the HJB equation is solved, the discussion stops as soon as the answer is found. This is usually fine, but I’ve always wondered if the answer we find is actually the optimal answer. Can we explore the optimality of HJB equation solutions using graphs?</p> <p>In this blog post, I’ll walk through the basic steps required to find the optimal control law with the HJB equation. After that, I’ll play around with the optimal solution by graphing solutions that are perturbed by small amounts, showing graphically how the optimal solution minimizes the cost function.</p> <h3 id="solving-the-hjb-equation">Solving the HJB equation</h3> <p>Suppose we define the following cost function:</p> <p>\begin{equation} J(x(t),t) = h(x(T),u(T)) + \int_t^T g(x(\tau),u(\tau)) d\tau \end{equation}</p> <p>Here, $g(x,u)$ is a (usually positive definite) function that describes the instantaneous cost that $J$ accrues at each time $\tau$. Similarly, $h(x,u)$ describes the final cost at terminal time $T$.
Notice how in this context, the cost function is a function of time $t$, and describes the future costs accrued until $T$. This is a consequence of <a href="https://en.wikipedia.org/wiki/Bellman_equation#Bellman's_principle_of_optimality">Bellman’s principle of optimality</a>. I’ll spare you the long explanation, as a lot of other people have already covered it in many different contexts. The basic idea, though, is that if I want to achieve a certain state in an optimal way, I should work backwards from that desired state to any other feasible state. An optimal strategy found while working backwards must also be <em>the</em> optimal strategy when moving forward instead. The cost function is also a function of the current state $x(t)$.</p> <p>For a given starting state $x_0$, future states evolve according to some differential equation:</p> <p>\begin{equation} \dot{x} = f(x,u) \end{equation}</p> <p>Optimal control theory seeks a control law $u := u(x)$ such that the cost $J$ is minimized. This is where the HJB equation comes into play. It says that if we find a control law that satisfies the following PDE:</p> <p>\begin{equation} \label{eq:HJB} J^*_t(x,t) + \min_u( g(x,u) + J^*_x \cdot f(x,u) ) = 0, \end{equation}</p> <p>then we have found the optimal control law for the associated cost function. Notice that the HJB equation includes the optimal cost function $J^*(x,t)$. That is the value of the cost function starting at time $t$ and state $x(t)$, following the optimal control law $u = u^*(x)$. This PDE has the following boundary condition:</p> <p>\begin{equation} J^*(x(T),T) = h(x(T),T) \end{equation}</p> <p>Again because of the principle of optimality, our boundary condition is placed at the terminal time $T$.</p> <p>In order to get any further, we now have to define our system and cost function. Let’s define the system dynamics.
To keep things simple, I’ll stick with a very basic LTI one-dimensional system:</p> <p>\begin{equation} \dot{x} = ax + bu \end{equation}</p> <p>Now let’s define a reasonable cost function as:</p> <p>\begin{equation} J(x(t),t) = 0.5kx^2(T) + \int_t^T cx(\tau)^2 + du(\tau)^2 d\tau \end{equation}</p> <p>The coefficients $a$, $b$, $c$, $d$ and $k$ are kept general for now. With these definitions, we can start to actually solve the HJB equation.</p> <p>First, let’s focus on the following term:</p> <p>\begin{equation*} \min_u\left( g(x,u) + J^*_x \cdot f(x,u) \right) \end{equation*}</p> <p>Replacing $g(x,u)$ and $f(x,u)$ with how they’re defined above, we get:</p> <p>\begin{equation*} \min_u\left( cx^2 + du^2 + J^*_x \cdot (ax + bu) \right) \end{equation*}</p> <p>Because $J^*$ is the optimal cost function, it is associated with the optimal control $u^*$ and in no way can depend on the yet-unspecified control $u$. Thus, the term $J^*_x$ cannot depend on $u$. Minimization is performed by finding where the derivative of the term is zero:</p> <p>\begin{equation*} 2du + bJ^*_x = 0 \end{equation*}</p> <p>Rearranging gives the following relationship:</p> <p>\begin{equation*} \label{eq:optimal_control} u^*(x) = -\frac{b}{2d}J^*_x \end{equation*}</p> <p>This is the control law that optimizes the cost, but note that it’s in terms of the cost function! To get the optimal control, we need the cost function. Substituting Eq. \ref{eq:optimal_control} into Eq. \ref{eq:HJB}, we get the following PDE:</p> <p>\begin{equation} J^*_t + cx^2 - \frac{b^2}{4d}(J^*_x)^2 + axJ^*_x = 0 \end{equation}</p> <p>Like any other PDE, one approach is to guess that the solution $J^*$ has a specific form, then check to see if that’s the solution. Once we find a solution, the uniqueness theorem says that it must be <em>the</em> solution.
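The minimization step above is easy to sanity-check symbolically. Here is a short sketch in Python’s sympy (my own check, not part of the original derivation) that recovers the same control law and the minimized term of the PDE:

```python
import sympy as sp

# Symbols from the post: dynamics xdot = a*x + b*u, running cost g = c*x^2 + d*u^2,
# with Jx standing in for the partial derivative J*_x (which cannot depend on u).
a, b, c, d, x, u, Jx = sp.symbols('a b c d x u J_x', real=True)

# The term inside min_u( ... ) of the HJB equation
H = c*x**2 + d*u**2 + Jx*(a*x + b*u)

# Minimize over u by setting the derivative to zero
u_star = sp.solve(sp.Eq(sp.diff(H, u), 0), u)[0]
assert sp.simplify(u_star + b*Jx/(2*d)) == 0  # u* = -(b/2d) J*_x

# Substituting u* back in gives the minimized term that appears in the PDE
H_min = sp.simplify(H.subs(u, u_star))
assert sp.expand(H_min - (c*x**2 - b**2*Jx**2/(4*d) + a*x*Jx)) == 0
```

The second assertion confirms that the minimized term is $cx^2 - \frac{b^2}{4d}(J^*_x)^2 + axJ^*_x$, matching the PDE above.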
We know that the final cost is $J^*(x(T),T) = 0.5kx^2(T)$, so one reasonable guess is to check solutions of the following form:</p> <p>\begin{equation} J^*(x,t) = 0.5K(t)x^2 \end{equation}</p> <p>This way, our boundary term becomes $K(T)=k$. Performing the various partials and plugging in gives an ODE to solve:</p> <p>\begin{equation} \dot{K} = \frac{b^2}{2d}K^2 - 2aK - 2c \end{equation}</p> <p>Notice that the $x^2$ terms dropped out, resulting in a first-order ODE that solves for one variable $K(t)$ over time. This ODE can be solved using separation of variables. First, rearrange the derivative so that the differentials are on opposite sides of the equality:</p> <p>\begin{equation} \frac{dK}{\frac{b^2}{2d}K^2 - 2aK - 2c} = dt \end{equation}</p> <p>Next, factor the bottom quadratic using its two roots: $\frac{b^2}{2d}K^2 - 2aK - 2c = \frac{b^2}{2d}(K+z_1)(K+z_2)$. I won’t give the general expressions for the roots here, since they can easily be solved for in most languages. But, with this we can break up the fraction as so:</p> <p>\begin{equation} \frac{dK}{\frac{b^2}{2d}(K+z_1)(K+z_2)} = \frac{2d}{b^2}dK\left( \frac{1}{K+z_1}\frac{1}{K+z_2} \right) = \frac{2d}{b^2}dK\left( \frac{c_1}{K+z_1} + \frac{c_2}{K+z_2} \right) \end{equation}</p> <p>The constants $c_1$ and $c_2$ are found by combining the two fractions in the final equality so we get the same denominator as the middle equality, and solving so we get the same numerator in the middle equality as well. The constants turn out to be $c_1 = 1/(z_2 - z_1)$ and $c_2 = 1/(z_1 - z_2) = -c_1$. To avoid overloading the cost coefficient $c$, collect the prefactors into a single constant $\gamma = \frac{2d}{b^2}c_1 = \frac{2d}{b^2(z_2 - z_1)}$. So the integration we perform is:</p> <p>\begin{equation} \int_{K(t)}^{K(T)} \left( \frac{\gamma}{K+z_1} - \frac{\gamma}{K+z_2} \right) dK = \int_t^T dt \end{equation}</p> <p>Doing this results in the final expression for $K(t)$:</p> <p>\begin{equation} K(t) = \frac{z_1(K(T)+z_2)e^{(T-t)/\gamma} - z_2(K(T)+z_1)}{K(T) + z_1 - (K(T) + z_2)e^{(T-t)/\gamma}} \end{equation}</p> <p>Now that we have this, our problem is solved, since we have the optimal cost function $J^*=0.5K(t)x^2$, and with it we can get the optimal control $u^*=-\frac{b}{2d}J_x^*=-\frac{b}{2d}K(t)x(t)$.</p> <p>Like I said in the introduction, this is where most problems stop. I want to go a step further and graphically motivate that this is, in fact, the policy that minimizes the cost function. Let’s assume the following values for the constants we’ve been working with: $a=-10$, $b=1$, $c=0.25$, $d=0.5$ and terminal condition $K(T)=1$. Also, let us assume that $T=1$ second. First, let us see the motion of the system under this control with $x(t=0)=1$:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/OptimalControlExample/optimal_motion_control.png" /> </p> <p>Now let’s introduce a small perturbation to the control policy. Recall that our cost function had a terminal boundary condition $J(x(T),T) = h(x(T),T)$ at $T$ seconds into the future. When perturbing our control, we have to make sure that the perturbation respects this boundary condition. The reason is that boundary conditions are an assumption made when solving a PDE or ODE, and are required to “pin” a solution down in the infinite solution space of the differential equation. Solving the differential equations gives us a specific answer <em>only if</em> we provide boundary conditions.
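As a quick aside, $K(t)$ and the closed-loop motion can also be obtained numerically instead of through the closed form. My original plots were generated in Matlab; the sketch below is a Python/scipy stand-in (my own, using the constants above) that integrates the Riccati ODE for $K(t)$ backward in time from the terminal condition and then simulates $\dot{x} = ax + bu^*$:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Constants from the post
a, b, c, d = -10.0, 1.0, 0.25, 0.5
T, K_T, x0 = 1.0, 1.0, 1.0

# Riccati ODE for K(t), obtained by plugging J* = 0.5*K(t)*x^2 into the HJB PDE;
# integrate it backward in time from the terminal condition K(T) = 1.
K_dot = lambda t, K: (b**2 / (2*d)) * K**2 - 2*a*K - 2*c
K_sol = solve_ivp(K_dot, [T, 0.0], [K_T], dense_output=True, rtol=1e-9, atol=1e-12)
K = lambda t: float(K_sol.sol(t)[0])

# Closed-loop motion under the optimal feedback u* = -(b/(2d)) K(t) x
x_dot = lambda t, x: [a*x[0] + b*(-(b/(2*d)) * K(t) * x[0])]
x_sol = solve_ivp(x_dot, [0.0, T], [x0], dense_output=True, rtol=1e-9, atol=1e-12)

print(K(T))              # recovers the boundary value K(T) = 1
print(x_sol.sol(T)[0])   # the state has decayed to (nearly) zero by t = T
```

Plotting `x_sol` over $[0, T]$ should reproduce the motion shown in the figure above.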
Changing the boundary conditions would mean we’re changing the problem we’re solving, and comparing those solutions would be like comparing apples and oranges.</p> <p>To this end, let us choose a simple perturbation that is zero at $t=T$:</p> <p>\begin{equation} K’(t) = K(t) + \epsilon (T-t) \end{equation}</p> <p>The control is found using $K’(t)$, while the cost is still evaluated with the original, unperturbed cost function. This is because we want to see if a different controller is actually more optimal for the same cost function. Plotting the cost over the perturbation $\epsilon$ results in the following graph:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/OptimalControlExample/perturbed_cost_1.png" /> </p> <p>Clearly, this cost is minimized when $\epsilon = 0$! What does the control value and motion look like over time with this perturbation? Here is a graph with $\epsilon = 2$:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/OptimalControlExample/perturbed_motion_control.png" /> </p> <p>Notice that by giving $K(t)$ a positive perturbation, we make the control have a bigger magnitude (more negative), but this also means that the state variable approaches $x=0$ faster. This give-and-take ultimately leads to a higher cost. The HJB gives us the control that perfectly balances between these two competing objectives.</p> <p>What if we instead perturbed $K(t)$ by a more drastic function:</p> <p>\begin{equation} K’(t) = K(t) + \epsilon\left[ (T-t) + (T-t)^2 + (T-t)^3 + (T-t)^4 + (T-t)^5 \right] \end{equation}</p> <p>Here is the resulting cost over the parameter $\epsilon$:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/OptimalControlExample/perturbed_cost_2.png" /> </p> <p>Again, we find the same thing. Instead of perturbing the function $K(t)$, what if we instead perturbed the control directly?
Like so:</p> <p>\begin{equation} u(x,t) = u^*(x,t) + \epsilon\left[ (T-t) + (T-t)^2 + (T-t)^3 + (T-t)^4 + (T-t)^5 \right] \end{equation}</p> <p>In this way, we’re still respecting the terminal boundary condition. This perturbation produces the following graph:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/OptimalControlExample/perturbed_cost_3.png" /> </p> <h3 id="conclusions">Conclusions</h3> <p>These graphs are pretty cool, but one question is whether this proves that the HJB results in an optimal control policy. The answer is no, these graphs don’t and can’t prove optimality, since there are an infinite number of perturbations to try. The proof of optimality ultimately lies in the derivation of the HJB to begin with. Nevertheless, these graphs are a really nice way to build an intuition behind the results of the HJB.</p>Jacob Higginsjdh4je@virginia.eduDoes the HJB equation really find the optimal control law?Quick Refresher on Laplace Transform2020-09-08T00:00:00-07:002020-09-08T00:00:00-07:00https://jacobhiggins.github.io/posts/2020/09/blog-post-1<h3 id="why-talk-about-the-laplace-transform">Why talk about the Laplace Transform?</h3> <p>One thing that I think is sorely missing from a lot of texts and online resources that talk about the Laplace transform is a high-level, not-so-mathy discussion on the intuition behind the Laplace transform. This sort of discussion might not only be helpful to a beginner to the subject of control, but also to a long-time practitioner that needs a refresher after several months on another project (like myself). In this blog post, I talk about several high-level concepts regarding the Laplace transform so that it hopefully becomes more intuitive and less abstract for anyone that works with it in control theory.</p> <h1 id="what-does-the-laplace-transform-do">What does the Laplace Transform do?</h1> <p>Before we jump into formal definitions, let’s quickly talk about the general idea behind transforms.
Take the following vector that I’ve drawn:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/LaplaceTransform/vector1.png" /> </p> <p>How I’ve drawn it, the vector points equally in the x-direction and y-direction, so it might be defined by $&lt;1,1&gt;$ (ignoring magnitude). But this description depends entirely on the basis. If instead I used a different basis:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/LaplaceTransform/vector2.png" /> </p> <p>In this prime basis, the vector clearly lies more along the x’ direction than the y’, so it might be defined by $&lt;0.99,0.01&gt;$ (again, ignoring magnitude). The point is that the same vector can be described in different ways, and if I were to go from one basis to another, I would merely be <em>transforming</em> the information that is there.</p> <p>That, in a very small nutshell, is the big idea behind transforms: having different ways of defining the exact same information. This is exactly what the Laplace transform does. In control theory, one way to define a signal is how it looks in time, and is given by a function $f(t)$. The rest of this section talks about what kind of basis the Laplace transform transforms to.</p> <p>The formal definition of the Laplace transform is given below:</p> <p>\begin{equation} F(s) = \int_0^\infty f(t)e^{-st} dt \end{equation}</p> <p>Notice that the time variable $t$ is integrated out, leaving a new variable $s$ and even a new function $F(s)$ to describe the information contained in the signal.
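As a quick sanity check of this definition, the integral can be evaluated numerically for a signal whose transform is known in closed form. The sketch below (Python/scipy, my own illustration) transforms $f(t) = e^{-2t}$, whose Laplace transform is $1/(s+2)$:

```python
import numpy as np
from scipy.integrate import quad

# Numerically evaluate F(s) = \int_0^inf f(t) e^{-s t} dt for real s
def laplace(f, s):
    val, _ = quad(lambda t: f(t) * np.exp(-s * t), 0.0, np.inf)
    return val

# Known transform pair: f(t) = e^{-2t}  <-->  F(s) = 1/(s + 2)
f = lambda t: np.exp(-2.0 * t)
for s in [1.0, 3.0, 10.0]:
    print(s, laplace(f, s), 1.0 / (s + 2.0))  # numeric and analytic values agree
```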
Again, the signal itself does not change, <em>we are just looking at that information in a different basis</em>.</p> <p>This definition of the Laplace transform is often connected with the Fourier transform, which is given by:</p> <p>\begin{equation} H(\omega) = \int_{-\infty}^\infty f(t)e^{-i\omega t} dt \end{equation}</p> <p>This connection is made by setting $s=i\omega$ and changing the lower bound on the integral to $0$ in the Laplace transform. Clearly, the jump from Fourier to Laplace is a very small one, so I often like to think of the Laplace in terms of Fourier. The Fourier transform takes a function that is defined in time or space and transforms it to a frequency space, so that the function $H(\omega)$ describes the magnitude of the sinusoidal signal at frequency $\omega$ that comprises the time signal $f(t)$. This is a definition that has a strong physical connection, so I find this the most intuitive way of approaching the Laplace transform.</p> <p>The big difference between the two transforms is that $\omega$ is strictly real in the Fourier transform, but $s$ is allowed to be complex in the Laplace transform. What does this difference do? That is a very interesting question, and probably the subject of a future blog post. The answer lies in the fact that $e^{-i\omega t}$ is an <em>orthogonal</em> basis with respect to $\omega$, while $e^{-st}$ is not orthogonal with respect to a generally complex $s$ variable. Orthogonality has its roots in linear algebra, and basically asks how mathematically similar two basis functions are. This allows me to say that if $H(\omega = 1 \text{Hz}) = 10$, the signal in the time domain contains a component at a frequency of $1$ Hz whose amplitude has a magnitude of $10$. The Fourier transform has a strong physical interpretation. If, however, I find that $F(s = 1) = 10$ in the Laplace domain, I can’t really give that a physical interpretation; that feature simply isn’t there in the Laplace transform.
But what we gain with the Laplace transform is practicality when manipulating signals and systems. The Fourier transform simply doesn’t allow the kind of quick analysis of systems that not only oscillate (i.e. Fourier) but also exponentially decay/grow, which is a big component in determining stability.</p> <p>So, in the end: the Fourier transform is a nice way of making physical connections, and is helpful to keep in mind, but the Laplace transform is easier to work with when determining stability of the system, so control theory works with Laplace.</p> <h1 id="applying-laplace-transforms-in-control-theory">Applying Laplace Transforms in Control Theory</h1> <p>With an understanding of what the Laplace Transform does, let’s turn to how it’s used in control theory. There are two ways that a Laplace analysis can show up in control:</p> <ul> <li>Transforming a signal</li> <li>Finding a transfer function</li> </ul> <p>Below is a picture of the simplest block diagram I can think of:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/LaplaceTransform/system_block.png" /> </p> <p>The idea is this: we have a signal $u(t)$ that is defined in time, and the system takes that signal and outputs another signal $y(t)$, also a function of time. What if, instead of working exclusively with time, we work with the Laplace $s$ variable? We can describe the signal with $s$ (via the Laplace transformation) and we can describe how that signal is changed in the $s$ basis as it passes through the system (via the transfer function). A priori, we have no reason to believe that this would help. But it does help a lot.</p> <p>Again, almost any text book or online source talks about these ideas individually, so I will not really dive into the details, but instead briefly talk about both steps.</p> <h1 id="transforming-a-signal">Transforming a Signal</h1> <p>In order to transform a signal $f(t)$, the above definition of the Laplace transform is applied.
One basic example is a step function, a.k.a. the Heaviside function $\Theta(t)$, shown below:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/2020/LaplaceTransform/heaviside.png" /> </p> <p>Its Laplace transform $\Theta(s)$ is given by:</p> <p>\begin{equation} \Theta(s) = 1/s \end{equation}</p> <p>If we input a Heaviside function into our system above, then we say $\Theta(t) \equiv u(t)$ and accordingly $\Theta(s) \equiv U(s)$.</p> <p>Not all signals in the time domain have an analytical form in the Laplace domain. In practice, you either (1) work with signals that have well-known Laplace transforms (there are many tables that are easy to google), or (2) use software like Matlab to perform a numerical analysis in the Laplace domain.</p> <h1 id="finding-a-transfer-function">Finding a transfer function</h1> <p>How does the system take $u(t)$ and output $y(t)$? This is described by a mathematical model (usually a differential equation) that you either derive from first principles or you find empirically from a process called system identification. For example, suppose we look at the mass-on-spring equation of motion derived from Newton’s 2nd law:</p> <p>\begin{equation} \ddot{y}(t) = -a\dot{y}(t) -by(t) + u(t) \end{equation}</p> <p>Here, $y(t)$ is the position of the mass on a spring and $u(t)$ is the force applied to the system, all in the time domain. Using the Laplace transform, we can transform this differential relationship between $u(t)$ and $y(t)$ in the time domain to an algebraic relationship in the Laplace domain:</p> <p>\begin{equation} \frac{Y(s)}{U(s)} = \frac{1}{s^2 + as + b} \end{equation}</p> <p>Here, the algebraic relationship of $Y(s)/U(s)$ is often defined as the transfer function $G(s)$.
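To see the transfer function in action, scipy can simulate the response of $G(s)$ to a step input directly. The sketch below is my own illustration; the numerical values of $a$ and $b$ are picks of mine, not from the post:

```python
import numpy as np
from scipy import signal

# Mass-on-spring transfer function G(s) = 1/(s^2 + a s + b) from above;
# a and b are illustrative values (a damped, oscillatory system).
a, b = 2.0, 5.0
G = signal.TransferFunction([1.0], [1.0, a, b])

# Response y(t) to a step (Heaviside) input, i.e. Y(s) = G(s) * (1/s)
t, y = signal.step(G, T=np.linspace(0.0, 10.0, 2000))

# Final value theorem: y(t -> inf) = lim_{s->0} s*G(s)*(1/s) = G(0) = 1/b
print(y[-1], 1.0 / b)
```

Plotting `y` against `t` shows the damped oscillation settling at the DC gain $G(0) = 1/b$.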
Furthermore, because of this algebraic relationship, finding the output $Y(s)$ given an input signal $U(s)$ is a simple matter of multiplication: $Y(s) = \frac{Y(s)}{U(s)}U(s) = G(s)U(s)$.</p> <h1 id="discussion-and-concluding-remarks">Discussion and Concluding Remarks</h1> <p>Notice how the input signal $U(s)$ and the transfer function $G(s)$ almost always have the same form: a fraction that has some polynomial of $s$ both in the numerator and in the denominator. To me, this is what makes the Laplace transform so confusing from time to time; because both signal and transfer function <em>look</em> the same, it is easy to conflate their purpose. But, remember that they have very different interpretations! Signals are what we ultimately care about, and the Laplace transform is a simple change of basis for that signal. Transfer functions tell us the relationship between input and output in the Laplace domain.</p> <p>A lot of information (such as stability) can be extracted from $G(s)$ alone, and a lot of time is spent in text books examining this transfer function $G(s)$ in various setups. Always remember, though, that if I want to know what $Y(s)$ looks like, I must choose an input function $U(s)$ and let $Y(s) = G(s)U(s)$. I can’t know the exact output $Y(s)$ without knowing $U(s)$!</p> <p>This is where I’ll end the very brief overview of the Laplace transform in control. Again, the purpose of this post is to provide a theoretical refresher on what the Laplace transform does in the grand scheme of things.
After this refresher, you can hopefully dive into any control theory text book and have this context in the back of your mind.</p>Jacob Higginsjdh4je@virginia.eduWhy talk about the Laplace Transform?Using the Correct Lagrangian for the Inverted Pendulum2020-08-09T00:00:00-07:002020-08-09T00:00:00-07:00https://jacobhiggins.github.io/posts/2020/08/blog-post-1<h3 id="the-lagrangian-and-the-inverted-pendulum">The Lagrangian and the Inverted Pendulum</h3> <p>The inverted pendulum is a canonical system studied extensively in control theory because it has a simple goal – keep a pendulum upright by moving its base left to right – but the equations of motion are nonlinear, making it a good test application for novel controllers. These nonlinear equations are found through two methods: application of Newton’s second law, and Lagrange’s method. This post will focus on the latter, and how to use Lagrange’s method correctly to get the equations of motion for the inverted pendulum.</p> <p>A quick google search for the equations of motion for the inverted pendulum gives many different results that look similar, but have different notations and assumptions on the system. Quickly applying any one of these might be good enough for your application, but how do you know for sure your Lagrangian is correct? The only way to find out is to derive it from scratch, which we will do right now.</p> <h1 id="what-is-a-lagrangian">What is a Lagrangian?</h1> <p>The Lagrangian is a function that mixes together the kinetic and potential energy functions of a system in a special way. It is studied extensively in Lagrangian mechanics, which is taught in many undergraduate and graduate physics courses.
In most applications, the Lagrangian $L$ is posed as the kinetic energy $T$ of the system minus the potential energy $V$:</p> <p>\begin{equation} L = T - V \end{equation}</p> <p>As is known by most scientists and engineers, the kinetic energy is usually a function of velocity $\dot{x}$, and the potential energy is usually a function of position $x$. So the variables of the Lagrangian can be explicitly stated as:</p> <p>\begin{equation} L(x,\dot{x}) = T - V \end{equation}</p> <p>In order to analyze any system, both $T$ and $V$ need to be defined. Let’s look at the inverted pendulum system to do so.</p> <h1 id="the-inverted-pendulum-system">The Inverted Pendulum System</h1> <p>For simplicity, we denote things like angle orientation and masses according to the <a href="https://en.wikipedia.org/wiki/Inverted_pendulum#:~:text=An%20inverted%20pendulum%20is%20a,additional%20help%20will%20fall%20over.">wikipedia entry on the inverted pendulum</a>. Here is a picture of the system:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/IP1/ip_system.png" /> </p> <!-- ![Inverted Pendulum System](/images/blog_pics/IP1/ip_system.png) --> <p>A force $F$ pushes a cart of mass $M$ left to right, while an inverted pendulum of mass $m$ is left to swing freely about a point fixed on the cart. While this model makes assumptions that are reasonable for many systems (e.g. no frictional forces), there is one assumption that might not be so reasonable: the pendulum is treated as a point mass.</p> <p>The inspiration for this blog post really came when I asked the question: “what if we didn’t make this assumption?” One simple answer is that we can <em>always</em> make this assumption, since we can replace any physical pendulum with a point mass located at the pendulum’s center of mass. The problem is that this answer is wrong, and to see why, consider the following. Suppose we had a rod whose mass was almost all concentrated at its end. 
In this case, simplifying all mass to a point at the end makes sense:</p> <p><img src="/images/blog_pics/IP1/moment_of_inertia1.png" alt="Moment of Inertia 1" /></p> <p>The moment of inertia for this system is simple: $I_1 = mL^2$. Thus, if rotating at angular speed $\omega$, this system would then have rotational kinetic energy $T_1 = mL^2\omega^2/2$.</p> <p>Now, cut the mass in half and place one half at the pivot, and one half a distance $2L$ from the pivot:</p> <p><img src="/images/blog_pics/IP1/moment_of_inertia2.png" alt="Moment of Inertia 2" /></p> <p>The moment of inertia for this system is now $I_2 = (m/2)(2L)^2 = 2mL^2$. Even though the center of mass is the same, the moment of inertia has doubled! The immediate consequence is that the rotational energies between the two systems would be different if they have the same rotational speed $\omega$. Thus, we can’t simply replace the physical pendulum with a point mass in any situation when rotations are concerned.</p> <p>This is why it can be confusing when using Lagrangian mechanics on systems that rotate. How should we proceed?</p> <h1 id="calculating-kinetic-and-potential-energy">Calculating Kinetic and Potential Energy</h1> <p>One fool-proof method of calculating kinetic energy of any system is to take the kinetic energy of a point mass, $T_{pm}=\frac{mv^2}{2}$, and recognize that any physical system can be re-imagined as a collection of small particles of mass $dm$. Adding up the kinetic energies of all these particles (or in the limit of really small particles, taking an integral) gives you the total kinetic energy:</p> <p>\begin{equation} \label{eq:integral_KE} T = \frac{1}{2}\int dmv^2 \end{equation}</p> <p>This neat trick is usually taught at the start of undergraduate physics or calculus 3, and is a go-to if something is confusing you because Eq. \ref{eq:integral_KE} is almost always true.
In fact, it’s usually taught <a href="https://www.feynmanlectures.caltech.edu/I_19.html">when finding moments of inertia for uniform objects</a>.</p> <p>As a quick example, consider the physical rod rotating about a fixed point:</p> <p><img src="/images/blog_pics/IP1/moment_of_inertia3.png" alt="Moment of Inertia 3" /></p> <p>You learn that the rotational kinetic energy of this system is $\frac{1}{2}I\omega^2$ in introductory physics. One derivation of this formula is to start with $\frac{1}{2}\int dmv^2$ and recognize that for points at length $l$ away from the pivot point, the linear velocity is $v=l\omega$. Thus, we can replace this inside the integral:</p> <p>\begin{equation} T = \frac{1}{2}\int dm(l\omega)^2 = \frac{1}{2}\omega^2\int_0^L l^2dm \end{equation}</p> <p>The integral $\int_0^L l^2dm$ is precisely the definition of the moment of inertia $I$, so that we easily recover the formula $\frac{1}{2}I\omega^2$.</p> <p>Now let us write down the kinetic energy in terms of the generalized position coordinates for this system. The generalized coordinates for the inverted pendulum are the position of the base of the cart, $x$, and the angle of the pendulum, $\theta$. The kinetic energy for the cart is easy: $T_{cart}=\frac{M}{2}\dot{x}^2$. In order to find the total kinetic energy of the pendulum, we can take the integral of a bunch of small point masses that make up the pendulum, like we did above.
First write the x-y position of a point on the pendulum as a function of the length along the pendulum $l$:</p> <p>\begin{equation*} x_{pend} = x - l\sin(\theta) \end{equation*} \begin{equation*} y_{pend} = l\cos(\theta) \end{equation*}</p> <p>Next, take the derivative with respect to time:</p> <p>\begin{equation*} \dot{x}_{pend} = \dot{x} - l\dot{\theta}\cos(\theta) \end{equation*} \begin{equation*} \dot{y}_{pend} = -l\dot{\theta}\sin(\theta) \end{equation*}</p> <p>The kinetic energy of a point with mass $dm$ along the length of the pendulum is thus given by: \begin{equation} KE_{pend}(l) = \frac{dm}{2}\left(\dot{x}^2 + l^2\dot{\theta}^2 - 2l\dot{\theta}\dot{x}\cos(\theta)\right) \end{equation}</p> <p>In order to find the total kinetic energy, let’s sum these up, i.e. let’s take the integral over $dm$:</p> <p>\begin{equation*} KE_{pend} = \int_0^LKE_{pend}(l) \end{equation*}</p> <p>Let’s take the terms in the sum one at a time. The first term is simple:</p> <p>\begin{equation*} \int_0^L\frac{dm}{2}\dot{x}^2 = \frac{\dot{x}^2}{2}\int_0^Ldm \end{equation*}</p> <p>Here, defining $dm$ is simple. Along the length of the pendulum, each $dm$ corresponds to a small amount of mass that we add up with the integral. We can define $dm$ based on two assumptions: (1) the mass of the pendulum is uniformly distributed, and (2) the mass of the pendulum must be $m$. Due to (1), the value of $dm$ must be directly proportional to a small length of pendulum $dl$ and independent of $l$ at any point, i.e. $dm=kdl$. By choosing $k=\frac{m}{L}$, we can satisfy point (2): $\frac{m}{L}\int_0^Ldl=\frac{m}{L}L=m$. So this term is:</p> <p>\begin{equation*} \frac{m\dot{x}^2}{2} \end{equation*}</p> <p>This is the energy of the system as it translates from left to right.</p> <p>The second term is also easy, as the integral $\frac{\dot{\theta}^2}{2}\int_0^L dm l^2$ is exactly the same as the example above. It simplifies to $\frac{1}{2}I\dot{\theta}^2$.
Since we are assuming the mass of the pendulum is uniformly distributed, we can immediately make the substitution $I=\frac{1}{3}mL^2$. This term represents the rotational energy of the system.</p> <p>Finally, the last term is a coupling term, since it contains both $\dot{x}$ and $\dot{\theta}$ – this is the term that is responsible for coupling the translational degree of freedom $x$ with the rotational degree of freedom $\theta$. The resulting integral is:</p> <p>\begin{equation*} \dot{x}\dot{\theta}\cos(\theta)\frac{m}{L}\int_0^L l dl = \dot{x}\dot{\theta}\cos(\theta)\frac{mL}{2} \end{equation*}</p> <p>If we assumed the pendulum was a point mass, we would have arrived at a similar coupling term of $\dot{x}\dot{\theta}\cos(\theta)mL$. The factor of 2 comes from the fact that the center of mass for our physical pendulum is located at $L/2$ away from the pivot. Otherwise, the term would be the same!</p> <p>The potential energy is found in a similar way. Since gravity is the only conservative force present, we can write $V=\int_0^Ldm g h=\frac{mg}{L}\cos(\theta)\int_0^Lldl=\frac{mgL}{2}\cos(\theta)$, where we used $h=l\cos(\theta)$.</p> <h1 id="the-correct-lagrangian-eoms-and-simulation">The Correct Lagrangian, EOMs and Simulation</h1> <p>With the kinetic energy $T$ and potential energy $V$ defined above, we are then free to write down the Lagrangian $L=T-V$:</p> <p>\begin{equation} L = \frac{1}{2}(M+m)\dot{x}^2 + \frac{1}{6}mL^2\dot{\theta}^2 - \frac{mL}{2}\dot{x}\dot{\theta}\cos(\theta) - \frac{mgL}{2}\cos(\theta) \end{equation}</p> <p>Applying the <a href="https://en.wikipedia.org/wiki/Euler%E2%80%93Lagrange_equation">Euler-Lagrange</a> relationship in the variables $x$,$\dot{x}$,$\theta$ and $\dot{\theta}$ gives the resulting EOMs for the inverted pendulum.
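If you’d rather not crank through the Euler-Lagrange algebra by hand, it can be done symbolically. Below is a sympy sketch (my own check, not part of the original post, whose code is Matlab) that derives the accelerations from the Lagrangian above, with a generalized force $F$ on the $x$ coordinate:

```python
import sympy as sp

t = sp.symbols('t')
M, m, L, g, F = sp.symbols('M m L g F', positive=True)
x = sp.Function('x')(t)
th = sp.Function('theta')(t)

# The Lagrangian written down above
Lag = (sp.Rational(1, 2)*(M + m)*x.diff(t)**2
       + sp.Rational(1, 6)*m*L**2*th.diff(t)**2
       - sp.Rational(1, 2)*m*L*x.diff(t)*th.diff(t)*sp.cos(th)
       - sp.Rational(1, 2)*m*g*L*sp.cos(th))

# Euler-Lagrange: d/dt(dL/dq') - dL/dq = Q, with Q = F on x and Q = 0 on theta
eq_x = sp.Eq(sp.diff(Lag.diff(x.diff(t)), t) - Lag.diff(x), F)
eq_th = sp.Eq(sp.diff(Lag.diff(th.diff(t)), t) - Lag.diff(th), 0)

# Both equations are linear in the accelerations, so solve for them directly
sol = sp.solve([eq_x, eq_th], [x.diff(t, 2), th.diff(t, 2)], dict=True)[0]
xdd = sp.simplify(sol[x.diff(t, 2)])
print(xdd)
```

The printed expression for $\ddot{x}$ depends only on $\theta$, $\dot{\theta}$ and $F$, which is the rearranged form used for simulation below.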
I won’t go through the steps here, but many others have done so in other parts of the web, so I hope it wouldn’t be too difficult to tackle.</p> <p>Here are the final equations of motion:</p> <p>\begin{equation} \ddot{x} = \frac{1}{M+m}\left( F + \frac{mL}{2}\ddot{\theta}\cos(\theta) - \frac{1}{2}\dot{\theta}^2mL\sin(\theta) \right) \end{equation}</p> <p>\begin{equation} \label{eq:ddot_theta} \ddot{\theta} = \frac{3}{2L}\left( g\sin(\theta) + \ddot{x}\cos(\theta) \right) \end{equation}</p> <p>Here, we defined $F$ as the (controllable) force on the base of the pendulum.</p> <p>Notice how $\ddot{\theta}$ shows up when defining $\ddot{x}$, and vice versa. While this is fine from a physical standpoint, when simulating physics (i.e. integrating ODEs) we often want the highest derivative of the system to be defined only in terms of lower derivatives. For this reason, we can rearrange the equations like so:</p> <p>\begin{equation} \label{eq:ddot_x_simp} \ddot{x} = \frac{F + \frac{3mg}{4}\sin(\theta)\cos(\theta)-\frac{mL}{2}\dot{\theta}^2\sin(\theta)}{M + \left(1 - \frac{3}{4}\cos^2(\theta)\right)m} \end{equation}</p> <p>In this form, we only need $\theta$, $\dot{\theta}$ and $F$ in order to calculate $\ddot{x}$. We may do the same thing with $\ddot{\theta}$, but a quick shortcut is to use the value of $\ddot{x}$ from \ref{eq:ddot_x_simp} and plug it into \ref{eq:ddot_theta}.</p> <p>Below is a simulation that I coded in Matlab that shows these equations in action. Initially, I give the cart and pendulum a small push to the left.
You can see both interact with each other:</p> <p align="center"> <img width="460" height="300" src="/images/blog_pics/IP1/inverted_pend.gif" /> </p>Jacob Higginsjdh4je@virginia.eduThe Lagrangian and the Inverted PendulumReinforcement Learning 1: Policy Iteration, Value Iteration and the Frozen Lake2020-06-22T00:00:00-07:002020-06-22T00:00:00-07:00https://jacobhiggins.github.io/posts/2020/06/blog-post-1<h3 id="first-steps-in-reinforcement-learning">First Steps in Reinforcement Learning</h3> <p>Reinforcement learning as a whole is concerned with learning how to behave to get the best outcome given a situation. Although there are many areas of application, the most well-known is video games. Given where you are in the virtual world and the position of the enemies around you, what’s the best action to take? Should you walk forward or jump on the platform above? These questions are split-second decisions for humans, but are non-trivial for a computer to figure out.</p> <p>In practice, reinforcement learning operates a lot like how you or I might learn to play a video game. We give the computer goals to achieve and things to avoid, and it takes many attempts (called “episodes”) to figure out how to play the game; if there’s an enemy ahead, jump, else move forward. Codifying these ideas into a mathematical framework is the major idea behind reinforcement learning.</p> <p>In this post, I’ll review the basic ideas behind reinforcement learning and discuss two basic algorithms - policy iteration and value iteration. I’ll also explain how to use OpenAI Gym, a popular Python package used for testing different RL algorithms. With it, I’ll use policy iteration and value iteration to teach a computer how to walk on a frozen lake.</p> <p>The next section is a very brief overview of basic concepts in RL. Although they aren’t difficult, there are many nuances and side discussions that are worth having but aren’t included for brevity.
The reader is directed to <a href="https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf">Sutton and Barto’s Introduction to Reinforcement Learning</a>, a text that deservedly holds a place as the first text almost everyone encounters when first learning about RL. The notation I use in this post is borrowed from that book.</p> <h1 id="the-best-action-to-take-the-foundation-of-rl">The Best Action to Take: The Foundation of RL</h1> <p>So, how does a computer know what to do in a given situation? In order to answer that question, we need to set the stage and define the terminology of reinforcement learning.</p> <p>A state $s_t$ defines what the player and/or environment looks like at time $t$. For example, $s_t$ might describe the position the player is at, or where the enemies are located. In this state, there is a set of actions that the player can take. Go left? Go right? Jump? Each action that a player can take is denoted as $a$.</p> <p>If we are in state $s$ and decide to take action $a$, we are taken to some state $s’$ (which could, generally speaking, be the same state $s$). The mathematical description of what takes us from state $s$ to state $s’$ under action $a$ is called the <em>model</em>. In some sense, one can think of the model as the equations of motion. If I am sitting in a car at rest (state $s$) and put the pedal to the metal (action $a$), then my car moves forward and increases its speed (new state $s’$).</p> <p>In general, there is some probability of entering state $s’$ from state $s$ under action $a$. This probability can be described as:</p> <p>\begin{align} \label{eq:transition_noreward} p(s’|s,a) \end{align}</p> <p>For those that might be unfamiliar with this notation, the above notation says: given some current state $s$ and action $a$ (terms to the right of the bar), what is the probability of transitioning to state $s’$ (terms to the left of the bar)?
These transition probabilities describe a Markov Decision Process (MDP). An MDP is much like a Markov chain, except the probabilities are also determined by an action that we (or a computer that plays for us) must choose. From a state $s$, choosing an action $a$ determines the probabilities and states $s’$ to which our system may transition.</p> <p>Usually, we choose an action based on our current state $s$. And, just like humans, RL algorithms keep track of a strategy, or <em>policy</em>, that they use when they encounter a state. This policy is denoted by $\pi(a|s)$, which is the probability of choosing action $a$ given that we are in the current state $s$.</p> <p>Aside: one question you might have is why we bother to define probabilities from state-action $(s,a)$ to new state $s’$, $p(s’|s,a)$, and probabilities from state to action, $\pi(a|s)$ separately; why not combine them as $\sum_a p(s’|s,a)\pi(a|s) = p(s’|s)$, summing over all possible actions from state $s$ to state $s’$ and eliminating this action variable? Well for one, usually only a single action $a$ can take you from $s$ to $s’$, so that sum reduces to a single term. But the bigger idea is that these two probabilities describe two very different things. The transition probabilities $p(s’|s,a)$ describe how the world works, and are largely out of our control. What actions we take, however, are in our control. In RL, finding the best policy $\pi(a|s)$ is the goal, or in other words, RL seeks to find the best action to take for any state $s$ we might find ourselves in.</p> <p>The last piece of the puzzle is telling the computer when it does something good, and when it does something bad. This is done through the use of <em>rewards</em>. Rewards are typically defined by certain states that the system can achieve. For Mario, his goal is to reach the end of the level, so when he reaches the end, we might give him a positive reward. If Mario hits a goomba or falls in a hole, we might give him a negative reward.
Mario would then associate the actions that led him to his most recent outcome with a positive reward (if he completed the level) or a negative reward (if he died). This would help him update his policy to achieve a higher reward or avoid a negative reward.</p> <p>Rewards are incorporated into the transition probabilities defined above, so Eq. \ref{eq:transition_noreward} is altered only slightly:</p> <p>\begin{align} \label{eq:trans_prob} p(s’,r|s,a) \end{align}</p> <p>Eq. \ref{eq:trans_prob} is read as the probability of transitioning to state $s’$ and receiving reward $r$, starting in state $s$ and applying action $a$. As mentioned before, rewards are usually associated with states $s’$. Positive rewards are associated with desired states (e.g. goals in a game) and negative rewards are associated with undesired states (e.g. hitting a goomba). Other than that general guideline, rewards can be defined as anything we might want. In general, the policy that the RL algorithm ultimately learns is dependent on how we define the rewards, and there are common cases where poorly-defined rewards cause undesired behavior. But that is something we can explore later.</p> <p>With this whole system of rewards and transitions in place, what exactly do we want the RL algorithm to do? Every time the policy chooses an action and the system transitions from one state to another, the computer gets some reward. Denote the reward received at time step $t$ by $R_t$; the total reward accumulated over an episode is then dependent on the starting state and the subsequent actions that we took.</p> <p>Suppose we are at time $t$ in the current game. If we are trying to decide what action to take next, one logical thing to do would be to try to look into the future and <em>maximize our future rewards</em>.
The return $G_t$ is simply the sum of all future rewards:</p> <p>\begin{align} \label{eq:return_nodiscount} G_t = R_{t+1} + R_{t+2} + … + R_{T} \end{align}</p> <p>Here, $T$ is the maximum time of playing. Because each of these $R_{t+i}$ future rewards are generally stochastic variables, we would like to <em>maximize the expected return</em>. If there is no maximum time of playing $T$, the sum is infinite and there is sometimes a risk of a reward accumulating to infinity. In this case, the maximum expected return would also be infinite. Since infinities are usually difficult to work with mathematically, all future rewards from current time $t$ are “discounted” by a factor $\gamma$. The discounted return is defined by:</p> <p>\begin{align} \label{eq:discounted_return} G_t = R_{t+1} + \gamma R_{t+2} + … = \sum_{k=0}^\infty\gamma^kR_{t+k+1} \end{align}</p> <p>In order for this infinite sum to remain finite, the discount factor $\gamma\in[0,1)$. One can think of $\gamma$ as a factor that reduces the importance of rewards the farther they lie beyond the current state. If $\gamma=0$, then our strategy to maximize $G_t$ would be to take actions that immediately give us positive reward $r$. If $\gamma$ is close to $1$, then our strategy would look at long-term rewards and perhaps would take a negative reward now for the chance of a bigger positive reward later.</p> <p>Lastly, let’s discuss value functions and action-value functions. A value function $v_\pi(s)$ is simply the expected value for the return, given a policy $\pi(a|s)$, starting in state $s$:</p> <p>\begin{align} v_\pi(s) = E_\pi [ G_t | S_t=s ] = E_\pi \left[ \sum _{k=0}^\infty \gamma^k R _{t+k+1}|S_t=s \right] \end{align}</p> <p>Here, $E[ \cdot ]$ is the expectation value of a stochastic variable.
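</p> <p>Both of these definitions are easy to make concrete. The sketch below (my own toy example, with an invented one-state chain, not code from any library) computes the discounted return of a reward sequence and then approximates a value function by averaging returns over many simulated episodes:</p>

```python
import random

GAMMA = 0.9

def discounted_return(rewards):
    # G = R_1 + gamma*R_2 + gamma^2*R_3 + ...
    return sum((GAMMA ** k) * r for k, r in enumerate(rewards))

# Invented toy chain: at each step the episode ends with reward +1
# with probability 0.5; otherwise it continues with reward 0.
def episode(max_steps=100):
    rewards = []
    for _ in range(max_steps):
        if random.random() < 0.5:
            rewards.append(1.0)
            break
        rewards.append(0.0)
    return rewards

# The expectation is approximated by a plain average over episodes;
# analytically it is 0.5 / (1 - 0.5 * 0.9), about 0.909, for this chain.
random.seed(0)
returns = [discounted_return(episode()) for _ in range(100_000)]
v_hat = sum(returns) / len(returns)
```

<p>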
For anyone unfamiliar, the expectation value is exactly the same as the average value: if we start in state $s$ a million times and use policy $\pi$ each time, what will be the average return that we get? As you might guess, getting this value (or at least a good estimate of this value) involves the state transition probabilities.</p> <p>The action-value function is defined in a similar way, except it describes the expected return if we start in state $s$, take action $a$, and then afterwards always follow the same policy $\pi$:</p> <p>\begin{align} q_\pi(s,a) = E_\pi [G_t | S_t=s, A_t=a ] = E_\pi\left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t=s, A_t=a\right] \end{align}</p> <p>Note that $v_\pi(s) = q_\pi(s,\pi(s))$.</p> <p>The value function and the action-value function are essentially relative measures of the policy $\pi$. If $\pi$ is a terrible policy, then the value function will be small (probably zero or even negative) values for all states $s\in S$. There is at least one policy that is as good as or better than all the rest, producing the maximum value function and action-value function. In RL, the goal then is to find the policy that maximizes the value function for all state values:</p> <p>\begin{align} v_*(s) = \text{max}_{\pi} v_\pi(s) \end{align}</p> <p>Optimal policies also share the same optimal action-value function:</p> <p>\begin{align} q_*(s,a) = \text{max}_{\pi}q_\pi(s,a) \end{align}</p> <p>Our goal for this post is to find the optimal policy, given transition probabilities $p(s’,r|s,a)$ and reward values $r$. The two standard ways of doing so are with value iteration and policy iteration.</p> <h1 id="value-iteration">Value Iteration</h1> <p>We’ll first start with value iteration, as I believe it is the easier of the two to understand conceptually.
I’ll show the algorithm, then step through the first several iterations:</p> <ol> <li>Start with model $p(s’,r|s,a)$</li> <li>Initialize value function $v(s)=0$ for all states $s$</li> <li>Initialize new value function $v’(s)=0$ to be used in do-while loop</li> <li>do:</li> <li>  for $s\in S$:</li> <li>    $v(s) = v’(s)$</li> <li>    $v’(s) = \text{max}_a \sum _{s’\in S} P(s’,r|s,a)\left( r + \gamma v(s’) \right)$</li> <li>while $|v’(s) - v(s)|_1 &gt; \text{small number}$</li> <li>return $v(s)$</li> </ol> <p>In order to show what this algorithm does, let’s look at a simple example. Below is a 1D world comprised of seven squares. Our goal is to reach the rightmost square, and avoid the leftmost square. If we reach either, the game is over. In accordance with this goal, we set a reward of +10 if we reach the right square, and a reward of -1 if we reach the left square. All other state-action pairs result in a reward of zero. For our small world, let $\gamma=1$.</p> <p>Inside each square is the value for $v(s)$, which we initialize to zero everywhere:</p> <p><img src="/images/blog_pics/RL1/fig0.png" alt="Initial Value Function" /></p> <p>Our action space is simple: we can move left, move right or stand still. Every time we take an action, we are certain to complete that action. This means that, for our model, our probabilities are all ones or zeros. For example, $p(s’=4,r=0|s=3,a=\text{right})=1$, and $p(s’=7,r=10|s=2,a=\text{right})=0$.</p> <p>Now let’s go through the algorithm. We already initialized our value function to zero for all states (step 2), so now we enter the do-while loop. For all states, look at each action and the associated reward $r$ and discounted value $\gamma v(s’)$.
For the first iteration of this loop, this simply returns the rewards associated with the two end-game states:</p> <p><img src="/images/blog_pics/RL1/fig1.png" alt="Value Function, First Iteration" /></p> <p>Line 8 of the algorithm simply says look at the biggest change in the value function. If the biggest change is bigger than some small number, then continue the loop. In our case, square 7 saw a change of +10. So we continue the loop.</p> <p>Two quick notes:</p> <ul> <li>Rewards for states 1 and 7 are given only when you <em>leave</em> a state. So no matter what action is taken in those states, you get the same reward and the episode is ended.</li> <li>Because states 1 and 7 are end-game states, their values cannot change after this first update - we can’t get any more reward, because the game has ended! So we’ll focus on the other states and see how they change.</li> </ul> <p>Below is a list of how the value of each state is updated on the second iteration:</p> <ul> <li>State 2: maximum action = stay, $v’(s=2) = r + \gamma v(2) = 0$</li> <li>State 3: maximum action = stay, $v’(s=3) = r + \gamma v(3) = 0$</li> <li>State 4: maximum action = stay, $v’(s=4) = r + \gamma v(4) = 0$</li> <li>State 5: maximum action = stay, $v’(s=5) = r + \gamma v(5) = 0$</li> <li>State 6: maximum action = <strong>right</strong>, $v’(s=6) = r + \gamma v(7) = 10$</li> </ul> <p>Now the value function looks like so:</p> <p><img src="/images/blog_pics/RL1/fig2.png" alt="Value Function, Second Iteration" /></p> <p>Let’s go ahead and do the third iteration:</p> <ul> <li>State 2: maximum action = stay, $v’(s=2) = r + \gamma v(2) = 0$</li> <li>State 3: maximum action = stay, $v’(s=3) = r + \gamma v(3) = 0$</li> <li>State 4: maximum action = stay, $v’(s=4) = r + \gamma v(4) = 0$</li> <li>State 5: maximum action = <strong>right</strong>, $v’(s=5) = r + \gamma v(6) = 10$</li> <li>State 6: maximum action = <strong>right</strong>, $v’(s=6) = r + \gamma v(7) = 10$</li> </ul> <p>The value function is now:</p>
<p><img src="/images/blog_pics/RL1/fig3.png" alt="Value Function, Third Iteration" /></p> <p>Notice how information of state seven’s reward of +10 propagates backwards to the other states. For this reason, line 7 in the above algorithm is sometimes called the Bellman Backup Operation. Every iteration, the value function we’re computing gets closer to the optimal solution, and it does so by ‘‘backing up’’ the information of rewards to all other states.</p> <p>The program terminates on the optimal value function:</p> <p><img src="/images/blog_pics/RL1/fig4.png" alt="Value Function, Final" /></p> <p>The optimal value function above shows that every state can achieve a high reward. Given this value function, what is the optimal policy we should follow? That is simple: choose the action that maximizes $r + \gamma v_*(s’)$. In this example, though, you may notice a problem: there are states where what action to take is ambiguous. For example, both state three and state five have a value of +10, so a policy trying to learn about state four has to deal with this ambiguity.</p> <p>One simple way to address this problem is by setting the discount factor $0 \le \gamma \lt 1$. Suppose we perform the same algorithm outlined above, but instead set $\gamma = 0.9$. The resulting optimal value function would look like:</p> <p><img src="/images/blog_pics/RL1/fig7.png" alt="Value Function, Final + Discounted" /></p> <p>Now it is clear where to go no matter what state we’re in - simply follow the path of increasing reward.</p> <p>There is a second iterative method for finding the optimal policy, called policy iteration.</p> <h1 id="policy-iteration">Policy Iteration</h1> <p>As the name suggests, policy iteration iterates through policies until it converges on the optimal policy.
Below is the algorithm:</p> <ol> <li>Start with model $p(s’,r|s,a)$</li> <li>Initialize policy $\pi(a|s)$ randomly for all states $s$</li> <li>Initialize current value function $v(s)=0$ for all states</li> <li>do:</li> <li>  for $s\in S$:</li> <li>    $v(s) = v’(s)$</li> <li>    $v’(s) = \sum _{s’\in S, a \in A} P(s’,r|s,a)\pi(a|s)\left( r + \gamma v(s’) \right)$</li> <li>while $|v’(s) - v(s)|_1 &gt; \text{small number}$</li> <li>Initialize new policy $\pi’(a|s)$</li> <li>for $s\in S$:</li> <li>  $\pi’(s) = \text{argmax}_a \sum _{s’} p(s’,r|s,a)\left( r + \gamma v(s’) \right)$</li> <li>if $\pi’ \neq \pi$:</li> <li>  go to line 4</li> <li>return $\pi(a|s)$</li> </ol> <p>The idea is this: first, initialize a random policy. Then, find the value function for that policy (lines 4-8). With the value function, construct a new policy $\pi’(a|s)$ that maximizes $r + \gamma v(s’)$, i.e. immediate reward (by taking some action) plus discounted future rewards (by following old policy $\pi$ thereafter). If $\pi’ = \pi$, then you have found the optimal policy. If not, then return to line 4 with this new policy to relearn its value function.</p> <p>This process has an extra step compared to value iteration, so it might be a little more confusing, but it isn’t too bad. To illustrate how this works, let’s go back to the 1D world, but instead let’s find the optimal policy using policy iteration. First, we start with an initial policy. In practice, this is usually randomized, but for our example let’s suppose we start with:</p> <p>\begin{equation} \pi(a=\text{left}|s)_0 = 1 \quad \forall s \end{equation}</p> <p>In English, this means that no matter what state we are in, we choose to move to the left.
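</p> <p>Evaluating this always-left policy on the 1D world can be sketched in a few lines (standalone code of my own, with squares indexed 0-6 instead of 1-7):</p>

```python
import numpy as np

GAMMA = 1.0
N = 7                           # squares 0-6; squares 0 and 6 end the game
TERMINAL = {0: -1.0, 6: 10.0}   # one-time reward for leaving an end square

v = np.zeros(N)
for _ in range(100):
    v_new = np.zeros(N)
    for s in range(N):
        if s in TERMINAL:
            v_new[s] = TERMINAL[s]       # episode ends here
        else:
            v_new[s] = GAMMA * v[s - 1]  # deterministic left step, r = 0
    if np.max(np.abs(v_new - v)) < 1e-9:  # value function converged
        break
    v = v_new

# v converges to [-1, -1, -1, -1, -1, -1, 10]
```

<p>Under this policy, every square except the goal itself ends up with value -1.</p> <p>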
Lines 4-8 of the policy iteration algorithm simply converge to the value function for this policy, which is:</p> <p><img src="/images/blog_pics/RL1/fig5.png" alt="Value Function, Final" /></p> <p>Unless we are already in state 7, we expect to get a reward of -1 for all states since we’re always moving to the left. This is terrible! But lines 9-11 iterate through each state and ask if there is a better immediate action that can be taken before following this policy. It will find:</p> <ul> <li>State 2: maximum action = left, $r + \gamma v(1) = -1$</li> <li>State 3: maximum action = left, $r + \gamma v(2) = -1$</li> <li>State 4: maximum action = left, $r + \gamma v(3) = -1$</li> <li>State 5: maximum action = left, $r + \gamma v(4) = -1$</li> <li>State 6: maximum action = <strong>right</strong>, $r + \gamma v(7) = 10$</li> </ul> <p>When the algorithm investigates state 6, it finds that going right is a better action to take than the current policy of going left. So now our policy is:</p> <p>\begin{equation*} \pi(a=\text{left}|s)_1 = 1, \quad s = 1-5,7 \end{equation*}</p> <p>\begin{equation*} \pi(a=\text{right}|s)_1 = 1, \quad s = 6 \end{equation*}</p> <p>The policy is now to move right in state 6, and move left in all other states. Since our policy changed, line 12 says to go back and find the value function of this new policy. Unsurprisingly, this new value function will look like:</p> <p><img src="/images/blog_pics/RL1/fig6.png" alt="Value Function, Final" /></p> <p>Hopefully you see where this is going.
Again, information about the goal at state 7 is propagated back to all the other states, the optimal policy is to always move right, and the optimal value function will look like:</p> <p><img src="/images/blog_pics/RL1/fig7.png" alt="Value Function, Final" /></p> <h1 id="setting-up-openai-gym">Setting Up OpenAI Gym</h1> <p>Now that we have talked about the basics of RL and two algorithms that we can use, let’s use these techniques on a slightly more complex example.</p> <p><a href="https://gym.openai.com/">OpenAI Gym</a> is a pretty cool python package that provides ready-to-use environments to test RL algorithms on. As a python package, it is pretty easy to install:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>gym </code></pre></div></div> <p>They have all sorts of environments to play around in, and I encourage you to see all that it has to offer. But for this post, we’re going to use the <a href="https://gym.openai.com/envs/FrozenLake-v0/">frozen lake</a> environment. This is a 2D grid of squares, and the goal is to start in the upper left corner and reach the lower right corner while avoiding some squares (‘‘holes’’ in the lake). The rest of this post will show code that uses value iteration and policy iteration to find the optimal policy to get to the goal while avoiding the holes.</p> <p><a href="https://github.com/jacobhiggins/jacobhiggins.github.io/tree/master/files/blog_files/RL1">Here</a> is a link that contains starter code (for anyone who wishes to try this themselves), as well as completed code. The starter code was taken from <a href="http://web.stanford.edu/class/cs234/assignment1/index.html">Stanford’s RL Class</a>.</p> <p>So, let’s get coding.</p> <h1 id="frozen-lake-policy-iteration">Frozen Lake: Policy Iteration</h1> <p>Inside the starter code folder, take a look at the vi_and_pi.py file; this is where you’ll add the code to make everything run.
The main function contains a line that accesses the OpenAI library and loads the frozen lake environment:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">env</span> <span class="o">=</span> <span class="n">gym</span><span class="p">.</span><span class="n">make</span><span class="p">(</span><span class="s">"Deterministic-4x4-FrozenLake-v0"</span><span class="p">)</span> </code></pre></div></div> <p>In this environment, states and actions are denoted using integers. The 4x4 frozen lake defined above references its states as numbers 0 - 15, and all possible actions as 0 - 3 (left, down, right, up).</p> <p>The env variable contains all the information needed for RL. Specifically, some of its fields include:</p> <ul> <li>env.P: a nested dictionary that describes the model of the frozen lake, where P[state][action] returns a list of tuples. Each tuple has the same form: probability, next state, reward, terminal. These describe: <ul> <li>Probability: probability of transition to next state</li> <li>Next state: The possible next state described by the tuple</li> <li>Reward: The reward gained from this state-action pair</li> <li>Terminal: True when a terminal state is reached (i.e. hole or goal), False otherwise</li> </ul> </li> <li>env.nS: the number of total states</li> <li>env.nA: the number of total actions</li> </ul> <p>The file contains two functions called policy_iteration and value_iteration. These functions take in a frozen lake environment and perform policy iteration or value iteration until they converge to the optimal policy/value function, or the maximum number of iterations is reached.</p> <p>Let us first look at policy iteration. To aid the coding process, the starter code also provides empty functions policy_evaluation and policy_improvement that are to be used when performing policy iteration.
The policy_evaluation function returns the value function of the current policy (lines 4-8 of the policy iteration algorithm) and policy_improvement returns the improved policy using this value function (lines 9-11). Using these functions, here’s one way to fill policy_iteration:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">policy_iteration</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="n">nS</span><span class="p">,</span> <span class="n">nA</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">tol</span><span class="o">=</span><span class="mf">10e-3</span><span class="p">):</span> <span class="n">value_function</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nS</span><span class="p">)</span> <span class="n">policy</span> <span class="o">=</span> <span class="mi">0</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">nS</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span> <span class="c1">############################ </span> <span class="c1"># YOUR IMPLEMENTATION HERE # </span> <span class="n">flag</span> <span class="o">=</span> <span class="bp">True</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">while</span> <span class="n">flag</span> <span class="ow">and</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">100</span><span class="p">:</span> <span class="n">value_function</span> <span class="o">=</span> <span class="n">policy_evaluation</span><span class="p">(</span><span
class="n">P</span><span class="p">,</span> <span class="n">nS</span><span class="p">,</span> <span class="n">nA</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">tol</span><span class="p">)</span> <span class="n">new_policy</span> <span class="o">=</span> <span class="n">policy_improvement</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="n">nS</span><span class="p">,</span> <span class="n">nA</span><span class="p">,</span> <span class="n">value_function</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">gamma</span><span class="p">)</span> <span class="n">diff_policy</span> <span class="o">=</span> <span class="n">new_policy</span><span class="o">-</span><span class="n">policy</span> <span class="k">if</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">diff_policy</span><span class="p">)</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span> <span class="n">flag</span> <span class="o">=</span> <span class="bp">False</span> <span class="n">policy</span> <span class="o">=</span> <span class="n">new_policy</span> <span class="n">i</span><span class="o">+=</span><span class="mi">1</span> <span class="k">if</span><span class="p">(</span><span class="n">i</span><span class="o">==</span><span class="mi">100</span><span class="p">):</span> <span class="k">print</span><span class="p">(</span><span class="s">"Policy iteration never converged. 
Exiting code."</span><span class="p">)</span> <span class="nb">exit</span><span class="p">()</span> <span class="c1">############################ </span> <span class="k">return</span> <span class="n">value_function</span><span class="p">,</span> <span class="n">policy</span> </code></pre></div></div> <p>The above code first initializes the policy as all zeros for each state (i.e. always move to the left). It then finds the value function for this policy and finds improvement to this policy. If the improved policy is the same as the previous policy, then we have found an optimal policy and we exit the function; else, repeat the process.</p> <p>Below are the completed policy_evaluation and policy_improvement functions.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">policy_evaluation</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="n">nS</span><span class="p">,</span> <span class="n">nA</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">tol</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">):</span> <span class="n">value_function</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nS</span><span class="p">)</span> <span class="c1">############################ </span> <span class="c1"># YOUR IMPLEMENTATION HERE # </span> <span class="n">error</span> <span class="o">=</span> <span class="mi">1</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># While error in value function is greater than 1 </span> <span class="k">while</span> <span class="n">error</span> <span class="o">&gt;</span> <span class="n">tol</span> 
<span class="ow">and</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">100</span><span class="p">:</span> <span class="n">new_value_function</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nS</span><span class="p">)</span> <span class="c1"># For each state </span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nS</span><span class="p">):</span> <span class="c1"># Get policy for that state </span> <span class="n">a</span> <span class="o">=</span> <span class="n">policy</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="c1"># With this policy, get next state </span> <span class="c1"># probability, nextState, reward, terminal = P[s][a] </span> <span class="c1"># value_function </span> <span class="c1"># Find all possible transitions, rewards, etc. 
</span> <span class="n">transitions</span> <span class="o">=</span> <span class="n">P</span><span class="p">[</span><span class="n">s</span><span class="p">][</span><span class="n">a</span><span class="p">]</span> <span class="k">for</span> <span class="n">transition</span> <span class="ow">in</span> <span class="n">transitions</span><span class="p">:</span> <span class="n">prob</span><span class="p">,</span> <span class="n">nextS</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">term</span> <span class="o">=</span> <span class="n">transition</span> <span class="c1"># Calculate updated value function </span> <span class="n">new_value_function</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">+=</span> <span class="n">prob</span><span class="o">*</span><span class="p">(</span><span class="n">reward</span> <span class="o">+</span> <span class="n">gamma</span><span class="o">*</span><span class="n">value_function</span><span class="p">[</span><span class="n">nextS</span><span class="p">])</span> <span class="n">error</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">new_value_function</span> <span class="o">-</span> <span class="n">value_function</span><span class="p">))</span> <span class="c1"># Find greatest difference in new and old value function </span> <span class="c1"># print(new_value_function) </span> <span class="c1"># print("error: {}".format(error)) </span> <span class="n">value_function</span> <span class="o">=</span> <span class="n">new_value_function</span> <span class="n">i</span><span class="o">+=</span><span class="mi">1</span> <span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">100</span><span class="p">:</span> 
<span class="k">print</span><span class="p">(</span><span class="s">"Policy evaluation never converged. Exiting code."</span><span class="p">)</span> <span class="nb">exit</span><span class="p">()</span> <span class="c1">############################ </span> <span class="k">return</span> <span class="n">value_function</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">policy_improvement</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="n">nS</span><span class="p">,</span> <span class="n">nA</span><span class="p">,</span> <span class="n">value_from_policy</span><span class="p">,</span> <span class="n">policy</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">0.9</span><span class="p">):</span> <span class="n">new_policy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nS</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'int'</span><span class="p">)</span> <span class="c1">############################ </span> <span class="c1"># YOUR IMPLEMENTATION HERE # </span> <span class="c1"># For each state </span> <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nS</span><span class="p">):</span> <span class="c1"># Get optimal action </span> <span class="c1"># If ties for optimal exist, choose the first </span> <span class="n">Qs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nA</span><span class="p">)</span> <span class="c1"># For each action </span> <span class="k">for</span> <span class="n">a</span> <span
class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nA</span><span class="p">):</span> <span class="c1"># All possible next states from this state-action pair </span> <span class="n">transitions</span> <span class="o">=</span> <span class="n">P</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">a</span><span class="p">]</span> <span class="k">for</span> <span class="n">transition</span> <span class="ow">in</span> <span class="n">transitions</span><span class="p">:</span> <span class="n">prob</span><span class="p">,</span> <span class="n">nextS</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">term</span> <span class="o">=</span> <span class="n">transition</span> <span class="n">Qs</span><span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">+=</span> <span class="n">prob</span><span class="o">*</span><span class="p">(</span><span class="n">reward</span> <span class="o">+</span> <span class="n">gamma</span><span class="o">*</span><span class="n">value_from_policy</span><span class="p">[</span><span class="n">nextS</span><span class="p">])</span> <span class="c1"># For this state </span> <span class="c1"># get maximum Q </span> <span class="n">max_as</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">Qs</span><span class="o">==</span><span class="n">Qs</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span> <span class="n">max_as</span> <span class="o">=</span> <span class="n">max_as</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># Set new policy to this action that maximizes Q </span> <span class="n">new_policy</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span 
class="n">max_as</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1">############################ </span> <span class="k">return</span> <span class="n">new_policy</span> </code></pre></div></div> <p>When you run this code, each iteration will produce a policy and value function that converge to the optimum. Below is a visualization of how the value function evolves with each iteration:</p> <p><img src="/images/blog_pics/RL1/pi_vfs.gif" alt="Policy Iteration Value Function" /></p> <p>Again, information about the goal (lower-right corner) propagates back to all the other states with each iteration.</p> <p>The optimal policy essentially follows increasing rewards. In the starter code you always begin at the top-left corner of the frozen lake, so from there the optimal policy corresponds to always moving towards light-colored squares. After policy iteration converges, you should see a sample run of your program that successfully navigates from start to finish.</p> <h1 id="frozen-lake-value-iteration">Frozen Lake: Value Iteration</h1> <p>The code for value iteration is similar to policy iteration. 
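As a recap of the policy-iteration structure, here is a minimal, self-contained sketch of the outer loop on a toy two-state MDP. The MDP and the pared-down helpers are illustrative assumptions, not the starter code, but the transitions use the same gym-style format P[s][a] = [(prob, next_state, reward, terminal), ...]:

```python
# Minimal policy-iteration sketch on a toy two-state MDP. The MDP and the
# pared-down helpers are illustrative assumptions, not the starter code;
# transitions use the gym-style format:
# P[s][a] = [(prob, next_state, reward, terminal), ...]
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
nS, nA, gamma = 2, 2, 0.9

def policy_evaluation(P, nS, policy, gamma, tol=1e-8):
    # Iterate the Bellman expectation backup until the value function settles.
    V = [0.0] * nS
    while True:
        V_new = [sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][policy[s]])
                 for s in range(nS)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new

def policy_improvement(P, nS, nA, V, gamma):
    # Act greedily with respect to V; ties go to the first maximizing action.
    new_policy = []
    for s in range(nS):
        Qs = [sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
              for a in range(nA)]
        new_policy.append(Qs.index(max(Qs)))
    return new_policy

# Outer loop: evaluate, improve, repeat until the policy stops changing.
policy = [0] * nS
while True:
    V = policy_evaluation(P, nS, policy, gamma)
    new_policy = policy_improvement(P, nS, nA, V, gamma)
    if new_policy == policy:
        break
    policy = new_policy
print(policy, V)  # greedy action and converged value for each state
```

Note that the stopping test runs on the policy itself rather than on the value function, which is the structural difference from value iteration. 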
First, here is a look inside a completed value_iteration function:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">value_iteration</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="n">nS</span><span class="p">,</span> <span class="n">nA</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">tol</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">):</span> <span class="n">value_function</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nS</span><span class="p">)</span> <span class="n">policy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nS</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span> <span class="c1">############################ </span> <span class="c1"># YOUR IMPLEMENTATION HERE # </span> <span class="c1"># Value iteration is like policy iteration above, except estimation of the value function is done by maximizing over actions </span> <span class="c1"># After the value function converges, one final step finds the action that maximizes reward </span> <span class="n">error</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># Iterate value function, find optimal </span> <span class="k">while</span> <span class="n">error</span> <span class="o">&gt;</span> <span class="n">tol</span><span class="p">:</span> <span class="n">new_value_function</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span
class="n">nS</span><span class="p">)</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nS</span><span class="p">):</span> <span class="n">Qs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nA</span><span class="p">)</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nA</span><span class="p">):</span> <span class="n">transitions</span> <span class="o">=</span> <span class="n">P</span><span class="p">[</span><span class="n">s</span><span class="p">][</span><span class="n">a</span><span class="p">]</span> <span class="k">for</span> <span class="n">transition</span> <span class="ow">in</span> <span class="n">transitions</span><span class="p">:</span> <span class="n">prob</span><span class="p">,</span> <span class="n">nextS</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">term</span> <span class="o">=</span> <span class="n">transition</span> <span class="n">Qs</span><span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">+=</span> <span class="n">prob</span><span class="o">*</span><span class="p">(</span><span class="n">reward</span> <span class="o">+</span> <span class="n">gamma</span><span class="o">*</span><span class="n">value_function</span><span class="p">[</span><span class="n">nextS</span><span class="p">])</span> <span class="n">new_value_function</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">Qs</span><span class="p">)</span> <span class="n">diff_vf</span> <span class="o">=</span> <span class="n">new_value_function</span><span class="o">-</span><span 
class="n">value_function</span> <span class="n">value_function</span> <span class="o">=</span> <span class="n">new_value_function</span> <span class="n">error</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">diff_vf</span><span class="p">)</span> <span class="c1"># Get policy from value function </span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nS</span><span class="p">):</span> <span class="n">Qs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nA</span><span class="p">)</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nA</span><span class="p">):</span> <span class="n">transitions</span> <span class="o">=</span> <span class="n">P</span><span class="p">[</span><span class="n">s</span><span class="p">][</span><span class="n">a</span><span class="p">]</span> <span class="k">for</span> <span class="n">transition</span> <span class="ow">in</span> <span class="n">transitions</span><span class="p">:</span> <span class="n">prob</span><span class="p">,</span> <span class="n">nextS</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">term</span> <span class="o">=</span> <span class="n">transition</span> <span class="n">Qs</span><span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">+=</span> <span class="n">prob</span><span class="o">*</span><span class="p">(</span><span class="n">reward</span> <span class="o">+</span> <span class="n">gamma</span><span class="o">*</span><span class="n">value_function</span><span class="p">[</span><span 
class="n">nextS</span><span class="p">])</span> <span class="n">max_as</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">Qs</span><span class="o">==</span><span class="n">Qs</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span> <span class="n">max_as</span> <span class="o">=</span> <span class="n">max_as</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">policy</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">=</span> <span class="n">max_as</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1">############################ </span> <span class="k">return</span> <span class="n">value_function</span><span class="p">,</span> <span class="n">policy</span> </code></pre></div></div> <p>Value iteration first finds the optimal value function, and then uses it to find the optimal policy. Below is the evolution of the value function under value iteration, showing behavior similar to policy iteration above:</p> <p><img src="/images/blog_pics/RL1/vi_vfs.gif" alt="Value Iteration Value Function" /></p> <p>Notice how this value function is the same as the one found through policy iteration, showcasing how the optimal value function is <em>unique</em>.</p> <h1 id="discussion-and-conclusion">Discussion and Conclusion</h1> <p>In this post, we covered the basics of reinforcement learning, discussed policy iteration and value iteration as fundamental algorithms, and walked through a concrete example using the popular OpenAI Gym Python package.</p> <p>When I look at the above gifs, I am reminded of the <a href="https://nrsyed.com/2017/12/30/animating-the-grassfire-path-planning-algorithm/">grassfire path-planning algorithm</a>. 
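As a point of comparison, grassfire itself fits in a few lines. The sketch below runs a breadth-first wavefront outward from the goal over a small grid; the layout (0 = free, 1 = hole) and the goal cell are assumptions chosen to resemble the 4x4 lake:

```python
from collections import deque

# Grassfire sketch: propagate distance-to-goal outward from the goal cell,
# skipping holes. The grid layout and goal are illustrative assumptions.
grid = [
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
rows, cols = len(grid), len(grid[0])
goal = (3, 3)

dist = {goal: 0}          # distance-to-goal, grown one ring at a time
frontier = deque([goal])
while frontier:
    r, c = frontier.popleft()
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if (0 <= nr < rows and 0 <= nc < cols
                and grid[nr][nc] == 0 and (nr, nc) not in dist):
            dist[(nr, nc)] = dist[(r, c)] + 1
            frontier.append((nr, nc))

print(dist[(0, 0)])  # shortest step count from the top-left start to the goal
```

Each wavefront step extends the known distance-to-goal by one cell, which is exactly the goal-outward propagation visible in the value-function gifs above. 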
This is a graph-based approach that seeks to find an optimal path from any point on the graph to an end point, all while avoiding certain nodes (e.g. “holes” in a frozen lake) on the graph. In grassfire, information about the goal is propagated to adjacent nodes in an iterative fashion, just like RL. Why not use something like grassfire to find the best path through the frozen lake?</p> <p>One big difference is that RL can handle stochastic environments. Although this blog post has largely focused on deterministic systems, everything we’ve talked about can easily be generalized to settings where uncertainty plays a big part. For example, the starter code also allows you to train a policy on a stochastic process. Frozen lakes are slippery, right? In the stochastic model, taking an action up, down, left or right will only result in that action occurring with <em>some probability</em>. For example, if you choose to go down, there is a 0.33 probability that you actually go down, but also a 0.33 probability that you move left and a 0.33 probability that you move right. For a 4x4 slippery lake, value iteration finds the following value function:</p> <p><img src="/images/blog_pics/RL1/vi_vfs_slippery.gif" alt="Value Iteration Value Function on Slippery Lake" /></p> <p>One thing I wish to discuss is where reinforcement learning goes from here. Grid-worlds are nice for learning, but how is RL applied to real-world examples? The first problem you might encounter is the fact that we don’t have a good model $p(s’|s,a)$ for real-world systems. Instead of spending time painstakingly finding this model for every new system you encounter, how can an RL algorithm find the optimal policy and value function without a model? Obviously, the above algorithms can no longer be used. Instead, a lot of research has been done on model-free RL.</p> <p>Another problem that has been researched is the tendency for state-space dimensions to be <strong>huge</strong>. 
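To make the model-free idea concrete, here is a minimal tabular Q-learning sketch. The toy two-state environment, hyperparameters, and episode count are illustrative assumptions; the point is that the agent only ever sees sampled transitions, never the model $p(s'|s,a)$:

```python
import random

# Tabular Q-learning sketch on a toy two-state problem (all values assumed
# for illustration). The agent learns from sampled transitions alone.
nS, nA = 2, 2
gamma, alpha, eps = 0.9, 0.5, 0.1

def step(s, a):
    # Hidden dynamics the agent never sees directly: action 1 in state 0
    # reaches terminal state 1 with reward 1; everything else does nothing.
    if s == 0 and a == 1:
        return 1, 1.0, True
    return s, 0.0, s == 1

random.seed(0)
Q = [[0.0] * nA for _ in range(nS)]
for _ in range(500):                      # episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy exploration
        if random.random() < eps:
            a = random.randrange(nA)
        else:
            a = max(range(nA), key=lambda act: Q[s][act])
        s2, r, done = step(s, a)
        # Q-learning update toward the bootstrapped target
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

print(max(range(nA), key=lambda act: Q[0][act]))  # learned greedy action in state 0
```

Compare this update with the Bellman backups above: the expectation over next states has been replaced by a single sampled next state. 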
For example, suppose you wish to train an RL algorithm on image input. A small image is 480x640 pixels in size, and each pixel can take one of 256x256x256 (about 16.8 million) color values. Since every pixel can vary independently, the number of possible images is (256x256x256)^(480x640), an astronomically large number of states! Tackling these huge state spaces is another research direction, and so far a lot of success has been found with neural networks.</p>Jacob Higginsjdh4je@virginia.eduFirst Steps in Reinforcement Learning