Exploring Solutions of the Hamilton-Jacobi-Bellman Equation

9 minute read


Does the HJB equation really find the optimal control law?

In optimal control theory, the Hamilton-Jacobi-Bellman equation is a PDE that gives a necessary and sufficient condition for optimal control with respect to a cost function. In other words, if we can solve the HJB equation, then we find the optimal control law. In most examples detailing how the HJB equation is solved, the discussion stops as soon as the answer is found. This is usually fine, but I’ve always wondered if the answer we find is actually the optimal answer. Can we explore the optimality of HJB equation solutions using graphs?

In this blog post, I’ll walk through the basic steps required to find the optimal control law with the HJB equation. After, I’ll play around with optimal solution by graphing solutions that are perturbed by small amounts, showing graphically how these solutions optimize the cost function.

Solving the HJB equation

Suppose we define the following cost function:

\begin{equation} J(x(t),t) = h(x(T),u(T)) + \int_t^T g(x(\tau),u(\tau) d\tau \end{equation}

Here, $g(x,u)$ is a (usually positive definite) function that describes the instantaneous cost that $J(x,u)$ accrues at time $t$. Similarly, $h(x,u)$ describes the final cost at terminal time $T$. Notice how in this context, the cost function is a function of time $t$, and describes the future costs accrued until $T$. This is a consequence of Bellman’s principle of optimality. I’ll spare you the long explanation, as a lot of other people have already covered it in many different contexts. The basic idea, though, is if I want to achieve a certain state in an optimal way, I should work backwards from that desired state to any other feasible state. By finding an optimal strategy as you work backwards, this must be the optimal strategy if you were to move forward instead. The cost function is also a function of the current state $x(t)$.

For a given starting state $x_0$, future states evolve according to some differential equation:

\begin{equation} \dot{x} = f(x,u) \end{equation}

Optimal control theory seeks a control law $u := u(x)$ such that the cost $J$ is minimized. This is where the HJB equation comes into play. It says that if we find a control law that satisfies the following PDE:

\begin{equation} \label{eq:HJB} J^*_t(x,t) + \min_u( g(x,t) + J^*_x \cdot f(x,u) ) = 0, \end{equation}

then we find the optimal control law for the associated cost function. Notice that the HJB includes the optimal cost function $J^*(x,t)$. That is the value of the cost function starting at time $t$ and state $x(t)$, following the optimal control law $u = u^*(x)$. This PDE has the following boundary condition:

\begin{equation} J^*(x(T),T) = h(x(T),T) \end{equation}

Again because of the principle of optimality, our boundary condition is placed at the terminal time $T$.

In order to get any further, we have to now define our system and cost function. Let’s define the system dynamics. To keep things simple, I’ll stick with a very basic LTI one-dimensional system:

\begin{equation} \dot{x} = ax + bu \end{equation}

Now let’s define a reasonable cost function as:

\begin{equation} J(x(t),t) = 0.5kx(T) + \int_t^T cx(\tau)^2 + du(\tau)^2 d\tau \end{equation}

The coefficients $a$, $b$, $c$, $d$ and $k$ are kept general for now. With these definitions, we can start to actually solve the HJB equation.

First, lets focus on the following term:

\begin{equation*} \min_u{ g(x,t) + J^*_x \cdot f(x,u) } \end{equation*}

Replacing $g(x,t)$ and $f(x,u)$ with how they’re defined above, we get:

\begin{equation*} \min_u\left( cx^2 + du^2 + J^*_x \cdot (ax + bu) \right) \end{equation*}

Because $J^*$ is the optimal cost function, it is associated with the optimal control $u^*$ and in no way can depend on the yet-specified control $u$. Thus, the term $J^*_x$ cannot depend on $u$. Minimization is performed by finding where the derivative of the term is zero:

\begin{equation*} 2du + bJ^*_x = 0 \end{equation*}

Rearranging gives the following relationship:

\begin{equation*} \label{eq:optimal_control} u^*(x) = -\frac{b}{2d}J^*_x \end{equation*}

This is the control law that optimizes the cost, but note that it’s in terms of the cost function! To get the optimal control, we need to cost function. We do this by substituting Eq. \ref{eq:optimal_control} into Eq. \ref{eq:HJB}, we get the following PDE:

\begin{equation} J^*_t + cx^2 - \frac{b^2}{4d}(J^*_x)^2 + + axJ^*_x + cx^2 = 0 \end{equation}

Like any other PDE, one approach is to guess that the solution $J^*$ has a specific form, then check to see if that’s the solution. Once we find a solution, uniqueness theorem says that it must be the solution. We know that the final cost is $J^*(x(T),T) = 0.5kx^2(T)$, so one reasonable guess is check solutions of the following form:

\begin{equation} J^*(x,t) = 0.5K(t)x^2 \end{equation}

This way, our boundary term becomes $K(T)=k$. Performing the various partials and plugging in gives an ODE to solve:

\begin{equation} \dot{K} = \frac{b^2}{4d}K^2 - aK - c \end{equation}

Notice that the $x^2$ terms dropped out, resulting in an first order ODE that solves for one variable $K(t)$ over time. This ODE can be solved using separation of variables. First, rearrange the derivative so that the differentials are on opposite sides of the equality:

\begin{equation} \frac{dK}{\frac{b^2}{4d}K^2 - aK - c} = dt \end{equation}

Next, break up the bottom quadratic into two roots: $\frac{b^2}{4d}K^2 - aK -c = (K+z_1)(K+z_2)$. I won’t give the general expression for the roots here, since they can easily be solved in most languages. But, with this we can break up the fraction as so:

\begin{equation} \frac{dK}{\frac{b^2}{4d}K^2 - aK - c} = dK\left( \frac{1}{K+z_1}\frac{1}{K+z_2} \right) = dK\left( \frac{c_1}{K+z_1} + \frac{c_2}{K+z_2} \right) \end{equation}

The constants $c_1$ and $c_2$ are found by combining the two fractions in the final equality so we get the same denominator as the middle equality, and solving so we get the same numerator in the middle equality as well. The constants turn out to be $c_1 = 1/(z_1 - z_2)$ and $c_2 = 1/(z_2 + z_1)$. Let $c=c_2=c_1$ So the integration we perform is:

\begin{equation} \int_{K(t)}^{K(T)} \left( \frac{c}{K+z_1} - \frac{c}{K+z_2} \right) dK = \int_t^T dt \end{equation}

Doing this results in the final expression for $K(t)$:

\begin{equation} K(t) = \frac{z_2(K(T)+z_1)e^{(T-t)/c} - z_1(K(T)+z_2)}{K(T) + z_2 - (K(T) + z_1)e^{(T-t)/c}} \end{equation}

Now that we have this, our problem is solved, since we have the optimal control $J^*=0.5K(t)x^2$, and with the optimal control we can get $u^*=-\frac{b}{2d}J_x^*=-\frac{b}{2d}K(t)x(t)$.

Our problem is solved, and like I said in the introduction, this is where most problems stop. I want to go a step further and graphically motivate this is, in fact, the policy that minimizes the cost function. Let’s assume the following values for the constants we’ve been working with: $a=-10$, $b=1$, $c=0.25$, $d=0.5$ and terminal condition $K(T)=1$. Also, let us assume that $T=1$ second. First, let us see the motion of system under this control with $x(t=0)=1$:

Now let’s introduce a small perturbation to the control policy. Recall that our cost function had a terminal boundary condition $J(x(T),T) = h(x(T),T)$ at $T$ seconds into the future. When perturbing our control, we have to make sure that the perturbation respects this boundary condition. The reason is that boundary conditions are an assumption made when solving a PDE or ODE, and are required to “pin” a solution down in the infinite solution space of the differential equation. Solving the differential equations gives us an specific answer only if we provide boundary conditions. Changing the boundary conditions would mean we’re changing the problem we’re solving, and comparing those solutions would be like comparing apples and oranges.

To this end, let us choose a simple perturbation that is zero at $t=T$:

\begin{equation} K’(t) = K(t) + \epsilon (T-t) \end{equation}

The control is found using $K’(t)$ and the cost is found using the unperturbed $K(t)$. This is because we want to see if different controller is actually more optimal for the same cost function. Plotting the cost over perturbation $\epsilon$ results in the following graph:

Clearly, this cost is minimized when $\epsilon = 0$! What does the control value and motion look like over time with this perturbation? Here is a graph with $\epsilon = 2$:

Notice that by giving $K(t)$ a positive perturbation, we make the control have a bigger magnitude (more negative), but this also means that the state variable approaches $x=0$ faster. This give-and-take ultimately leads to a higher cost. The HJB gives us the cost that perfectly balances between these two competing objectives.

What if we instead perturbed the control by a more drastic function:

\begin{equation} K’(t) = K(t) + \epsilon\left[ (T-t) + (T-t)^2 + (T-t)^3 + (T-t)^4 + (T-t)^5 \right] \end{equation}

Here is the result cost over parameter $\epsilon$:

Again, we find the same thing. Instead of perturbing the function $K(t)$, what if we instead perturbed the control directly? Like so:

\begin{equation} u(x,t) = u^*(x,t) + \epsilon\left[ (T-t) + (T-t)^2 + (T-t)^3 + (T-t)^4 + (T-t)^5 \right] \end{equation}

In this way, we’re still respecting the terminal boundary condition. This perturbation produces the following graph:


These graphs are pretty cool, but one question is if this proves the HJB results in an optimal control policy. The answer is no, these graphs don’t and can’t prove optimality since there an infinite number of perturbations to try. The proof in optimality ultimately lies in the derivation of the HJB to begin with. Nevertheless, these graphs are a really nice way to building an intuition behind the results of the HJB.