Discrete Dynamic Programming

Overview

In this lecture we discuss a family of dynamic programming problems with the following features:

  1. a discrete state space and discrete choices (actions)
  2. an infinite horizon
  3. discounted rewards
  4. Markov state transitions

We call such problems discrete dynamic programs, or discrete DPs

Discrete DPs are the workhorses in much of modern quantitative economics, including

  • monetary economics
  • search and labor economics
  • household savings and consumption theory
  • investment theory
  • asset pricing
  • industrial organization, etc.

When a given model is not inherently discrete, it is common to replace it with a discretized version in order to use discrete DP techniques

This lecture covers

  • the theory of dynamic programming in a discrete setting, plus examples and applications
  • a powerful set of routines for solving discrete DPs from the QuantEcon code libary

How to Read this Lecture

We have used dynamic programming in a number of earlier lectures, such as

Here we shift to a more systematic and theoretical treatment, including algorithms and implementation

The code discussed below was authored primarily by Daisuke Oyama

References

For background reading on dynamic programming and additional applications, see, for example,

Discrete DPs

Loosely speaking, a discrete DP is a maximization problem with an objective function of the form

(1)\[\mathbb{E} \sum_{t = 0}^{\infty} \beta^t r(s_t, a_t)\]

where

  • \(s_t\) is the state variable
  • \(a_t\) is the action
  • \(\beta\) is a discount factor
  • \(r(s_t, a_t)\) is interpreted as a current reward when the state is \(s_t\) and the action chosen is \(a_t\)

Each pair \((s_t, a_t)\) pins down transition probabilities \(Q(s_t, a_t, s_{t+1})\) for the next period state \(s_{t+1}\)

Thus, actions influence not only current rewards but also the future time path of the state

The essence of dynamic programming problems is to trade off current rewards vs favorable positioning of the future state (modulo randomness)

Examples:

  • consuming today vs saving and accumulating assets
  • accepting a job offer today vs seeking a better one in the future
  • exercising an option now vs waiting

Policies

The most fruitful way to think about solutions to discrete DP problems is to compare policies

In general, a policy is a randomized map from past actions and states to current action

In the setting formalized below, it suffices to consider so-called stationary Markov policies, which consider only the current state

In particular, a stationary Markov policy is a map \(\sigma\) from states to actions

  • \(a_t = \sigma(s_t)\) indicates that \(a_t\) is the action to be taken in state \(s_t\)

It is known that, for any arbitrary policy, there exists a stationary Markov policy that dominates it at least weakly

  • See section 5.5 of [Put05] for discussion and proofs

In what follows, stationary Markov policies are referred to simply as policies

The aim is to find an optimal policy, in the sense of one that maximizes (1)

Let’s now step through these ideas more carefully

Formal definition

Formally, a discrete dynamic program consists of the following components:

  1. A finite set of states \(S = \{0, \ldots, n-1\}\)
  2. A finite set of feasible actions \(A(s)\) for each state \(s \in S\), and a corresponding set
\[ \mathit{SA} := \{(s, a) \mid s \in S, \; a \in A(s)\} of *feasible state-action pairs*\]
  1. A reward function \(r\colon \mathit{SA} \to \mathbb{R}\)
  2. A transition probability function \(Q\colon \mathit{SA} \to \Delta(S)\), where \(\Delta(S)\) is the set of probability distributions over \(S\)
  3. A discount factor \(\beta \in [0, 1)\)

We also use the notation \(A := \bigcup_{s \in S} A(s) = \{0, \ldots, m-1\}\) and call this set the action space

A policy is a function \(\sigma\colon S \to A\)

A policy is called feasible if it satisfies \(\sigma(s) \in A(s)\) for all \(s \in S\)

Denote the set of all feasible policies by \(\Sigma\)

If a decision maker uses a policy \(\sigma \in \Sigma\), then

  • the current reward at time \(t\) is \(r(s_t, \sigma(s_t))\)
  • the probability that \(s_{t+1} = s'\) is \(Q(s_t, \sigma(s_t), s')\)

For each \(\sigma \in \Sigma\), define

  • \(r_{\sigma}\) by \(r_{\sigma}(s) := r(s, \sigma(s))\))
  • \(Q_{\sigma}\) by \(Q_{\sigma}(s, s') := Q(s, \sigma(s), s')\)

Notice that \(Q_\sigma\) is a stochastic matrix on \(S\)

It gives transition probabilities of the controlled chain when we follow policy \(\sigma\)

If we think of \(r_\sigma\) as a column vector, then so is \(Q_\sigma^t r_\sigma\), and the \(s\)-th row of the latter has the interpretation

(2)\[(Q_\sigma^t r_\sigma)(s) = \mathbb E [ r(s_t, \sigma(s_t)) \mid s_0 = s ] \quad \text{when } \{s_t\} \sim Q_\sigma\]

Comments

  • \(\{s_t\} \sim Q_\sigma\) means that the state is generated by stochastic matrix \(Q_\sigma\)
  • See this discussion on computing expectations of Markov chains for an explanation of the expression in (2)

Notice that we’re not really distinguishing between functions from \(S\) to \(\mathbb R\) and vectors in \(\mathbb R^n\)

This is natural because they are in one to one correspondence

Value and Optimality

Let \(v_{\sigma}(s)\) denote the discounted sum of expected reward flows from policy \(\sigma\) when the initial state is \(s\)

To calculate this quantity we pass the expectation through the sum in (1) and use (2) to get

\[v_{\sigma}(s) = \sum_{t=0}^{\infty} \beta^t (Q_{\sigma}^t r_{\sigma})(s) \qquad (s \in S)\]

This function is called the policy value function for the policy \(\sigma\)

The optimal value function, or simply value function, is the function \(v^*\colon S \to \mathbb{R}\) defined by

\[v^*(s) = \max_{\sigma \in \Sigma} v_{\sigma}(s) \qquad (s \in S)\]

(We can use max rather than sup here because the domain is a finite set)

A policy \(\sigma \in \Sigma\) is called optimal if \(v_{\sigma}(s) = v^*(s)\) for all \(s \in S\)

Given any \(w \colon S \to \mathbb R\), a policy \(\sigma \in \Sigma\) is called \(w\)-greedy if

\[\sigma(s) \in \operatorname*{arg\,max}_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} w(s') Q(s, a, s') \right\} \qquad (s \in S)\]

As discussed in detail below, optimal policies are precisely those that are \(v^*\)-greedy

Two Operators

It is useful to define the following operators:

  • The Bellman operator \(T\colon \mathbb{R}^S \to \mathbb{R}^S\) is defined by
\[(T v)(s) = \max_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} v(s') Q(s, a, s') \right\} \qquad (s \in S)\]
  • For any policy function \(\sigma \in \Sigma\), the operator \(T_{\sigma}\colon \mathbb{R}^S \to \mathbb{R}^S\) is defined by
\[(T_{\sigma} v)(s) = r(s, \sigma(s)) + \beta \sum_{s' \in S} v(s') Q(s, \sigma(s), s') \qquad (s \in S)\]

This can be written more succinctly in operator notation as

\[T_{\sigma} v = r_{\sigma} + \beta Q_{\sigma} v\]

The two operators are both monotone

  • \(v \leq w\) implies \(Tv \leq Tw\) pointwise on \(S\), and similarly for \(T_\sigma\)

They are also contraction mappings with modulus \(\beta\)

  • \(\lVert Tv - Tw \rVert \leq \beta \lVert v - w \rVert\) and similarly for \(T_\sigma\), where \(\lVert \cdot\rVert\) is the max norm

For any policy \(\sigma\), its value \(v_{\sigma}\) is the unique fixed point of \(T_{\sigma}\)

For proofs of these results and those in the next section, see, for example, EDTC, chapter 10

The Bellman Equation and the Principle of Optimality

The main principle of the theory of dynamic programming is that

  • the optimal value function \(v^*\) is a unique solution to the Bellman equation,

    \[v(s) = \max_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} v(s') Q(s, a, s') \right\} \qquad (s \in S),\]

    or in other words, \(v^*\) is the unique fixed point of \(T\), and

  • \(\sigma^*\) is an optimal policy function if and only if it is \(v^*\)-greedy

By the definition of greedy policies given above, this means that

\[\sigma^*(s) \in \operatorname*{arg\,max}_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} v^*(s') Q(s, \sigma(s), s') \right\} \qquad (s \in S)\]

Solving Discrete DPs

Now that the theory has been set out, let’s turn to solution methods

Code for solving dicrete DPs is available in ddp.jl from the QuantEcon.jl code library

It implements the three most important solution methods for discrete dynamic programs, namely

  • value function iteration
  • policy function iteration
  • modified policy function iteration

Let’s briefly review these algorithms and their implementation

Value Function Iteration

Perhaps the most familiar method for solving all manner of dynamic programs is value function iteration

This algorithm uses the fact that the Bellman operator \(T\) is a contraction mapping with fixed point \(v^*\)

Hence, iterative application of \(T\) to any initial function \(v^0 \colon S \to \mathbb R\) converges to \(v^*\)

The details of the algorithm can be found in the appendix

Policy Function Iteration

This routine, also known as Howard’s policy improvement algorithm, exploits more closely the particular structure of a discrete DP problem

Each iteration consists of

  1. A policy evaluation step that computes the value \(v_{\sigma}\) of a policy \(\sigma\) by solving the linear equation \(v = T_{\sigma} v\)
  2. A policy improvement step that computes a \(v_{\sigma}\)-greedy policy

In the current setting policy iteration computes an exact optimal policy in finitely many iterations

  • See theorem 10.2.6 of EDTC for a proof

The details of the algorithm can be found in the appendix

Modified Policy Function Iteration

Modified policy iteration replaces the policy evaluation step in policy iteration with “partial policy evaluation”

The latter computes an approximation to the value of a policy \(\sigma\) by iterating \(T_{\sigma}\) for a specified number of times

This approach can be useful when the state space is very large and the linear system in the policy evaluation step of policy iteration is correspondingly difficult to solve

The details of the algorithm can be found in the appendix

Example: A Growth Model

Let’s consider a simple consumption-saving model

A single household either consumes or stores its own output of a single consumption good

The household starts each period with current stock \(s\)

Next, the household chooses a quantity \(a\) to store and consumes \(c = s - a\)

  • Storage is limited by a global upper bound \(M\)
  • Flow utility is \(u(c) = c^{\alpha}\)

Output is drawn from a discrete uniform distribution on \(\{0, \ldots, B\}\)

The next period stock is therefore

\[s' = a + U \quad \text{where} \quad U \sim U[0, \ldots, B]\]

The discount factor is \(\beta \in [0, 1)\)

Discrete DP Representation

We want to represent this model in the format of a discrete dynamic program

To this end, we take

  • the state variable to be the stock \(s\)

  • the state space to be \(S = \{0, \ldots, M + B\}\)

    • hence \(n = M + B + 1\)
  • the action to be the storage quantity \(a\)

  • the set of feasible actions at \(s\) to be \(A(s) = \{0, \ldots, \min\{s, M\}\}\)

    • hence \(A = \{0, \ldots, M\}\) and \(m = M + 1\)
  • the reward function to be \(r(s, a) = u(s - a)\)

  • the transition probabilities to be

(3)\[\begin{split}Q(s, a, s') := \begin{cases} \frac{1}{B + 1} & \text{if } a \leq s' \leq a + B \\ 0 & \text{ otherwise} \end{cases}\end{split}\]

Defining a DiscreteDP Instance

This information will be used to create an instance of DiscreteDP by passing the following information

  1. An \(n \times m\) reward array \(R\)
  2. An \(n \times m \times n\) transition probability array \(Q\)
  3. A discount factor \(\beta\)

For \(R\) we set \(R[s, a] = u(s - a)\) if \(a \leq s\) and \(-\infty\) otherwise

For \(Q\) we follow the rule in (3)

Note:

  • The feasibility constraint is embedded into \(R\) by setting \(R[s, a] = -\infty\) for \(a \notin A(s)\)
  • Probability distributions for \((s, a)\) with \(a \notin A(s)\) can be arbitrary

A simple type that sets up these objects for us in the current application can be found in

For convenience let’s repeat it here:

type SimpleOG
    B :: Int64
    M :: Int64
    alpha :: Float64
    beta :: Float64
    R :: Array{Float64}
    Q :: Array{Float64}
end

function SimpleOG(;B=10, M=5, alpha=0.5, beta=0.9)

    u(c) = c^alpha
    n = B + M + 1
    m = M + 1

    R = Array(Float64,n,m)
    Q = zeros(Float64,n,m,n)

    for a in 0:M
        Q[:, a + 1, (a:(a + B)) + 1] = 1 / (B + 1)
        for s in 0:(B + M)
            R[s + 1, a + 1] = a<=s ? u(s - a) : -Inf
        end
    end
    
    return SimpleOG(B, M, alpha, beta, R, Q)
end

Let’s run this code and create an instance of SimpleOG

julia> include("finite_dp_og_example.jl")
SimpleOG

julia> g = SimpleOG();

Instances of DiscreteDP are created using the signature DiscreteDP(R, Q, beta)

Let’s create an instance using the objects stored in g

julia> using QuantEcon

julia> ddp = DiscreteDP(g.R, g.Q, g.beta);

Now that we have an instance ddp of DiscreteDP we can solve it as follows

julia> results = solve(ddp, PFI);

Let’s see what we’ve got here

julia> fieldnames(results)
5-element Array{Symbol,1}:
 :v
 :Tv
 :num_iter
 :sigma
 :mc

The most important attributes are v, the value function, and sigma, the optimal policy

julia> results.v
16-element Array{Float64,1}:
 19.0174
 20.0174
 20.4316
 20.7495
 21.0408
 21.3087
 21.5448
 21.7693
 21.9827
 22.1882
 22.3845
 22.5781
 22.7611
 22.9438
 23.1153
 23.2776

julia> results.sigma - 1
16-element Array{Int64,1}:
 0
 0
 0
 0
 1
 1
 1
 2
 2
 3
 3
 4
 5
 5
 5
 5

Here 1 is subtracted from results.sigma because we added 1 to each state and action to create valid indices

Since we’ve used policy iteration, these results will be exact unless we hit the iteration bound max_iter

Let’s make sure this didn’t happen

julia> results.num_iter
3

In this case we converged in only 3 iterations

Another interesting object is results.mc, which is the controlled chain defined by \(Q_{\sigma^*}\), where \(\sigma^*\) is the optimal policy

In other words, it gives the dynamics of the state when the agent follows the optimal policy

Since this object is an instance of MarkovChain from QuantEcon.jl (see this lecture for more discussion), we can easily simulate it, compute its stationary distribution and so on

julia> stationary_distributions(results.mc)[1]
16-element Array{Float64,1}:
 0.0173219
 0.0412106
 0.0577396
 0.0742685
 0.0809582
 0.0909091
 0.0909091
 0.0909091
 0.0909091
 0.0909091
 0.0909091
 0.0735872
 0.0496985
 0.0331695
 0.0166406
 0.00995086

Here’s the same information in a bar graph

../_images/finite_dp_simple_og.png

What happens if the agent is more patient?

julia> g_2 = SimpleOG(beta=0.99);

julia> ddp_2 = DiscreteDP(g_2.R, g_2.Q, g_2.beta);

julia> results_2 = solve(ddp_2, PFI);

julia> stationary_distributions(results_2.mc)[1]
16-element Array{Float64,1}:
 0.00546913
 0.0232134
 0.0314779
 0.0480068
 0.0562713
 0.0909091
 0.0909091
 0.0909091
 0.0909091
 0.0909091
 0.0909091
 0.08544
 0.0676957
 0.0594312
 0.0429023
 0.0346378

If we look at the bar graph we can see the rightward shift in probability mass

../_images/finite_dp_simple_og2.png

State-Action Pair Formulation

The DiscreteDP type in fact provides a second interface to setting up an instance

One of the advantages of this alternative set up is that it permits use of a sparse matrix for Q

(An example of using sparse matrices is given in the exercise solution notebook below)

The call signature of the second formulation is DiscreteDP(R, Q, beta, s_indices, a_indices) where

  • s_indices and a_indices are arrays of equal length L enumerating all feasible state-action pairs
  • R is an array of length L giving corresponding rewards
  • Q is an L x n transition probability array

Here’s how we could set up these objects for the preceding example

using QuantEcon 

B = 10
M = 5
alpha = 0.5
beta = 0.9
u(c) = c^alpha
n = B + M + 1
m = M + 1

s_indices = Int64[]
a_indices = Int64[]
Q = Array(Float64, 0, n)
R = Float64[]

b = 1.0 / (B + 1)

for s in 0:(M + B)
    for a in 0:min(M, s)
        s_indices = [s_indices; s + 1]
        a_indices = [a_indices; a + 1]
        q = zeros(Float64, 1, n)
        q[(a + 1):((a + B) + 1)] = b 
        Q = [Q; q]
        R = [R; u(s-a)]
    end
end

ddp = DiscreteDP(R, Q, beta, s_indices, a_indices);
results = solve(ddp, PFI)

Exercises

In the deterministic optimal growth dynamic programming lecture, we solved a benchmark model that has an analytical solution to check we could replicate it numerically

The exercise is to replicate this solution using DiscreteDP

Appendix: Algorithms

This appendix covers the details of the solution algorithms implemented for DiscreteDP

We will make use of the following notions of approximate optimality:

  • For \(\varepsilon > 0\), \(v\) is called an \(\varepsilon\)-approximation of \(v^*\) if \(\lVert v - v^*\rVert < \varepsilon\)
  • A policy \(\sigma \in \Sigma\) is called \(\varepsilon\)-optimal if \(v_{\sigma}\) is an \(\varepsilon\)-approximation of \(v^*\)

Value Iteration

The DiscreteDP value iteration method implements value function iteration as follows

  1. Choose any \(v^0 \in \mathbb{R}^n\), and specify \(\varepsilon > 0\); set \(i = 0\)
  2. Compute \(v^{i+1} = T v^i\)
  3. If \(\lVert v^{i+1} - v^i\rVert < [(1 - \beta) / (2\beta)] \varepsilon\), then go to step 4; otherwise, set \(i = i + 1\) and go to step 2
  4. Compute a \(v^{i+1}\)-greedy policy \(\sigma\), and return \(v^{i+1}\) and \(\sigma\)

Given \(\varepsilon > 0\), the value iteration algorithm

  • terminates in a finite number of iterations
  • returns an \(\varepsilon/2\)-approximation of the optimal value funciton and an \(\varepsilon\)-optimal policy function (unless iter_max is reached)

(While not explicit, in the actual implementation each algorithm is terminated if the number of iterations reaches iter_max)

Policy Iteration

The DiscreteDP policy iteration method runs as follows

  1. Choose any \(v^0 \in \mathbb{R}^n\) and compute a \(v^0\)-greedy policy \(\sigma^0\); set \(i = 0\)
  2. Compute the value \(v_{\sigma^i}\) by solving the equation \(v = T_{\sigma^i} v\)
  3. Compute a \(v_{\sigma^i}\)-greedy policy \(\sigma^{i+1}\); let \(\sigma^{i+1} = \sigma^i\) if possible
  4. If \(\sigma^{i+1} = \sigma^i\), then return \(v_{\sigma^i}\) and \(\sigma^{i+1}\); otherwise, set \(i = i + 1\) and go to step 2

The policy iteration algorithm terminates in a finite number of iterations

It returns an optimal value function and an optimal policy function (unless iter_max is reached)

Modified Policy Iteration

The DiscreteDP modified policy iteration method runs as follows:

  1. Choose any \(v^0 \in \mathbb{R}^n\), and specify \(\varepsilon > 0\) and \(k \geq 0\); set \(i = 0\)
  2. Compute a \(v^i\)-greedy policy \(\sigma^{i+1}\); let \(\sigma^{i+1} = \sigma^i\) if possible (for \(i \geq 1\))
  3. Compute \(u = T v^i\) (\(= T_{\sigma^{i+1}} v^i\)). If \(\mathrm{span}(u - v^i) < [(1 - \beta) / \beta] \varepsilon\), then go to step 5; otherwise go to step 4
    • Span is defined by \(\mathrm{span}(z) = \max(z) - \min(z)\)
  4. Compute \(v^{i+1} = (T_{\sigma^{i+1}})^k u\) (\(= (T_{\sigma^{i+1}})^{k+1} v^i\)); set \(i = i + 1\) and go to step 2
  5. Return \(v = u + [\beta / (1 - \beta)] [(\min(u - v^i) + \max(u - v^i)) / 2] \mathbf{1}\) and \(\sigma_{i+1}\)

Given \(\varepsilon > 0\), provided that \(v^0\) is such that \(T v^0 \geq v^0\), the modified policy iteration algorithm terminates in a finite number of iterations

It returns an \(\varepsilon/2\)-approximation of the optimal value funciton and an \(\varepsilon\)-optimal policy function (unless iter_max is reached).

See also the documentation for DiscreteDP