The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically. In fact, modern reinforcement learning methods can be viewed as attempts to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.
We note one special property of DP methods. All of them update estimates of the values of states based on estimates of the values of successor states. That is, they update estimates on the basis of other estimates. We call this general idea bootstrapping. Many reinforcement learning methods perform bootstrapping, even those that do not require, as DP requires, a complete and accurate model of the environment.
Policy evaluation refers to the process of computing the state-value function $v_\pi$ for an arbitrary policy $\pi$. We also refer to it as the prediction problem.
Recall from Chapter 3 that, for all $s\in \mathcal{S}$,
$$ \begin{align*} v_\pi(s) &\vcentcolon= \mathbb{E}_\pi[G_t | S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s] \tag{from (3.9)}\\ &= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] \tag{4.3} \\ &= \sum_{a}\pi(a|s)\sum_{s',r}p(s', r|s,a)\Big[ r + \gamma v_\pi(s') \Big] \tag{4.4} \end{align*} $$
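Because DP assumes the dynamics $p$ and the policy $\pi$ are known, (4.4) is a system of $|\mathcal{S}|$ simultaneous linear equations in the $|\mathcal{S}|$ unknowns $v_\pi(s)$. As a sketch in matrix notation (the symbols $P_\pi$ and $r_\pi$ are introduced only for this aside),

$$ v_\pi = r_\pi + \gamma P_\pi v_\pi \quad\Longrightarrow\quad v_\pi = (I - \gamma P_\pi)^{-1} r_\pi, \qquad [P_\pi]_{ss'} = \sum_a \pi(a|s) \sum_r p(s',r|s,a), \quad [r_\pi]_s = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\, r, $$

assuming $I - \gamma P_\pi$ is invertible, as it is whenever $\gamma < 1$.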
In real reinforcement learning problems the dynamics are often unknown, but DP assumes they are given; even so, solving this linear system directly is straightforward in principle yet tedious for large state spaces, so for our purposes iterative solution methods are most suitable. Consider a sequence of approximate value functions $v_0, v_1, v_2, \dots$, each mapping $\mathcal{S}^+$ ($\mathcal{S}$ plus a terminal state, for episodic tasks) to $\mathbb{R}$. The initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive approximation is obtained by using the Bellman equation for $v_\pi$ (4.4) as an update rule:
$$ \begin{align*} v_{k+1}(s) &\vcentcolon= \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s] \\ &= \sum_a\pi(a|s) \sum_{s',r} p(s',r |s,a)\Big[r + \gamma v_k(s')\Big] \tag{4.5} \end{align*} $$
for all $s \in \mathcal{S}$. The sequence $\{v_k\}$ converges to $v_\pi$ as $k \rightarrow \infty$ under the same conditions that guarantee the existence of $v_\pi$. This algorithm is called iterative policy evaluation. To produce each successive approximation, $v_{k+1}$ from $v_k$, iterative policy evaluation applies the same operation to each state $s$: it replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$, and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated. We call this kind of operation an expected update.
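As a concrete illustration, here is a minimal Python sketch of iterative policy evaluation. The interface is hypothetical (the argument names, the `p(s, a)` transition function returning `(probability, next_state, reward, terminal)` tuples, and the `pi(a, s)` policy function are assumptions for this sketch, not part of the text):

```python
def iterative_policy_evaluation(states, actions, p, pi, gamma=1.0, theta=1e-6):
    """Synchronous iterative policy evaluation, update rule (4.5).

    Hypothetical interface (not from the text):
      states  -- iterable of non-terminal states
      actions -- dict: state -> iterable of available actions
      p       -- p(s, a) -> list of (prob, next_state, reward, terminal) tuples
      pi      -- pi(a, s) -> probability of selecting action a in state s
    """
    V = {s: 0.0 for s in states}  # arbitrary initial v_0; terminal values are fixed at 0
    while True:
        V_new = {}
        delta = 0.0
        for s in states:
            # Expected update: average over actions and one-step transitions.
            v = sum(pi(a, s) * prob * (r + gamma * (0.0 if terminal else V[s_next]))
                    for a in actions[s]
                    for prob, s_next, r, terminal in p(s, a))
            V_new[s] = v
            delta = max(delta, abs(v - V[s]))
        V = V_new                 # compute v_{k+1} from v_k (two-array version)
        if delta < theta:         # stop once the largest change in a sweep is tiny
            return V


# Tiny usage example: a two-state chain A -> B -> terminal, reward -1 per step.
states = ["A", "B"]
actions = {"A": ["go"], "B": ["go"]}
p = lambda s, a: [(1.0, "B", -1, False)] if s == "A" else [(1.0, None, -1, True)]
pi = lambda a, s: 1.0
print(iterative_policy_evaluation(states, actions, p, pi, gamma=1.0))
# {'A': -2.0, 'B': -1.0}
```

In practice the loop stops once the largest change in a sweep falls below a small threshold $\theta$, since exact convergence to $v_\pi$ occurs only in the limit; an equally valid variant keeps a single array and updates values in place, which usually converges faster.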
Exercise 4.1 In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_\pi(11,\texttt{down})$? What is $q_\pi(7,\texttt{down})$?
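One way to approach this exercise (Example 4.1 itself is not reproduced here) is to express action values in terms of state values,

$$ q_\pi(s,a) = \sum_{s',r} p(s', r|s,a)\Big[ r + \gamma v_\pi(s') \Big], $$

so each requested quantity is the expected immediate reward for taking $\texttt{down}$ plus the discounted value, under the equiprobable random policy, of the state that results.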