Temporal-difference (TD) learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

6.1 TD Prediction

A simple every-visit Monte Carlo method suitable for nonstationary environments is

$$ V(S_t) \leftarrow V(S_t) + \alpha \Big[G_t - V(S_t) \Big] \tag{6.1} $$

where $G_t$ is the actual return following time $t$, and $\alpha$ is a constant step-size parameter. Let us call this method constant-$\alpha$ MC. Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(S_t)$ (only then is $G_t$ known), TD methods need to wait only until the next time step. At time $t+1$ they immediately form a target and make a useful update using the observed reward $R_{t+1}$ and the estimate $V(S_{t+1})$. The simplest TD method makes the update

$$ V(S_t) \leftarrow V(S_t) + \alpha\Big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\Big] \tag{6.2} $$

immediately on transition to $S_{t+1}$ and receiving $R_{t+1}$. Essentially, the target for the Monte Carlo update is $G_t$, whereas the target for the TD update is $R_{t+1} + \gamma V(S_{t+1})$. This TD method is called TD(0), or one-step TD, because it uses the value estimate one step later to approximate the full return. The box below specifies TD(0) completely in procedural form.

(Box: Tabular TD(0) for estimating $v_\pi$, in procedural form.)
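
To make the procedure concrete, here is a minimal Python sketch of tabular TD(0) prediction following update (6.2). The environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`) and the `policy` callable are illustrative assumptions, not part of the text.

```python
import collections

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation, following update (6.2).

    Assumes a hypothetical episodic `env` with reset() -> state and
    step(action) -> (next_state, reward, done), and a `policy(state) -> action`.
    """
    V = collections.defaultdict(float)  # V(s), initialized to 0 for all states

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # The TD target uses the current estimate of the successor state's
            # value; terminal states are taken to have value 0.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])   # update (6.2)
            state = next_state

    return V
```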

Because TD(0) bases its update in part on an existing estimate, we say that it is a bootstrapping method, like DP. Here we clarify the differences between the value-estimate updates of MC, DP, and TD methods. From Chapter 4 we know that

$$ \begin{align*} v_\pi(s) &\vcentcolon= \mathbb{E}_\pi[G_t|S_t=s] \tag{6.3} \\ &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1}|S_t=s] \tag{from (3.9)} \\ &= \mathbb{E}_\pi[R_{t+1}+\gamma v_\pi(S_{t+1})|S_t = s]. \tag{6.4} \end{align*} $$

MC methods use an estimate of (6.3) as a target, whereas DP methods use an estimate of (6.4). The MC target is an estimate because the expected value in (6.3) is not known; sample returns from episodes are used in place of the true expected return. The DP target is an estimate not because of the expected values, which are assumed to be completely provided by a model of the environment, but because $v_\pi(S_{t+1})$ is not known and the current array estimate, $V(S_{t+1})$, from the iterative policy evaluation step is used instead. The TD target is an estimate for both reasons: it samples the expected values in (6.4) and it uses the current estimate $V$ instead of the true $v_\pi$. An important thing to note here is that both DP and TD methods bootstrap, i.e., they use value estimates of successor states to estimate the value of the current state, whereas MC methods do not (they use the actual sample return).
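
To make this distinction concrete, the following Python sketch writes out the three targets side by side. The `model[state]` format (a list of `(probability, next_state, reward)` triples, already marginalized over the policy being evaluated) is an assumption for illustration, not something defined in the text.

```python
def mc_target(sampled_return):
    # Monte Carlo target: the full sampled return G_t observed after S_t.
    return sampled_return

def td_target(reward, next_state, V, gamma):
    # TD(0) target: one sampled reward plus the current estimate V(S_{t+1}).
    return reward + gamma * V[next_state]

def dp_target(state, model, V, gamma):
    # DP target: an expectation over all possible successors, computed from a
    # full model. `model[state]` is assumed to be a list of
    # (probability, next_state, reward) triples under the policy.
    return sum(prob * (reward + gamma * V[next_state])
               for prob, next_state, reward in model[state])
```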

Shown below is the backup diagram for tabular TD(0). The value estimate for the state node at the top of the backup diagram is updated on the basis of the one sample transition from it to the immediately following state. We refer to TD and Monte Carlo updates as sample updates because they involve looking ahead to a sample successor state (or state–action pair), using the value of the successor and the reward along the way to compute a backed-up value, and then updating the value of the original state (or state–action pair) accordingly. Sample updates differ from the expected updates of DP methods in that they are based on a single sample successor rather than on a complete distribution of all possible successors.

(Backup diagram for tabular TD(0).)

The difference between the estimated value of $S_t$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$ is called the TD error:

$$ \delta_t \vcentcolon= R_{t+1} + \gamma V(S_{t+1}) - V(S_t). \tag{6.5} $$

Note that if the array $V$ does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors:

$$ \begin{align*} G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\ &= \delta_t + \gamma(G_{t+1} - V(S_{t+1})) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2(G_{t+2} - V(S_{t+2})) \\ &= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \dots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(G_T - V(S_T)) \\ &= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \dots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(0-0) \\ &= \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k \tag{6.6} \end{align*} $$

(In the second-to-last line, $G_T = 0$ and $V(S_T) = 0$ because $S_T$ is terminal.) This identity is not exact if $V$ is updated during the episode (as it is in TD(0)), but if the step size is small then it may still hold approximately.
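
As a quick sanity check of (6.6), the snippet below takes a made-up three-step episode and a fixed value table and verifies that the Monte Carlo error equals the discounted sum of TD errors at every time step; the states, rewards, and values are arbitrary illustrative numbers, not from the text.

```python
# Verify (6.6) on a made-up episode with a fixed value table V.
gamma = 0.9
states  = ['A', 'B', 'C']        # S_0, S_1, S_2 (episode terminates after S_2)
rewards = [1.0, 0.0, 2.0]        # R_1, R_2, R_3
V = {'A': 0.5, 'B': -0.2, 'C': 1.0, 'terminal': 0.0}
T = len(states)

# Returns G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0.
G = [0.0] * (T + 1)
for t in reversed(range(T)):
    G[t] = rewards[t] + gamma * G[t + 1]

# TD errors delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t), with V(terminal) = 0.
next_values = [V[s] for s in states[1:]] + [V['terminal']]
deltas = [rewards[t] + gamma * next_values[t] - V[states[t]] for t in range(T)]

for t in range(T):
    mc_error = G[t] - V[states[t]]
    td_sum = sum(gamma ** (k - t) * deltas[k] for k in range(t, T))
    assert abs(mc_error - td_sum) < 1e-9   # identity (6.6)
    print(f"t={t}: G_t - V(S_t) = {mc_error:.4f}, "
          f"discounted sum of TD errors = {td_sum:.4f}")
```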