In this chapter we introduce the formal problem of finite Markov decision processes, or finite MDPs, which we try to solve in the rest of the book. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those, future rewards. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s,a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections.
MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.
More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t=0,1,2,3,\dots$. At each time step $t$, the agent receives some representation of the environment’s state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(s)$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$. The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:
$$ S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots \tag{3.1} $$
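The following sketch shows how such a trajectory arises from the interaction loop. The `env` and `policy` objects are hypothetical placeholders, not part of any particular library: we assume `env.reset()` returns an initial state $S_0$, `env.step(a)` returns the next reward and state, and `policy(s)` selects an action for state `s`.

```python
# A minimal sketch of the agent-environment loop that produces the
# trajectory in (3.1). The env and policy interfaces are assumptions
# made purely for illustration.

def generate_trajectory(env, policy, num_steps):
    trajectory = []            # will hold S_0, A_0, R_1, S_1, A_1, R_2, ...
    state = env.reset()        # S_0
    for t in range(num_steps):
        action = policy(state)                  # A_t, chosen from A(S_t)
        reward, next_state = env.step(action)   # R_{t+1}, S_{t+1}
        trajectory += [state, action, reward]
        state = next_state
    return trajectory
```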
In a finite MDP, the sets of states, actions, and rewards ($\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$) all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well defined discrete probability distributions dependent only on the preceding state and action. That is, for particular values of these random variables, $s' \in \mathcal{S}$ and $r \in \mathcal{R}$, there is a probability of those values occurring at time $t$, given particular values of the preceding state and action:
$$ p(s', r|s, a) \vcentcolon= \text{Pr}\{S_t = s', R_t = r|S_{t-1}=s, A_{t-1} = a\}, \tag{3.2} $$
for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$. The function $p$ defines the dynamics of the MDP. The dynamics function $p \vcentcolon \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$ is an ordinary deterministic function of four arguments. The vertical bar in the middle of it comes from the notation for conditional probability, but here it just reminds us that $p$ specifies a probability distribution for each choice of $s$ and $a$, that is, that
$$ \sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s', r|s, a) = 1, \quad\text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s). \tag{3.3} $$
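For a finite MDP the dynamics function can be written down as a plain table. The sketch below stores $p(s', r|s, a)$ as a dictionary keyed by $(s, a)$ and checks the normalization condition (3.3). The two-state, two-action MDP is invented purely for illustration.

```python
# A sketch of the four-argument dynamics p(s', r | s, a) of a toy finite MDP,
# stored as a dictionary mapping (s, a) to a distribution over (s', r) pairs.
p = {
    ('s0', 'left'):  {('s0', 0.0): 0.9, ('s1', 1.0): 0.1},
    ('s0', 'right'): {('s1', 1.0): 1.0},
    ('s1', 'left'):  {('s0', 0.0): 1.0},
    ('s1', 'right'): {('s1', 5.0): 0.2, ('s0', 0.0): 0.8},
}

# Check the normalization condition (3.3): for every state-action pair,
# the probabilities over (s', r) outcomes must sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9, (s, a)
```

With such a table in hand, the quantities (3.4)–(3.6) below reduce to short sums over its entries.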
In a Markov decision process, the probabilities given by $p$ completely characterize the environment’s dynamics. That is, the probability of each possible value for $S_t$ and $R_t$ depends on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$, and not on earlier states and actions. This is best viewed as a restriction not on the decision process, but on the state. The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. If it does, then the state is said to have the Markov property.
From the four-argument dynamics function, $p$, one can compute anything else one might want to know about the environment, such as the state-transition probabilities,
$$ p(s' |s,a) = \text{Pr}\{S_t = s'|S_{t-1} = s, A_{t-1} =a\} = \sum_{r \in \mathcal{R}}p(s', r|s,a). \tag{3.4} $$
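As a sketch of (3.4), the state-transition probability is obtained from the dynamics table above by summing out the reward:

```python
# p(s' | s, a): sum the probabilities of all (s', r) outcomes whose next
# state matches s_next, as in (3.4). Uses the toy table p defined above.
def transition_prob(p, s, a, s_next):
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
```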
We can also compute the expected rewards for state–action pairs as a two-argument function $r \vcentcolon \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$:
$$ r(s,a) = \mathbb{E}[R_t|S_{t-1}=s, A_{t-1} =a] = \sum_{r \in \mathcal{R}}r\sum_{s' \in \mathcal{S}}p(s', r|s,a), \tag{3.5} $$
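Continuing the toy example, (3.5) is the probability-weighted sum of rewards over all possible $(s', r)$ outcomes:

```python
# r(s, a): expected reward for a state-action pair, as in (3.5),
# computed from the toy dynamics table p defined above.
def expected_reward(p, s, a):
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())
```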
and the expected rewards for state–action–next-state triples as a three-argument function $r \vcentcolon \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$,
$$ r(s,a,s') = \mathbb{E}[R_t|S_{t-1}=s, A_{t-1} =a, S_{t} =s'] = \sum_{r \in \mathcal{R}}r\,\frac{p(s', r|s,a)}{p(s'|s,a)}. \tag{3.6} \$$
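A sketch of (3.6), again using the toy table `p` and the `transition_prob` helper from above; it assumes the next state $s'$ is reachable from $(s, a)$, i.e., $p(s'|s,a) > 0$:

```python
# r(s, a, s'): expected reward given that action a in state s led to next
# state s', as in (3.6). Assumes transition_prob(p, s, a, s_next) > 0.
def expected_reward_given_next(p, s, a, s_next):
    denom = transition_prob(p, s, a, s_next)
    return sum(r * prob for (sp, r), prob in p[(s, a)].items()
               if sp == s_next) / denom
```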
The MDP framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. Similarly, the states can take a wide variety of forms. They can be completely determined by low-level sensations, such as direct sensor readings, or they can be more high-level and abstract, such as symbolic descriptions of objects in a room.