Reinforcement learning (RL) is learning what to do, i.e. how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, but must discover which actions yield the most reward by trying them. In the most challenging cases, actions may affect not only the immediate reward, but also subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of RL.
A good way to understand reinforcement learning is to consider some of the examples and possible applications that have guided its development.
These examples share features that are so basic that they are easy to overlook. All involve interaction between an active decision-making agent and its environment, within which the agent seeks to achieve a goal despite uncertainty about that environment. The agent’s actions are permitted to affect the future state of the environment (e.g., the next chess position, the robot’s next location and the future charge level of its battery), thereby affecting the actions and opportunities available to the agent at later times. Correct choice of actions requires taking into account their indirect, delayed consequences, and thus may require foresight or planning.
Four main subelements of an RL system
A policy defines the learning agent’s way of behaving at a given time. A policy is a mapping from perceived states of the environment to actions to be taken when in those states. In general, policies may be stochastic, specifying probabilities for each action.
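As a concrete illustration, here is a minimal Python sketch of a stochastic policy for a toy setting; the state names, actions, and probabilities are invented for illustration and are not from the text.

```python
import random

# Hypothetical stochastic policy: a mapping from perceived states to a
# probability distribution over the actions available in each state.
policy = {
    "start":  {"left": 0.1, "right": 0.9},
    "middle": {"left": 0.5, "right": 0.5},
    "goal":   {"left": 0.0, "right": 1.0},
}

def sample_action(state):
    """Draw an action according to the probabilities the policy assigns in `state`."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("start"))  # usually "right", occasionally "left"
```

A deterministic policy is the special case in which each state assigns probability 1 to a single action.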
A reward signal defines the goal of a reinforcement learning problem. On each time step, the environment sends to the reinforcement learning agent a single number called the reward. The agent’s sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines what are the good and bad events for the agent. In general, reward signals may be stochastic functions of the state of the environment and the actions taken.
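To make the reward signal concrete, the sketch below assumes a hypothetical one-dimensional chain environment invented for illustration: on each time step the environment returns a single scalar reward, and the agent's objective is the sum of these rewards over the run.

```python
import random

def step(position, action):
    """Environment: return the next state and a single scalar reward.

    Reward is +1 on the step that reaches position 3, and 0 otherwise
    (an assumed reward structure, chosen only for illustration).
    """
    next_position = position + (1 if action == "right" else -1)
    reward = 1.0 if next_position == 3 else 0.0
    return next_position, reward

position, total_reward = 0, 0.0
for t in range(10):
    action = random.choice(["left", "right"])  # placeholder behaviour, not a learned policy
    position, reward = step(position, action)
    total_reward += reward                     # the agent seeks to maximize this cumulative sum
print(total_reward)
```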