Reinforcement learning (RL) is learning what to do, i.e. how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, but must discover which actions yield the most reward by trying them. In the most challenging cases, actions may affect not only the immediate reward, but also subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of RL.
A good way to understand reinforcement learning is to consider some of the examples and possible applications that have guided its development.
These examples share features that are so basic that they are easy to overlook. All involve interaction between an active decision-making agent and its environment, within which the agent seeks to achieve a goal despite uncertainty about that environment. The agent’s actions are permitted to affect the future state of the environment (e.g., the next chess position, the robot’s next location and the future charge level of its battery), thereby affecting the actions and opportunities available to the agent at later times. Correct choice of actions requires taking into account their indirect, delayed consequences, and thus may require foresight or planning.
Four main subelements of an RL system
A policy defines the learning agent’s way of behaving at a given time. A policy is a mapping from perceived states of the environment to actions to be taken when in those states. In general, policies may be stochastic, specifying probabilities for each action.
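As a concrete illustration, here is a minimal Python sketch of a stochastic policy for a toy setting; the state names, actions, and probabilities are invented for illustration and are not from the text.

```python
import random

# Hypothetical stochastic policy: a mapping from perceived states to a
# probability distribution over the actions available in each state.
policy = {
    "start":  {"left": 0.1, "right": 0.9},
    "middle": {"left": 0.5, "right": 0.5},
    "goal":   {"left": 0.0, "right": 1.0},
}

def sample_action(state):
    """Draw an action according to the probabilities the policy assigns in `state`."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("start"))  # usually "right", occasionally "left"
```

A deterministic policy is the special case in which each state assigns probability 1 to a single action.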
A reward signal defines the goal of a reinforcement learning problem. On each time step, the environment sends to the reinforcement learning agent a single number called the reward. The agent’s sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines what are the good and bad events for the agent. In general, reward signals may be stochastic functions of the state of the environment and the actions taken.
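To make the reward signal concrete, the sketch below assumes a hypothetical one-dimensional chain environment invented for illustration: on each time step the environment returns a single scalar reward, and the agent's objective is the sum of these rewards over the run.

```python
import random

def step(position, action):
    """Environment: return the next state and a single scalar reward.

    Reward is +1 on the step that reaches position 3, and 0 otherwise
    (an assumed reward structure, chosen only for illustration).
    """
    next_position = position + (1 if action == "right" else -1)
    reward = 1.0 if next_position == 3 else 0.0
    return next_position, reward

position, total_reward = 0, 0.0
for t in range(10):
    action = random.choice(["left", "right"])  # placeholder behaviour, not a learned policy
    position, reward = step(position, action)
    total_reward += reward                     # the agent seeks to maximize this cumulative sum
print(total_reward)
```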