Imitation learning is a class of methods that reproduces desired behavior based on expert demonstrations. It is thus a technique that enables skills to be transferred from human experts to robotic systems.


Key Differences between Imitation Learning and Supervised Learning

The imitation learning problem has special properties that distinguish it from the supervised learning setting:

  1. the solution may have important structural properties, including constraints (for example, robot joint limits), dynamic smoothness and stability, or coherence as a multi-step plan
  2. the interaction between the learner’s decisions and its own input distribution (an on-policy versus off-policy distinction)
  3. the increased necessity of minimizing the typically high cost of gathering examples

In practice, these distinctions arise from the structural properties of the policies we attempt to imitate and from the learner's influence on its own inputs: because a physical system is often involved, "resetting" the state and restarting predictions is too costly or even infeasible in most imitation learning settings.
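
To make the on-policy effect concrete, the minimal sketch below (the `policy` and `env_step` callables are placeholders, not part of the original text) shows how the learner's own decisions determine the inputs it receives next, which is exactly what i.i.d. supervised prediction does not model:

```python
import numpy as np

def rollout(policy, env_step, phi_0, horizon):
    """Roll out a learned policy: each decision shapes the next input.

    Unlike i.i.d. supervised prediction, the features the policy sees at
    step t+1 depend on its own action at step t, so small errors can drive
    the learner into states never covered by the demonstrations.
    """
    phi = phi_0
    trajectory = [phi]
    for _ in range(horizon):
        action = policy(phi)           # the learner's decision
        phi = env_step(phi, action)    # next feature induced by that decision
        trajectory.append(phi)
    return np.stack(trajectory)
```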

In addition, the embodiments of the expert and the learner often differ. For example, when transferring human skills to a humanoid robot, the motion captured from a human expert may be infeasible for the humanoid. This kind of adaptation is less common in standard supervised learning.

Formulation of the Imitation Learning Problem

Suppose that the behavior of the expert demonstrator (or the learner itself) can be observed as a trajectory $\mathbf{\tau} = [\mathbf{\phi}_0, \dots, \mathbf{\phi}_T]$, which is a sequence of features $\phi$. The features $\phi$ can be the state of the robotic system or any other measurements chosen according to the given problem.

Often, the demonstrations are recorded under different conditions, for example, grasping an object at different locations. These task conditions are referred to as the context vector $\mathbf{s}$ of the task and are stored together with the feature trajectories. Optionally, a reward signal $r$ that the expert is trying to optimize is also available in some problem settings.

In imitation learning, we collect a dataset of demonstrations $\mathcal{D} = \{(\mathbf{\tau}_i, \mathbf{s}_i, r_i)\}_{i=1}^{N}$. The data collection process can be either offline or online. Using the collected dataset $\mathcal{D}$, a common optimization-based strategy learns a policy $\pi^*$ that satisfies
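
For concreteness, a demonstration record could be stored as in the minimal sketch below; the class and field names are illustrative, not taken from any particular library:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Demonstration:
    """One demonstration: a feature trajectory with optional context and reward."""
    trajectory: np.ndarray                  # shape (T + 1, feature_dim), the sequence phi_0 ... phi_T
    context: Optional[np.ndarray] = None    # task context vector s, e.g. the object location
    reward: Optional[np.ndarray] = None     # reward signal r, if available

# The dataset D is simply a collection of such records.
dataset: list[Demonstration] = []
```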

$$ \pi^* = \arg\min_{\pi} D\big(q(\mathbf{\phi}), p(\mathbf{\phi})\big), \tag{1.1} $$

where $q(\mathbf{\phi})$ is the distribution of the features induced by the expert's policy, $p(\mathbf{\phi})$ is the distribution of the features induced by the learner's policy $\pi$, and $D(q, p)$ is a similarity measure between $q$ and $p$.
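
As one common instantiation of (1.1) (a sketch under assumptions, not the only choice): if the features are state-action pairs and $D$ is the forward KL divergence, minimizing it amounts to maximizing the likelihood of the demonstrated actions under the learner's policy; for a linear policy with fixed Gaussian noise this reduces to least-squares regression, i.e. behavioral cloning:

```python
import numpy as np

def behavioral_cloning(states, actions):
    """Fit a linear policy a ~ s @ W to expert (state, action) pairs by least squares.

    states:  array of shape (num_samples, state_dim)
    actions: array of shape (num_samples, action_dim)
    Returns a policy function mapping a state to a predicted action.
    """
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)  # maximum likelihood under a fixed-variance Gaussian
    return lambda s: s @ W
```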

In addition, we often have access to an environment, such as a simulator or a physical robotic system, where we can execute and evaluate a policy through interaction. This environment can be used to gather new data and iteratively improve the policy to better match the demonstrations.
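
For example, an iterative scheme in the spirit of dataset aggregation (DAgger) alternates rolling out the current policy in the environment, querying the expert on the visited features, and refitting; the callables below (`env_step`, `expert_label`, `fit_policy`) are illustrative placeholders rather than part of the original formulation:

```python
def iterative_imitation(env_step, phi_0, expert_label, fit_policy,
                        num_iterations=10, horizon=100):
    """DAgger-style loop: roll out the current policy, query the expert
    on the visited features, aggregate the pairs, and refit."""
    data = []
    policy = expert_label  # the first rollout is driven by the expert itself
    for _ in range(num_iterations):
        phi = phi_0
        for _ in range(horizon):
            action = policy(phi)
            data.append((phi, expert_label(phi)))  # expert labels the visited feature
            phi = env_step(phi, action)            # learner's action induces the next feature
        policy = fit_policy(data)                  # retrain on the aggregated dataset
    return policy
```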