Design Choices for Imitation Learning Algorithms

When developing an imitation learning method, it is necessary to make several design choices to formalize the problem:

  1. Access to the Reward Function
    1. Pure imitation learning assumes no access to a reward function, whereas reinforcement learning assumes the reward is given.
    2. Hybrid approaches (e.g., SEARN, AggreVaTe) combine imitation and reinforcement learning, refining an initially imitated policy so that the learner can achieve near-optimal performance.
  2. Parsimonious Description of the Desired Behavior
    1. Behavioral Cloning: Direct mapping from features to actions/trajectories, effective for simple, reactive tasks.
    2. Inverse Reinforcement Learning (IRL): Learning a cost function for complex, long-horizon tasks that require planning.
  3. Access to System Dynamics
    1. Model-based methods: Require a model of the system dynamics, e.g., for motion planning in under-actuated robots.
    2. Model-free methods: Can be used when sufficiently accurate controllers are available, avoiding the complex task of learning system dynamics.
  4. Similarity Measure Between Policies
    1. When a reward function is unavailable, similarity between expert and learner policies must be defined.
    2. Similarity is ideally defined over entire trajectories rather than individual decisions.
  5. Features
    1. Effective feature selection is critical for expressing desired behavior while minimizing complexity.
    2. Deep learning has enabled automatic extraction of feature representations.
  6. Policy Representation
    1. Policies can be represented using neural networks, linear functions, etc.
    2. The level of task abstraction (e.g., task level, trajectory level, action-state level) must be chosen carefully.
    3. Increasing policy complexity improves expressive power but requires more data and time for learning.
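
As a toy illustration of the last point, the snippet below contrasts a linear policy with a small neural-network policy at the action-state level. The dimensions and layer sizes are arbitrary placeholders, and PyTorch is used only as one possible implementation choice.

```python
import torch.nn as nn

# Illustrative only: two policy representations at the action-state level.
# STATE_DIM and ACTION_DIM are placeholder dimensions, not taken from the text.
STATE_DIM, ACTION_DIM = 10, 4

# Linear policy: few parameters, limited expressive power, data-efficient.
linear_policy = nn.Linear(STATE_DIM, ACTION_DIM)

# Neural-network policy: far more expressive, but needs more data and training time.
mlp_policy = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, ACTION_DIM),
)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

print(num_params(linear_policy))  # 44 parameters
print(num_params(mlp_policy))     # 5124 parameters
```

The parameter counts make the trade-off concrete: the more expressive representation has two orders of magnitude more parameters to fit from demonstrations.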

Behavioral Cloning and Inverse Reinforcement Learning

Behavioral Cloning (BC) learns a policy that directly maps input states and/or contexts to actions using supervised learning on demonstrated trajectories. Given a dataset of demonstrated state-action pairs and contexts $\mathcal{D} = \{(\mathbf{x}_t, \mathbf{s}_t, \mathbf{u}_t)\}$, we can directly learn a mapping from states and/or contexts to control inputs as

$$ \mathbf{u}_t = \pi(\mathbf{x}_t, \mathbf{s}_t). \tag{2.1} $$
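
As a minimal sketch of Eq. (2.1), the following PyTorch snippet fits a small neural-network policy to demonstrated state-action pairs by supervised regression. The dataset, dimensions, and architecture are placeholders chosen for illustration rather than taken from any specific method.

```python
import torch
import torch.nn as nn

# Minimal behavioral-cloning sketch: fit a policy to demonstrated
# state-action pairs by supervised regression. The data are random
# placeholders standing in for expert demonstrations.
STATE_DIM, ACTION_DIM, N = 10, 4, 1000
states = torch.randn(N, STATE_DIM)    # demonstrated states x_t (contexts s_t could be concatenated)
actions = torch.randn(N, ACTION_DIM)  # demonstrated control inputs u_t

# Policy pi mapping states to control inputs, as in Eq. (2.1).
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(policy(states), actions)  # supervised imitation loss
    loss.backward()
    optimizer.step()
```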

Alternatively, given a reward signal, a policy can be obtained by optimizing the expected return under the learned reward function:

$$ \pi = \argmax_{\hat{\pi}} J(\hat{\pi}), \tag{2.2} $$

where $J(\hat{\pi})$ is the expected accumulated reward under the policy $\hat{\pi}$. In imitation learning, however, the reward function is unknown and needs to be recovered from expert demonstrations, under the assumption that the demonstrations are (approximately) optimal with respect to this reward function. Recovering the reward function from demonstrations is often referred to as Inverse Reinforcement Learning (IRL).
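
The sketch below illustrates the IRL idea under the simplifying assumption of a linear reward $r(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})$: the reward weights are chosen so that the expert's average features score higher than the learner's, in the spirit of feature-matching approaches. The feature dimension, the random "trajectories", and the single projection-style update are illustrative placeholders; a full method would alternate this step with re-optimizing the policy via Eq. (2.2).

```python
import numpy as np

FEATURE_DIM = 8
rng = np.random.default_rng(0)

def feature_expectations(trajectories):
    """Average feature vector phi(x_t) over all states of all trajectories."""
    return np.mean([phi for traj in trajectories for phi in traj], axis=0)

# Placeholder demonstrations and learner rollouts: lists of per-state feature vectors.
expert_trajs = [[rng.normal(size=FEATURE_DIM) + 1.0 for _ in range(20)] for _ in range(5)]
learner_trajs = [[rng.normal(size=FEATURE_DIM) for _ in range(20)] for _ in range(5)]

# One projection-style update: pick the reward direction that makes the
# expert's feature expectations look better than the current learner's.
w = feature_expectations(expert_trajs) - feature_expectations(learner_trajs)
w /= max(np.linalg.norm(w), 1e-8)

def reward(phi):
    """Recovered linear reward for a single feature vector phi(x)."""
    return float(w @ phi)
```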

Model-Free and Model-Based Imitation Learning Methods

Model-free imitation learning methods directly learn a policy that replicates the expert's behavior without attempting to model the underlying system dynamics. This approach is simpler and avoids the complexities of estimating the dynamics. It is particularly effective for fully-actuated robotic systems, such as industrial robots with reliable position and velocity controllers, where the dynamics are largely compensated by the low-level controllers and smooth trajectories can be planned easily. Consequently, model-free behavioral cloning (BC) has been widely adopted for such applications.
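
A deliberately simple sketch of this model-free, trajectory-level setting is given below: demonstrated joint trajectories are averaged into a reference trajectory that is handed to the robot's position controller, with no dynamics model involved. The data are synthetic, and `send_position_command` is a hypothetical placeholder for a real controller interface.

```python
import numpy as np

NUM_JOINTS, T = 7, 200
rng = np.random.default_rng(0)

# Synthetic demonstrations: a few joint-space trajectories of shape (T, NUM_JOINTS).
demos = [np.cumsum(rng.normal(scale=0.01, size=(T, NUM_JOINTS)), axis=0)
         for _ in range(5)]

# A trivial trajectory-level "policy": the mean demonstrated trajectory.
reference = np.mean(demos, axis=0)

def send_position_command(q_desired):
    """Hypothetical stand-in for a robot's low-level position-control interface."""
    pass

for t in range(T):
    send_position_command(reference[t])  # the controller, not a learned model, handles the dynamics
```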