Reinforcement learning algorithms usually assume a single-objective framework, where the agent's goal is to optimise a single reward function. However, most real-world decision problems are inherently multi-objective. For example, when deciding which car to purchase, we face competing desires: minimising cost while maximising comfort, or maximising performance while minimising fuel consumption.
As such, a multi-objective optimisation or Pareto optimisation approach is essential to solving these decision problems. Multi-objective optimisation allows us to find multiple solutions that offer trade-offs among the various objectives, circumventing the need for a priori scalarisation. Multi-Objective Reinforcement Learning (MORL) extends the principles of multi-objective optimisation to sequential decision-making under uncertainty. Specifically, we deal with problems in which the agent must achieve multiple objectives, each with its own associated reward signal.
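To make the idea of trade-off solutions concrete, here is a minimal Python sketch of Pareto dominance and of filtering a set of value vectors down to its non-dominated (Pareto) front. It assumes every objective is expressed as a quantity to maximise (e.g. comfort, and cost negated); the function names and the toy numbers are purely illustrative.

```python
import numpy as np

def dominates(v: np.ndarray, w: np.ndarray) -> bool:
    """Return True if value vector v Pareto-dominates w.

    Assuming all objectives are maximised: v dominates w when v is at
    least as good in every objective and strictly better in at least one.
    """
    return bool(np.all(v >= w) and np.any(v > w))

def pareto_front(values: np.ndarray) -> np.ndarray:
    """Keep only the value vectors not dominated by any other vector."""
    keep = []
    for i, v in enumerate(values):
        if not any(dominates(w, v) for j, w in enumerate(values) if j != i):
            keep.append(i)
    return values[keep]

# Toy car-purchase example: objectives are (comfort, -cost), both maximised.
candidates = np.array([
    [8.0, -30.0],   # comfortable but expensive
    [5.0, -15.0],   # modest comfort, cheap
    [4.0, -35.0],   # dominated: less comfortable *and* more expensive
])
print(pareto_front(candidates))  # keeps the first two trade-off solutions
```

The first two candidates survive because neither is better than the other in both objectives at once; this is exactly the set of trade-off solutions a multi-objective method aims to recover without committing to a fixed scalarisation in advance.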
Let’s start by formalising the MORL setting.
We consider an infinite-horizon discounted Markov decision process (MDP) and extend it into a Multi-Objective MDP (MOMDP) $\mathcal{M}$ by simply changing the reward function from scalar-valued to vector-valued. The MOMDP is defined by the tuple
$$ \langle S, A, T, \gamma, \mu, K, \mathbf{R} \rangle $$
where