Reinforcement learning algorithms usually assume a single-objective framework, where the agent's goal is to optimise a single reward function. However, most real-world decision problems are inherently multi-objective. For example, when deciding which car to purchase, we face competing desires: minimising cost while maximising comfort, or maximising performance while minimising fuel consumption.
As such, a multi-objective optimisation or Pareto optimisation approach is essential to solving these decision problems. Multi-objective optimisation allows us to find multiple solutions that offer trade-offs among the various objectives, circumventing the need for a priori scalarisation. Multi-Objective Reinforcement Learning (MORL) extends the principles of multi-objective optimisation to sequential decision-making under uncertainty. Specifically, we deal with problems in which the agent must achieve multiple objectives, each with its own associated reward signal.
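To make the idea of trade-off solutions concrete, here is a minimal Python sketch of Pareto dominance and of filtering a set of value vectors down to its non-dominated (Pareto) front. It assumes every objective is expressed as a quantity to maximise (e.g. comfort, and cost negated); the function names and the toy numbers are purely illustrative.

```python
import numpy as np

def dominates(v: np.ndarray, w: np.ndarray) -> bool:
    """Return True if value vector v Pareto-dominates w.

    Assuming all objectives are maximised: v dominates w when v is at
    least as good in every objective and strictly better in at least one.
    """
    return bool(np.all(v >= w) and np.any(v > w))

def pareto_front(values: np.ndarray) -> np.ndarray:
    """Keep only the value vectors not dominated by any other vector."""
    keep = []
    for i, v in enumerate(values):
        if not any(dominates(w, v) for j, w in enumerate(values) if j != i):
            keep.append(i)
    return values[keep]

# Toy car-purchase example: objectives are (comfort, -cost), both maximised.
candidates = np.array([
    [8.0, -30.0],   # comfortable but expensive
    [5.0, -15.0],   # modest comfort, cheap
    [4.0, -35.0],   # dominated: less comfortable *and* more expensive
])
print(pareto_front(candidates))  # keeps the first two trade-off solutions
```

The first two candidates survive because neither is better than the other in both objectives at once; this is exactly the set of trade-off solutions a multi-objective method aims to recover without committing to a fixed scalarisation in advance.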
Let’s start by formalising the MORL setting.
We consider an infinite-horizon discounted Markov decision process (MDP) and extend it into a Multi-Objective MDP (MOMDP) $\mathcal{M}$ by simply changing the reward function from scalar-valued to vector-valued. The MOMDP is defined by the tuple
$$ \langle S, A, T, \gamma, \mu, K, \mathbf{R} \rangle $$
where