What are Diffusion Models?

The concept of diffusion-based generative modelling was first proposed in 2015 by Sohl-Dickstein et al., who drew inspiration from non-equilibrium thermodynamics. As Sohl-Dickstein et al. (2015) describe the idea:

The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data.

Half a decade later, Ho et al. (2020) proposed Denoising Diffusion Probabilistic Models (DDPMs), which improved upon the earlier method by introducing significant simplifications to the training process. Soon after, a 2021 paper by OpenAI demonstrated that DDPMs outperform Generative Adversarial Networks (GANs) on image synthesis tasks. Since then, notable diffusion-based generative models such as DALL-E, Stable Diffusion and Imagen have been released. I’ll be covering the concepts underlying diffusion models, focusing mainly on DDPMs.

To better understand this, let’s focus on the forward diffusion process and reverse diffusion process separately.

Forward Diffusion Process

Image by Karagiannakos and Adaloglou (2022), modified from Ho et al. (2020)

In the forward trajectory, we want to gradually “corrupt” the training images. To do so, we iteratively apply Gaussian noise to a sample drawn from the true data distribution, i.e. $x_0 \sim q(x)$, over $T$ steps to produce a sequence of increasingly noisy samples $x_1, x_2, \dots , x_T$.

The diffusion process is fixed to a Markov chain, which simply means that each step depends only on the previous one (it is memoryless). Specifically, at each step $t$ we add Gaussian noise with variance $\beta_t \in (0,1)$ to $x_{t-1}$ to produce a latent variable $x_t$ of the same dimensionality. Each transition is therefore parameterized as a diagonal Gaussian distribution whose mean is a scaled version of the previous sample:

$$ q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I}) $$
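To make this concrete, here is a minimal NumPy sketch of a single forward step. The linear schedule endpoints ($10^{-4}$ to $0.02$) follow the values reported by Ho et al. (2020); the number of steps, the array shape standing in for an image, and the name `forward_step` are illustrative assumptions, not part of any particular library.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # variance schedule beta_1, ..., beta_T (endpoints as in Ho et al., 2020)

def forward_step(x_prev: np.ndarray, beta_t: float, rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))  # stand-in for an image drawn from q(x)
x1 = forward_step(x0, betas[0], rng)   # first noising step
```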

Note that the distribution of the full noising trajectory $x_{1:T}$, conditioned on the original sample $x_0$, factorizes into a product of single-step conditionals:

$$ q(x_{1:T} | x_0) = \prod^T_{t=1} q(x_t | x_{t-1}) $$
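Because the chain factorizes this way, the full trajectory can be simulated by simply applying the single-step transition $T$ times in a row. A sketch, reusing the hypothetical `forward_step`, `betas`, `x0` and `rng` from the snippet above:

```python
def forward_trajectory(x0, betas, rng):
    """Simulate q(x_{1:T} | x_0) by chaining the single-step transitions."""
    xs = [x0]
    for beta_t in betas:  # t = 1, ..., T
        xs.append(forward_step(xs[-1], beta_t, rng))
    return xs             # [x_0, x_1, ..., x_T]

trajectory = forward_trajectory(x0, betas, rng)
# For large T, the final sample x_T retains essentially no information
# about x_0 and approaches isotropic Gaussian noise.
```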