Given $N$ samples of i.i.d. data $X=\{x_1,\dots,x_N\}$, we know the formula for MLE is
$$ \begin{align} \notag \hat \theta &= \underset{\theta}{argmax} \prod_{i=1}^Np(x_i|\theta)\\\ \notag &= \underset{\theta}{argmax} \sum_{i=1}^N\log p(x_i|\theta)\\\ \notag &= \underset{\theta}{argmax} \frac{1}{N}\sum_{i=1}^N\log p(x_i|\theta) \end{align} $$
Note that we can take the logarithm above because $\underset{\theta}{argmax}$ does not return the maximum value of the likelihood but rather the $\theta$ at which the maximum occurs. The logarithm is monotonically increasing, so the position of the optimal $\theta$ is unaffected; dividing by the constant $N$ in the last step leaves it unaffected for the same reason. MLE is a frequentist approach: it assumes there exists a true parameter $\Theta$ of the model that describes the data, and this is the parameter we are trying to estimate. Note that $\Theta$ is fixed and does not depend on our current parameters $\theta$. By the same property of $\underset{\theta}{argmax}$, we can subtract any constant (anything independent of $\theta$) from the expression without affecting the position of the "most likely" $\theta$. As such, we subtract the average log probability of observing $X$ given the true parameters $\Theta$
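As a quick sanity check of the log trick, here is a small numerical sketch (the unit-variance Gaussian data and the grid of candidate $\theta$ values are made up for illustration): the grid point that maximises the log-likelihood is the same one that maximises the average log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical setup: N i.i.d. draws from a unit-variance Gaussian whose mean we estimate.
x = rng.normal(loc=2.0, scale=1.0, size=500)

thetas = np.linspace(0.0, 4.0, 401)  # grid of candidate parameters

# Log-likelihood of each candidate (additive constants dropped).
log_lik = np.array([np.sum(-0.5 * (x - t) ** 2) for t in thetas])

# The monotone log transform and the 1/N scaling change the objective's
# value, but not the location of its maximum.
assert np.argmax(log_lik) == np.argmax(log_lik / len(x))
print(thetas[np.argmax(log_lik)])  # lands near the true mean 2.0
```

Incidentally, the raw product $\prod_{i=1}^N p(x_i|\theta)$ would underflow to zero for even moderate $N$, which is a practical reason, beyond algebraic convenience, to work in log space.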
$$ \hat\theta = \underset{\theta}{argmax} \left[\frac{1}{N}\sum_{i=1}^N\log p(x_i|\theta) - \frac{1}{N}\sum_{i=1}^N\log p(x_i|\Theta)\right]
$$
We will then exploit a few properties of logarithms
$$ \begin{align} \hat\theta &= \underset{\theta}{argmax} \frac{1}{N}\sum_{i=1}^N\log \frac{p(x_i|\theta)}{p(x_i|\Theta)}\\\ &= \underset{\theta}{argmin} \frac{1}{N}\sum_{i=1}^N(-1)\log \frac{p(x_i|\theta)}{p(x_i|\Theta)}\\\ &= \underset{\theta}{argmin} \frac{1}{N}\sum_{i=1}^N\log \frac{p(x_i|\Theta)}{p(x_i|\theta)} \end{align} $$
Let me clarify what we have just done. In (1), we use the quotient rule of logarithms to combine the two sums into a single log of a ratio. Then, in (2), we negate the inner log expression, and therefore have to change the argmax to an argmin. Why? Because
$$ f(x) \leq f(y) \iff -f(x) \geq -f(y)\\ \implies \underset{x}{argmax}\space f(x) = \underset{x}{argmin}\space (-f(x)) $$
Lastly, in (3), we absorb the negation by using the power rule of logarithms to invert the fraction: $-\log \frac{a}{b} = \log \frac{b}{a}$.
Finally, the law of large numbers tells us that as the number of samples increases, the sample average converges to the expected value (mean).
$$ \lim_{n\to\infty}\frac{1}{n} \sum_{i=1}^n X_i = \mathbb{E}_{x\sim p(x)}[X] =\int xf(x)dx $$
where $f(x)$ is the PDF.
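A minimal simulation of this convergence, using a made-up Gaussian with mean $3.0$ and standard deviation $2.0$:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 3.0  # illustrative expectation for the demo

# Sample averages over more and more draws settle around the expectation.
for n in (10, 1_000, 100_000):
    avg = rng.normal(loc=true_mean, scale=2.0, size=n).mean()
    print(n, avg)
```

The printed averages wander for small $n$ and hug `true_mean` as $n$ grows, which is exactly the substitution we make next: replacing the empirical average with an expectation.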
Applying this knowledge to our expression, with the random variable $x$ drawn from the true distribution $p(x|\Theta)$
$$ \begin{align} \notag \hat\theta &= \underset{\theta}{argmin} \space \mathbb{E}_{x\sim p(x|\Theta)}\left[\log \frac{p(x|\Theta)}{p(x|\theta)}\right]\\\ \notag &= \underset{\theta}{argmin}\int p(x|\Theta)\log \frac{p(x|\Theta)}{p(x|\theta)}dx\\\ \notag &= \underset{\theta}{argmin} \space D_{KL}\left[p(x|\Theta) || p(x|\theta) \right] \end{align} $$
The term $D_{KL}\left[p(x|\Theta) || p(x|\theta) \right]$ is precisely the Kullback-Leibler divergence from $p(x|\Theta)$ to $p(x|\theta)$. It measures how different the former probability distribution is from the latter. This means that estimating the best parameters using MLE is the same as finding the parameters that minimise the KL divergence from the true probability distribution to our current probability distribution.
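We can see this equivalence numerically. The sketch below assumes unit-variance Gaussians with an illustrative true mean $\Theta = 1.5$; it estimates the expectation with a Monte-Carlo average and finds the $\theta$ that minimises it.

```python
import numpy as np

rng = np.random.default_rng(2)
Theta = 1.5  # illustrative "true" parameter
x = rng.normal(loc=Theta, scale=1.0, size=50_000)

def log_p(samples, mean):
    # Log density of a unit-variance Gaussian.
    return -0.5 * (samples - mean) ** 2 - 0.5 * np.log(2 * np.pi)

thetas = np.linspace(0.0, 3.0, 301)
# Monte-Carlo estimate of E_{x ~ p(x|Theta)}[log(p(x|Theta) / p(x|theta))].
kl_hat = np.array([(log_p(x, Theta) - log_p(x, t)).mean() for t in thetas])

best = thetas[np.argmin(kl_hat)]
print(best)  # the minimiser sits near Theta = 1.5
```

For unit-variance Gaussians the KL divergence is $(\Theta - \theta)^2 / 2$ in closed form, so the minimiser is $\Theta$ itself; the Monte-Carlo estimate recovers this up to sampling noise, just as the derivation predicts.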
Moreover, with some extra manipulations
$$ \begin{align} \notag \hat\theta &= \underset{\theta}{argmin}\int p(x|\Theta)\log \frac{p(x|\Theta)}{p(x|\theta)}dx\\\ \notag &= \underset{\theta}{argmin}\int p(x|\Theta)\log p(x|\Theta)dx - \int p(x|\Theta)\log p(x|\theta)dx\\\ \notag &= \underset{\theta}{argmin}\int p(x|\Theta)\log \frac{1}{p(x|\theta)}dx - \int p(x|\Theta)\log \frac{1}{p(x|\Theta)}dx\\\ \notag &= \underset{\theta}{argmin} \space H\left[p(x|\Theta),p(x|\theta)\right] - H\left[p(x|\Theta)\right] \end{align} $$
$H\left[p(x|\Theta),p(x|\theta)\right]$ is the cross-entropy of $p(x|\Theta)$ and $p(x|\theta)$, and $H\left[p(x|\Theta)\right]$ is the entropy of $p(x|\Theta)$. Since the entropy term does not depend on $\theta$, minimising the KL divergence is also equivalent to minimising the cross-entropy between the true distribution and our model.
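The identity $D_{KL}[p\,||\,q] = H[p, q] - H[p]$ can be checked directly on a pair of made-up discrete distributions:

```python
import numpy as np

# Two hypothetical discrete distributions over the same 3-point support.
p = np.array([0.5, 0.3, 0.2])   # plays the role of p(x|Theta)
q = np.array([0.4, 0.4, 0.2])   # plays the role of p(x|theta)

cross_entropy = -np.sum(p * np.log(q))   # H[p, q]
entropy = -np.sum(p * np.log(p))         # H[p]
kl = np.sum(p * np.log(p / q))           # D_KL[p || q]

# The decomposition holds exactly (up to floating-point rounding).
print(kl, cross_entropy - entropy)
```

This is also why training a classifier with a cross-entropy loss can be read as doing maximum likelihood estimation: the entropy of the data distribution is a constant the optimiser never sees.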