Before we dive into Bayesian learning, it is worth recapping Bayes' Theorem and the frequentist view of linear regression.
Events
Suppose events $B_1, B_2, \dots, B_m$ partition the sample space (that is, they are mutually exclusive and collectively exhaustive). Then, for each $B_k$ and any event $A$ with $P(A) > 0$, Bayes' Theorem gives (expanding the denominator with the law of total probability)
$$ P(B_k|A) = \frac{P(A|B_k)P(B_k)}{P(A)}= \frac{P(A|B_k) P(B_k)}{\sum_{i=1}^m P(A| B_i) P(B_i)} $$
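As a quick sanity check, here is a minimal numerical sketch of this formula for a three-event partition; the prior probabilities $P(B_i)$ and likelihoods $P(A \mid B_i)$ below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical three-event partition B_1, B_2, B_3 with priors P(B_i),
# and an observed event A with likelihoods P(A | B_i).
prior = np.array([0.5, 0.3, 0.2])        # P(B_i); sums to 1
likelihood = np.array([0.9, 0.5, 0.1])   # P(A | B_i)

# Law of total probability gives the denominator P(A)
p_a = np.sum(likelihood * prior)

# Bayes' Theorem: P(B_k | A) = P(A | B_k) P(B_k) / P(A)
posterior = likelihood * prior / p_a

print(posterior)        # posterior probabilities P(B_k | A)
print(posterior.sum())  # sanity check: sums to 1
```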
Random Variables
The same result holds for densities: given continuous random variables $X$ and $Y$, Bayes' Theorem is given by
$$ p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\,p_X(x)}{p_Y(y)} = \frac{p_{Y|X}(y|x)\,p_X(x)}{\int p_{Y|X}(y|x')\,p_X(x')\,dx'} $$
where each $p$ is a density function and $p_Y(y) > 0$.
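The integral in the denominator rarely has a closed form, but it can be approximated numerically. Below is a minimal sketch that evaluates the posterior density on a grid, assuming (purely for illustration) a standard normal prior $p_X$ and a Gaussian likelihood $p_{Y|X}(y|x) = \mathcal{N}(y;\, x,\, 0.5^2)$.

```python
import numpy as np

# Toy setup (assumed for illustration): prior X ~ N(0, 1), likelihood Y | X=x ~ N(x, 0.5^2).
x_grid = np.linspace(-5, 5, 1001)
prior = np.exp(-0.5 * x_grid**2) / np.sqrt(2 * np.pi)   # p_X(x) on the grid

y_obs = 1.2   # observed value of Y
sigma = 0.5
likelihood = np.exp(-0.5 * ((y_obs - x_grid) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))  # p_{Y|X}(y|x)

# Denominator p_Y(y): integrate likelihood * prior over x (trapezoidal rule on the grid)
p_y = np.trapz(likelihood * prior, x_grid)

# Posterior density p_{X|Y}(x|y) evaluated on the grid
posterior = likelihood * prior / p_y

print(np.trapz(posterior, x_grid))  # sanity check: integrates to ~1
```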
The linear regression model assumes that the relationship between the dependent variable $y$ and the regressors $x$ is linear. Given a sample of $n$ paired observations $(x_1, y_1), (x_2, y_2),\dots , (x_n, y_n)$, where each input $x_i \in \mathbb{R}^{1\times d}$ is a vector with $d$ features, the relationship between $y$ and $x$ is expressed as
$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2x_{i2} +\dots +\beta_dx_{id} + \epsilon_i, \qquad i = 1, \dots, n $$
where $\beta_0$ is the intercept term, $\beta_1, \beta_2, \dots, \beta_d$ are the coefficients, and $\epsilon_i$ is the error (or noise) term.
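To make the model concrete, here is a short sketch that simulates data from it; the dimensions $n = 100$, $d = 3$ and the particular coefficient and noise values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions and parameters for illustration
n, d = 100, 3
beta0 = 1.0                          # intercept beta_0
beta = np.array([2.0, -1.0, 0.5])    # coefficients beta_1, ..., beta_d

X = rng.normal(size=(n, d))              # each row is an input x_i in R^{1 x d}
eps = rng.normal(scale=0.3, size=n)      # error terms epsilon_i
y = beta0 + X @ beta + eps               # y_i = beta_0 + sum_j beta_j x_ij + epsilon_i
```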
For convenience, we express the formula using matrix notation, that is
$$ Y = X\beta+\epsilon $$
where $Y = (y_1, \dots, y_n)^\top$ and $\epsilon = (\epsilon_1, \dots, \epsilon_n)^\top$ are $n$-dimensional vectors, $X$ is the $n \times (d+1)$ design matrix whose first column is all ones (for the intercept) and whose remaining columns hold the features, and $\beta$ is a $(d + 1)$-dimensional vector containing the intercept term and the coefficients for each feature.
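Under the frequentist view, $\beta$ is estimated by ordinary least squares, i.e. by minimising $\lVert Y - X\beta \rVert^2$. The sketch below builds the design matrix with a leading column of ones and solves the least-squares problem; the simulated data mirrors the previous sketch and is assumed, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same simulated data as in the previous sketch (assumed, for illustration)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Prepend a column of ones so the first entry of beta corresponds to the intercept beta_0;
# the design matrix is then n x (d + 1), matching the (d + 1)-dimensional beta.
X_design = np.column_stack([np.ones(n), X])

# Ordinary least squares: beta_hat = argmin ||y - X beta||^2,
# solved with a numerically stable least-squares routine.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)   # roughly recovers [1.0, 2.0, -1.0, 0.5]
```

Using `np.linalg.lstsq` avoids explicitly forming and inverting $X^\top X$ from the normal equations, which is better behaved numerically when the features are nearly collinear.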