This page aims to build a better understanding of linear regression and regularization.

Multivariate Linear Regression in Matrix Form

We are given a sample of $n$ paired observations, i.e. $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. Each input $x_i \in \R^{d}$ is a column vector of $d$ features, i.e.

$$ x_i = \begin{bmatrix} x_{i1}\\ x_{i2}\\ \vdots\\ x_{id} \end{bmatrix} $$

The inputs are also known as the predictors, regressors or covariates. The output will be assumed to be univariate, i.e. $y_i \in \R$.

A linear regression model assumes that the relationship between the dependent variable $y$ and the regressors $x$ is linear. That is:

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2x_{i2} +\dots +\beta_dx_{id} + \epsilon_i, \qquad i = 1, \dots, n $$

where $\beta_0$ is the intercept term, $\beta_1, \beta_2, \dots, \beta_d$ are the coefficients, and $\epsilon_i$ is the error (or noise) term.
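As a concrete illustration (with made-up numbers), take $d = 2$, $\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = -0.5$, and an observation $x_i = (3, 4)^\top$. The systematic part of the model is then

$$ \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} = 1 + 2\cdot 3 - 0.5\cdot 4 = 5, $$

so the model says $y_i = 5 + \epsilon_i$, i.e. the observed response scatters around $5$.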

For $i = 1,\dots,n$, this gives $n$ equations:

$$ \begin{align} \notag y_1 &= \beta_0 + \beta_1x_{11} + \beta_2x_{12} + \dots +\beta_dx_{1d} + \epsilon_1\\ \notag y_2 &= \beta_0 + \beta_1x_{21}+ \beta_2x_{22} + \dots +\beta_dx_{2d} + \epsilon_2\\ \notag &\;\;\vdots\\ \notag y_n &= \beta_0 + \beta_1x_{n1} + \beta_2x_{n2} + \dots +\beta_dx_{nd} + \epsilon_n \end{align} $$

For convenience, we often stack these $n$ equations and write them in matrix notation as follows:

$$ \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1d}\\ 1 & x_{21} & \dots & x_{2d}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & \dots & x_{nd} \end{bmatrix} \begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_d \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{bmatrix}\\ \implies Y = X\beta+\epsilon $$
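The matrix form translates directly into code. Below is a minimal NumPy sketch (the data are randomly generated purely for illustration, and the names `X`, `beta`, and `y` are just chosen to mirror the notation above) that builds the design matrix with an intercept column and generates responses from $Y = X\beta + \epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                     # n observations, d features
X_raw = rng.normal(size=(n, d))   # raw feature matrix, one row per observation x_i

# Design matrix: prepend a column of ones so beta_0 acts as the intercept
X = np.column_stack([np.ones(n), X_raw])   # shape (n, d + 1)

beta = np.array([1.0, 2.0, -0.5, 0.3])     # [beta_0, beta_1, ..., beta_d]
eps = rng.normal(scale=0.1, size=n)        # error terms epsilon_i

# Stacked model: Y = X beta + epsilon computes all n equations at once
y = X @ beta + eps                         # shape (n,)

# Sanity check: the i-th entry matches the scalar equation
i = 0
y_i = beta[0] + X_raw[i] @ beta[1:] + eps[i]
assert np.isclose(y[i], y_i)
```

The single matrix-vector product `X @ beta` evaluates all $n$ equations simultaneously, which is precisely why the stacked form $Y = X\beta + \epsilon$ is so convenient to work with.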