Before we dive into Bayesian learning, it is worth recapping Bayes' Theorem and the frequentist view of linear regression.
Events
Suppose events $B_1, B_2, \dots, B_m$ partition the sample space (that is, they are mutually exclusive and collectively exhaustive). Then, for each $B_k$ and any event $A$ with $P(A) > 0$, Bayes' Theorem gives (expanding the denominator with the law of total probability)
$$ P(B_k|A) = \frac{P(A|B_k)P(B_k)}{P(A)}= \frac{P(A|B_k) P(B_k)}{\sum_{i=1}^m P(A| B_i) P(B_i)} $$
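As a quick sanity check, here is a minimal numerical sketch of this formula for a three-event partition; the prior probabilities $P(B_i)$ and likelihoods $P(A \mid B_i)$ below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical three-event partition B_1, B_2, B_3 with priors P(B_i),
# and an observed event A with likelihoods P(A | B_i).
prior = np.array([0.5, 0.3, 0.2])        # P(B_i); sums to 1
likelihood = np.array([0.9, 0.5, 0.1])   # P(A | B_i)

# Law of total probability gives the denominator P(A)
p_a = np.sum(likelihood * prior)

# Bayes' Theorem: P(B_k | A) = P(A | B_k) P(B_k) / P(A)
posterior = likelihood * prior / p_a

print(posterior)        # posterior probabilities P(B_k | A)
print(posterior.sum())  # sanity check: sums to 1
```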
Random Variables
The same result holds for densities: given continuous random variables $X$ and $Y$, Bayes' Theorem is given by
$$ p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\,p_X(x)}{p_Y(y)} = \frac{p_{Y|X}(y|x)\,p_X(x)}{\int p_{Y|X}(y|x')\,p_X(x')\,dx'} $$
where each $p$ is a density function and $p_Y(y) > 0$.
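The integral in the denominator rarely has a closed form, but it can be approximated numerically. Below is a minimal sketch that evaluates the posterior density on a grid, assuming (purely for illustration) a standard normal prior $p_X$ and a Gaussian likelihood $p_{Y|X}(y|x) = \mathcal{N}(y;\, x,\, 0.5^2)$.

```python
import numpy as np

# Toy setup (assumed for illustration): prior X ~ N(0, 1), likelihood Y | X=x ~ N(x, 0.5^2).
x_grid = np.linspace(-5, 5, 1001)
prior = np.exp(-0.5 * x_grid**2) / np.sqrt(2 * np.pi)   # p_X(x) on the grid

y_obs = 1.2   # observed value of Y
sigma = 0.5
likelihood = np.exp(-0.5 * ((y_obs - x_grid) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))  # p_{Y|X}(y|x)

# Denominator p_Y(y): integrate likelihood * prior over x (trapezoidal rule on the grid)
p_y = np.trapz(likelihood * prior, x_grid)

# Posterior density p_{X|Y}(x|y) evaluated on the grid
posterior = likelihood * prior / p_y

print(np.trapz(posterior, x_grid))  # sanity check: integrates to ~1
```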
The linear regression model assumes that the relationship between the dependent variable $y$ and the regressors $x$ is linear. Given a sample of $n$ paired observations $(x_1, y_1), (x_2, y_2),\dots , (x_n, y_n)$, where each input $x_i \in \mathbb{R}^{1\times d}$ is a vector with $d$ features, the relationship between $y$ and $x$ is expressed as
$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2x_{i2} +\dots +\beta_dx_{id} + \epsilon_i, \qquad i = 1, \dots, n $$
where $\beta_0$ is the intercept term, $\beta_1, \beta_2, \dots, \beta_d$ are the coefficients, and $\epsilon_i$ is the error (or noise) term.
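To make the model concrete, here is a short sketch that simulates data from it; the dimensions $n = 100$, $d = 3$ and the particular coefficient and noise values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions and parameters for illustration
n, d = 100, 3
beta0 = 1.0                          # intercept beta_0
beta = np.array([2.0, -1.0, 0.5])    # coefficients beta_1, ..., beta_d

X = rng.normal(size=(n, d))              # each row is an input x_i in R^{1 x d}
eps = rng.normal(scale=0.3, size=n)      # error terms epsilon_i
y = beta0 + X @ beta + eps               # y_i = beta_0 + sum_j beta_j x_ij + epsilon_i
```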
For convenience, we express the formula using matrix notation, that is
$$ Y = X\beta+\epsilon $$
where $Y = (y_1, \dots, y_n)^\top$ and $\epsilon = (\epsilon_1, \dots, \epsilon_n)^\top$ are $n$-dimensional vectors, $X$ is the $n \times (d+1)$ design matrix whose first column is all ones (for the intercept) and whose remaining columns hold the features, and $\beta$ is a $(d + 1)$-dimensional vector containing the intercept term and the coefficients for each feature.
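Under the frequentist view, $\beta$ is estimated by ordinary least squares, i.e. by minimising $\lVert Y - X\beta \rVert^2$. The sketch below builds the design matrix with a leading column of ones and solves the least-squares problem; the simulated data mirrors the previous sketch and is assumed, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same simulated data as in the previous sketch (assumed, for illustration)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Prepend a column of ones so the first entry of beta corresponds to the intercept beta_0;
# the design matrix is then n x (d + 1), matching the (d + 1)-dimensional beta.
X_design = np.column_stack([np.ones(n), X])

# Ordinary least squares: beta_hat = argmin ||y - X beta||^2,
# solved with a numerically stable least-squares routine.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)   # roughly recovers [1.0, 2.0, -1.0, 0.5]
```

Using `np.linalg.lstsq` avoids explicitly forming and inverting $X^\top X$ from the normal equations, which is better behaved numerically when the features are nearly collinear.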