Bayesian Linear Model
Caleb Jin / 2019-04-18
Linear model
Consider a multiple linear regression model as follows: $$y = X\beta + \epsilon,$$ where $y=(y_1,y_2,\dots,y_n)^T$ is the $n$-dimensional response vector, $X=(x_1,x_2,\dots,x_n)^T$ is the $n\times p$ design matrix, and $\epsilon\sim N_n(0,\sigma^2 I_n)$. We assume that $p<n$ and $X$ is of full rank.
Maximum likelihood estimation (MLE) approach
Since $y\sim N_n(X\beta,\sigma^2 I_n)$, the likelihood function is given as $$L(\beta,\sigma^2)=f(y|\beta,\sigma^2)=(2\pi)^{-\frac{n}{2}}|\Sigma|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(y-X\beta)^T\Sigma^{-1}(y-X\beta)\right\},$$
where $\Sigma=\sigma^2 I_n$. Then the log-likelihood can be written as $$\ell(\beta,\sigma^2)=\log L(\beta,\sigma^2)=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta).$$ Note that $\ell(\beta,\sigma^2)$ is a concave function in $(\beta,\sigma^2)$. Hence, the maximum likelihood estimator (MLE) can be obtained by solving the following equations: $$\frac{\partial \ell(\beta,\sigma^2)}{\partial\beta}=-\frac{1}{2\sigma^2}(-2X^Ty+2X^TX\beta)=0;\qquad \frac{\partial \ell(\beta,\sigma^2)}{\partial\sigma^2}=-\frac{n}{2}\frac{1}{\sigma^2}+\frac{1}{2}\frac{1}{(\sigma^2)^2}(y-X\beta)^T(y-X\beta)=0.$$ Therefore, the MLEs of $\beta$ and $\sigma^2$ are given as $$\hat\beta=(X^TX)^{-1}X^Ty;\qquad \hat\sigma^2=\frac{(y-X\hat\beta)^T(y-X\hat\beta)}{n}=\frac{\|y-\hat y\|^2}{n},$$ where $\hat y=X\hat\beta$.
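As a quick illustration, the closed-form MLEs are a few lines of R. This is a minimal sketch on simulated data; the simulated design, the true coefficients, and the variable names are our own.

```r
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # n x p design with intercept
beta_true <- c(1, 2, -1)
y <- X %*% beta_true + rnorm(n)                      # y = X beta + eps, eps ~ N(0, 1)

beta_hat   <- solve(t(X) %*% X, t(X) %*% y)          # (X'X)^{-1} X'y
sigma2_hat <- sum((y - X %*% beta_hat)^2) / n        # ||y - y_hat||^2 / n
```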
Distribution of $\hat\beta$ and $\hat\sigma^2$
Note that if $z\sim N_k(\mu,\Sigma)$, then $Az\sim N_m(A\mu,A\Sigma A^T)$, where $A$ is an $m\times k$ matrix. Since $y\sim N_n(X\beta,\sigma^2 I_n)$ and $\hat\beta=(X^TX)^{-1}X^Ty$, we have $$\hat\beta\sim N_p\left((X^TX)^{-1}X^TX\beta,\ \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1}\right)=N_p\left(\beta,\ \sigma^2(X^TX)^{-1}\right).$$
Note that $y-\hat y=(I_n-X(X^TX)^{-1}X^T)y$, where $I_n-X(X^TX)^{-1}X^T$ is an idempotent matrix with rank $(n-p)$.
Can you prove that $I_n-X(X^TX)^{-1}X^T$ is an idempotent matrix of rank $(n-p)$?
Proof. Let $H=X(X^TX)^{-1}X^T$. Then $$HH=X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T=X(X^TX)^{-1}X^T=H,$$ thus $H$ is an idempotent matrix.
Similarly, as $(I_n-H)(I_n-H)=I_n-H$, $(I_n-H)$ is also idempotent.
Hence we have $$r(I_n-H)=\operatorname{tr}(I_n-H)=n-\operatorname{tr}(H)=n-\operatorname{tr}\left((X^TX)^{-1}X^TX\right)=n-p.$$
How to prove $r(I_n-H)=\operatorname{tr}(I_n-H)$?
The eigenvalues of an idempotent matrix are either 1 or 0; hence its rank equals the sum of its eigenvalues, which equals the trace of the matrix.
How to prove that the trace of a matrix is the sum of its eigenvalues?
It follows from the characteristic polynomial: in $\det(\lambda I_n-A)$, the coefficient of $\lambda^{n-1}$ is $-\operatorname{tr}(A)$, which also equals minus the sum of the roots, i.e., the eigenvalues.
Another way to prove this is in this link
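A quick numerical check of these facts in R (random $X$, so purely illustrative):

```r
set.seed(2)
n <- 10; p <- 3
X <- matrix(rnorm(n * p), n, p)
H <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
M <- diag(n) - H

all.equal(M %*% M, M)  # TRUE: M is idempotent
sum(diag(M))           # trace = n - p = 7
qr(M)$rank             # rank  = n - p = 7
```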
From Lemma 1, we have that $$\frac{n\hat\sigma^2}{\sigma^2}\sim\chi^2(n-p),\tag{1}$$ where $\chi^2(n-p)$ denotes the chi-squared distribution with $n-p$ degrees of freedom.
Can you prove Eq. (1)?
Proof. By Lemma 1, we have $$\frac{n\hat\sigma^2}{\sigma^2}=\frac{\hat e^T\hat e}{\sigma^2}=y^T\left(\frac{I_n-H}{\sigma^2}\right)y=y^TAy\sim\chi^2\left(\operatorname{tr}(A\Sigma),\ \mu^TA\mu/2\right),$$ where $A=(I_n-H)/\sigma^2$, $\mu=X\beta$, and $\Sigma=\sigma^2I_n$. Since $\operatorname{tr}(A\Sigma)=\operatorname{tr}(I_n-H)=n-p$, and $(I_n-H)X=0$ implies $\mu^TA\mu=0$, the noncentrality parameter vanishes and Eq. (1) follows.
Bayesian approach
$\sigma^2$ is known
Suppose $\sigma^2$ is known. We define the prior distribution of $\beta$ by $\beta\sim N_p(0,\sigma^2\nu I_p)$. Then the posterior density of $\beta$ can be obtained by $$\begin{aligned}\pi(\beta|y)&\propto f(y|\beta)\pi(\beta)\propto\exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right)\times\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right)\\ &=\exp\left[-\frac{1}{2\sigma^2}\left\{(y-X\beta)^T(y-X\beta)+\frac{1}{\nu}\beta^T\beta\right\}\right]\\ &\propto\exp\left\{-\frac{1}{2\sigma^2}\left(-2\beta^TX^Ty+\beta^TX^TX\beta+\frac{1}{\nu}\beta^T\beta\right)\right\}\\ &=\exp\left\{-\frac{1}{2\sigma^2}\left(-2\beta^T\left(X^TX+\frac{1}{\nu}I_p\right)\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}X^Ty+\beta^T\left(X^TX+\frac{1}{\nu}I_p\right)\beta\right)\right\}\\ &\propto\exp\left\{-\frac{1}{2\sigma^2}(\beta-\tilde\beta)^T\left(X^TX+\frac{1}{\nu}I_p\right)(\beta-\tilde\beta)\right\},\end{aligned}$$ where $\tilde\beta=(X^TX+\frac{1}{\nu}I_p)^{-1}X^Ty$.
This implies that $$\beta|y\sim N_p\left(\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}X^Ty,\ \sigma^2\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}\right).\tag{2}$$ The Bayesian estimate is $\hat\beta_{Bayes}=(X^TX+\frac{1}{\nu}I_p)^{-1}X^Ty$. It is worth noting that as $\nu\to\infty$, the posterior mean converges to $\hat\beta_{MLE}$ and the posterior covariance to $\sigma^2(X^TX)^{-1}$, matching the sampling distribution $\hat\beta_{MLE}\sim N_p(\beta,\sigma^2(X^TX)^{-1})$.
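Eq. (2) is a one-liner to evaluate. Below is a minimal sketch; the function and argument names are ours, with `nu` the prior variance scale $\nu$.

```r
# Posterior mean and covariance of beta | y when sigma^2 is known, per Eq. (2).
posterior_beta <- function(X, y, sigma2, nu) {
  p <- ncol(X)
  A <- t(X) %*% X + diag(p) / nu      # X'X + (1/nu) I_p
  list(mean = solve(A, t(X) %*% y),   # (X'X + I_p/nu)^{-1} X'y
       cov  = sigma2 * solve(A))
}
```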
$\sigma^2$ is unknown
In general, $\sigma^2$ is unknown. Then, we need to assign a reasonable prior distribution for $\sigma^2$. We consider the inverse gamma distribution, which is the conjugate prior here: $\sigma^2\sim IG(a_0,b_0)$ with density $$\pi(\sigma^2)=\frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left(-\frac{b_0}{\sigma^2}\right),$$ where $a_0>0$ and $b_0>0$. In addition, we need to introduce a prior for $\beta|\sigma^2$:
$$\beta|\sigma^2\sim N_p(0,\sigma^2\nu I_p).$$
We now derive the joint posterior distribution $$\pi(\beta,\sigma^2|y)\propto f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\pi(\sigma^2).$$
We show that:
- $\pi(\sigma^2|\beta,y)=IG(a^*,b^*)$;
- $\pi(\beta|y)$ is a multivariate t-distribution with $\nu^*$ degrees of freedom.
Proof (1). $$\begin{aligned}\pi(\sigma^2|\beta,y)&=\frac{f(y|\beta,\sigma^2)\pi(\beta,\sigma^2)}{\int f(y|\beta,\sigma^2)\pi(\beta,\sigma^2)d\sigma^2}=\frac{f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\pi(\sigma^2)}{\int f(y,\beta,\sigma^2)d\sigma^2}\\ &\propto f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\pi(\sigma^2)\\ &\propto(\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right)\times(\sigma^2)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right)\times(\sigma^2)^{-a_0-1}\exp\left(-\frac{b_0}{\sigma^2}\right)\\ &=(\sigma^2)^{-\left(\frac{n+p}{2}+a_0\right)-1}\exp\left[-\frac{1}{\sigma^2}\left\{\frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2+b_0\right\}\right]\\ &=(\sigma^2)^{-a^*-1}\exp\left(-\frac{b^*}{\sigma^2}\right),\end{aligned}$$ where $a^*=\frac{n+p}{2}+a_0$ and $b^*=\frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2+b_0$.
This implies that $$\sigma^2|\beta,y\sim IG\left(\frac{n+p}{2}+a_0,\ \frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2+b_0\right).$$
Proof (2). Integrating the inverse gamma kernel over $\sigma^2$ gives $$\begin{aligned}\pi(\beta|y)&=\int\pi(\beta,\sigma^2|y)d\sigma^2\propto\int f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\pi(\sigma^2)d\sigma^2\propto\Gamma(a^*){b^*}^{-a^*}\propto{b^*}^{-a^*}\\ &=\left[\frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2+b_0\right]^{-a^*}\\ &\propto\left[(y-X\beta)^T(y-X\beta)+\frac{1}{\nu}\beta^T\beta+2b_0\right]^{-a^*}\\ &=\left[y^Ty-2\beta^TX^Ty+\beta^TX^TX\beta+\frac{1}{\nu}\beta^T\beta+2b_0\right]^{-a^*}\\ &\propto\left[\beta^T\left(X^TX+\frac{1}{\nu}I_p\right)\beta-2\beta^T\left(X^TX+\frac{1}{\nu}I_p\right)\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}X^Ty+y^Ty+2b_0\right]^{-a^*}\\ &=\left[\beta^TA\beta-2\beta^TA\tilde\beta+\tilde\beta^TA\tilde\beta-\tilde\beta^TA\tilde\beta+y^Ty+2b_0\right]^{-a^*}\\ &=\left[(\beta-\tilde\beta)^TA(\beta-\tilde\beta)+y^Ty-y^TXA^{-1}X^Ty+2b_0\right]^{-a^*}\\ &=\left[(\beta-\tilde\beta)^TA(\beta-\tilde\beta)+y^T(I_n-XA^{-1}X^T)y+2b_0\right]^{-a^*}\\ &\propto\left[(\beta-\tilde\beta)^TA(\beta-\tilde\beta)+c^*\right]^{-a^*}\\ &\propto\left[1+\frac{1}{c^*}(\beta-\tilde\beta)^TA(\beta-\tilde\beta)\right]^{-\frac{n+p+2a_0}{2}}\\ &=\left[1+\frac{\nu^*}{\nu^*c^*}(\beta-\tilde\beta)^TA(\beta-\tilde\beta)\right]^{-\frac{p+\nu^*}{2}}\\ &=\left[1+\frac{1}{\nu^*}(\beta-\tilde\beta)^T\left(\frac{c^*}{\nu^*}A^{-1}\right)^{-1}(\beta-\tilde\beta)\right]^{-\frac{p+\nu^*}{2}}.\end{aligned}$$ This implies that $$\beta|y\sim MT\left(\tilde\beta,\ \frac{c^*}{\nu^*}A^{-1},\ \nu^*\right),\tag{3}$$
where $$A=X^TX+\frac{1}{\nu}I_p,\qquad \tilde\beta=A^{-1}X^Ty,\qquad c^*=y^T(I_n-XA^{-1}X^T)y+2b_0,\qquad \nu^*=n+2a_0.$$
Note: the density of the multivariate t-distribution $MT(\mu,\Sigma,\nu)$ appearing in Eq. (3) is $$\frac{\Gamma[(\nu+p)/2]}{\Gamma(\nu/2)\,\nu^{p/2}\pi^{p/2}|\Sigma|^{1/2}}\left[1+\frac{1}{\nu}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]^{-(\nu+p)/2}.$$
Monte Carlo Simulation
Suppose $\sigma^2$ is unknown. If $\beta\sim N_p(0,\sigma^2\nu I_p)$, we know the distribution of $\beta|y$ is Eq. (3). From this known distribution, we can easily compute $E(\beta|y)$ and $V(\beta|y)$. In practice, however, the density $\pi(\beta|y)$ is often an unknown and very complicated distribution, making it impossible to evaluate its integrals directly. Hence we turn to numerical methods, such as Monte Carlo methods.
For example, suppose $\pi(\theta|y)$ has no recognizable form and we want $E(\theta|y)=\int\theta\pi(\theta|y)d\theta$. According to Lemma 2 below, $\bar X_n\to\mu=E(X)$ as $n\to\infty$. Therefore, if we independently generate $\theta^{(1)},\theta^{(2)},\dots,\theta^{(M)}$ from $\pi(\theta|y)$, then $M^{-1}\sum_{k=1}^M\theta^{(k)}\to E(\theta|y)$ as $M\to\infty$. However, there are two problems:
- What if we cannot generate samples from $\pi(\theta|y)$?
- What if the samples are not independently and identically distributed?

The solutions to these two questions are the Monte Carlo (MC) method and Markov chain Monte Carlo (MCMC), respectively.
For the first question, we can use the importance sampling method.

Lemma 2 (Strong Law of Large Numbers). Let $X_1,X_2,\dots$ be a sequence of i.i.d. random variables, each having finite mean $m$. Then $$\Pr\left(\lim_{n\to\infty}\frac{1}{n}(X_1+X_2+\dots+X_n)=m\right)=1.$$
Importance Sampling
To estimate the mean of a parameter, we would usually sample from $\pi(\theta|y)$ directly and use the sample mean. In practice, however, it is hard to generate samples directly from $\pi(\theta|y)$, because we do not know which specific distribution it belongs to. Hence we turn to the importance sampling method.
Note that $$E(\theta|y)=\int\theta\pi(\theta|y)d\theta=\int\frac{\theta\pi(\theta|y)}{g(\theta)}g(\theta)d\theta=\int h(\theta)g(\theta)d\theta=E_g\{h(\theta)\},$$ where $h(\theta)=\frac{\theta\pi(\theta|y)}{g(\theta)}$ and $g(\theta)$ is a known pdf.
To estimate the variance, similarly, $$Var(\theta|y)=E[\theta-E(\theta|y)]^2=\int(\theta-E(\theta|y))^2\pi(\theta|y)d\theta=\int\frac{(\theta-E(\theta|y))^2\pi(\theta|y)}{g(\theta)}g(\theta)d\theta=\int h'(\theta)g(\theta)d\theta=E_g\{h'(\theta)\},$$ where $h'(\theta)=\frac{(\theta-E(\theta|y))^2\pi(\theta|y)}{g(\theta)}$ and $g(\theta)$ is a known pdf.
The importance sampling method can be implemented as follows: draw $\theta_1,\dots,\theta_M$ from $g$ and note that $$\frac{\sum_{i=1}^M h(\theta_i)}{M}\to E_g(h(\theta))=E(\theta|y)\quad\text{as } M\to\infty.$$ Estimating the variance by importance sampling is similar; a minimal sketch is given below.
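The sketch below assumes the posterior density is available in normalized form; the function names (`importance_mean`, `r_g`, `d_g`, `d_post`) are placeholders of ours, and the toy posterior in the usage example is invented for illustration.

```r
# Importance sampling estimate of E(theta | y): draw from a known density g
# and average h(theta) = theta * pi(theta | y) / g(theta).
importance_mean <- function(M, r_g, d_g, d_post) {
  theta <- r_g(M)                  # theta_1, ..., theta_M ~ g
  w <- d_post(theta) / d_g(theta)  # importance weights pi(theta | y) / g(theta)
  mean(theta * w)                  # -> E(theta | y) as M grows
}

# Toy example: posterior N(1, 0.5^2), proposal g = N(0, 1).
est <- importance_mean(1e4,
                       r_g    = function(M) rnorm(M),
                       d_g    = dnorm,
                       d_post = function(t) dnorm(t, mean = 1, sd = 0.5))
```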
Importance Sampling Simulation
Setup
Consider a simple linear regression model as follows:
$$y_i=\beta_0+\beta_1x_i+\varepsilon_i,$$ where $\varepsilon_i\sim N(0,1)$, $x_i\sim N(0,1)$, $\beta_0=1$, $\beta_1=2$, for $i=1,\dots,100$.
We already know that if $\beta\sim N_p(0,\sigma^2\nu I_p)$, then the distribution of $\beta|y$ is Eq. (2). Assume $\sigma=1$ is known and consider the known prior pdf $\pi(\beta)=\phi(\beta;(0,0)^T,10I_2)$; hence in this case $\nu=10$. Our simulation can be broken down into the following three steps:
1. Find the exact values of $E(\beta_0|y)$, $E(\beta_1|y)$, $V(\beta_0|y)$, and $V(\beta_1|y)$.
2. Use the MC method to simulate the results above with sample sizes of 100, 1000, and 5000.
3. Use the importance sampling method to simulate the same results with the same sample sizes as in (2), taking the known pdfs of the parameters to be $\phi(\beta_0|1,0.5)$ and $\phi(\beta_1|2,0.5)$.
Simulation Results
We use R software for this simulation.
- According to Eq. (2), compute $E(\beta|y)=(X^TX+\frac{1}{\nu}I_p)^{-1}X^Ty$ and $Var(\beta|y)=\sigma^2(X^TX+\frac{1}{\nu}I_p)^{-1}$, where $\sigma=1$ and $\nu=10$. We get the following results:

| $E(\beta_0)$ | $E(\beta_1)$ | $Var(\beta_0)$ | $Var(\beta_1)$ |
|---|---|---|---|
| 1.108 | 1.997 | 0.010 | 0.011 |

- Sample directly from the normal distribution of Eq. (2) with sample sizes of 100, 1000, and 5000, then compute the mean and variance of the samples. We get the following results:
| Size | $E(\beta_0)$ | $E(\beta_1)$ | $Var(\beta_0)$ | $Var(\beta_1)$ |
|---|---|---|---|---|
| 100 | 1.104 | 2.002 | 0.009 | 0.012 |
| 1000 | 1.107 | 1.993 | 0.011 | 0.012 |
| 5000 | 1.108 | 1.996 | 0.010 | 0.011 |
- Sample $\beta_0$ and $\beta_1$ directly from $\phi(\beta_0|1,0.5)$ and $\phi(\beta_1|2,0.5)$, respectively, with sample sizes of 100, 1000, and 5000. Then compute $h(\theta_i)$ and $h'(\theta_i)$, and take their means to estimate the expectation and variance. The final results are as follows:

| Size | $E(\beta_0)$ | $E(\beta_1)$ | $Var(\beta_0)$ | $Var(\beta_1)$ |
|---|---|---|---|---|
| 100 | 1.0500 | 2.0623 | 0.0175 | 0.0127 |
| 1000 | 1.1323 | 1.9521 | 0.0106 | 0.0127 |
| 5000 | 1.1269 | 2.0478 | 0.0107 | 0.0137 |
$\sigma^2$ is unknown
Suppose $\sigma^2$ is unknown; then we cannot use $\pi(\beta|y,\sigma^2)$ directly. But we know $\pi(\beta,\sigma^2|y)=\pi(\beta|y,\sigma^2)\pi(\sigma^2|y)$, and we already know $$\beta|y,\sigma^2\sim N_p\left(\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}X^Ty,\ \left(X^TX+\frac{1}{\nu}I_p\right)^{-1}\sigma^2\right).$$
So $$\begin{aligned}\pi(\sigma^2|y)&=\int\pi(\sigma^2,\beta|y)d\beta\propto\int f(y|\sigma^2,\beta)\pi(\beta|\sigma^2)\pi(\sigma^2)d\beta=\int f(y|\sigma^2,\beta)\pi(\beta|\sigma^2)d\beta\times\pi(\sigma^2)\\ &\propto\int(\sigma^2)^{-\frac{n}{2}}\exp\left[-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right](\sigma^2)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\beta^T\beta\right)d\beta\times\pi(\sigma^2)\\ &\propto\int\exp\left[-\frac{1}{2\sigma^2}\left(y^Ty-2\beta^TX^Ty+\beta^TX^TX\beta+\frac{1}{\nu}\beta^T\beta\right)\right]d\beta\left((\sigma^2)^{-\frac{1}{2}(n+p)}\pi(\sigma^2)\right)\\ &=\int\exp\left[-\frac{1}{2\sigma^2}\left(\beta^TA\beta-2\beta^TA\tilde\beta+y^Ty\right)\right]d\beta\left((\sigma^2)^{-\frac{1}{2}(n+p)}\pi(\sigma^2)\right)\\ &=\int\exp\left[-\frac{1}{2\sigma^2}(\beta-\tilde\beta)^TA(\beta-\tilde\beta)\right]d\beta\times\exp\left[-\frac{1}{2\sigma^2}\left(y^Ty-y^TXA^{-1}X^Ty\right)\right]\left((\sigma^2)^{-\frac{1}{2}(n+p)}\pi(\sigma^2)\right)\\ &=(2\pi)^{\frac{p}{2}}|\sigma^2A^{-1}|^{\frac{1}{2}}\exp\left[-\frac{1}{2\sigma^2}y^T(I_n-XA^{-1}X^T)y\right]\left((\sigma^2)^{-\frac{1}{2}(n+p)}\pi(\sigma^2)\right)\\ &\propto(\sigma^2)^{\frac{p}{2}-\frac{1}{2}(n+p)}(\sigma^2)^{-a_0-1}\exp\left[-\frac{1}{\sigma^2}\left(b_0+\frac{1}{2}y^T(I_n-XA^{-1}X^T)y\right)\right]\\ &=(\sigma^2)^{-\left(\frac{n}{2}+a_0\right)-1}\exp\left[-\frac{1}{\sigma^2}\left(b_0+\frac{1}{2}y^T(I_n-XA^{-1}X^T)y\right)\right]=(\sigma^2)^{-a^*-1}\exp\left(-\frac{b^*}{\sigma^2}\right),\end{aligned}$$ where $A=X^TX+\frac{1}{\nu}I_p$, $\tilde\beta=A^{-1}X^Ty$, $a^*=\frac{n}{2}+a_0$, and $b^*=b_0+\frac{1}{2}y^T(I_n-XA^{-1}X^T)y$.
This is proportional to the pdf of $IG(a^*,b^*)$. Hence we have $$E(\beta|y)=\int\beta\pi(\beta|y)d\beta=\int\beta\left[\int\pi(\beta,\sigma^2|y)d\sigma^2\right]d\beta=\iint\beta\,\pi(\beta|y,\sigma^2)\pi(\sigma^2|y)\,d\sigma^2d\beta.$$ Similarly, $$Var(\beta|y)=\int(\beta-E(\beta|y))(\beta-E(\beta|y))^T\pi(\beta|y)d\beta=\iint(\beta-E(\beta|y))(\beta-E(\beta|y))^T\pi(\beta|y,\sigma^2)\pi(\sigma^2|y)\,d\sigma^2d\beta.$$
Plug-in Sampling Method
We can estimate the mean of $\beta$ by computing $E(\beta|y)\approx\sum_{m=1}^M\beta^{(m)}/M$, where the $\beta^{(m)}$ are generated as follows (see the sketch after this list):
- Generate $\sigma^{2(m)}$ from $\pi(\sigma^2|y)$, which is $IG(a^*,b^*)$.
- Given $\sigma^{2(m)}$, generate $\beta^{(m)}$ from $\pi(\beta|y,\sigma^{2(m)})$, which is the normal distribution in Eq. (2).
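A minimal sketch of this two-step (composition) sampler, assuming `MASS` is available for multivariate normal draws; the function name is ours. An $IG(a,b)$ draw is obtained as the reciprocal of a gamma draw.

```r
# Plug-in (composition) sampler: sigma2 ~ IG(a*, b*), then
# beta | sigma2, y ~ N(A^{-1} X'y, sigma2 A^{-1}).
library(MASS)  # for mvrnorm

plugin_sample <- function(M, X, y, nu, a0, b0) {
  n <- nrow(X); p <- ncol(X)
  A_inv  <- solve(t(X) %*% X + diag(p) / nu)
  mu     <- A_inv %*% t(X) %*% y
  a_star <- n / 2 + a0
  b_star <- b0 + 0.5 * drop(t(y) %*% (diag(n) - X %*% A_inv %*% t(X)) %*% y)
  t(sapply(seq_len(M), function(m) {
    sigma2 <- 1 / rgamma(1, shape = a_star, rate = b_star)  # sigma2 ~ IG(a*, b*)
    mvrnorm(1, mu = mu, Sigma = sigma2 * A_inv)             # beta | sigma2, y
  }))                                                       # M x p matrix of draws
}
```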
Simulation Results
We use the same setup as in the section Importance Sampling Simulation. The final result is as follows:

| Size | $E(\beta_0)$ | $E(\beta_1)$ | $Var(\beta_0)$ | $Var(\beta_1)$ |
|---|---|---|---|---|
| 100 | 1.1099 | 1.9960 | 0.0073 | 0.0107 |
| 1000 | 1.1049 | 2.0004 | 0.0079 | 0.0084 |
| 5000 | 1.1081 | 1.9966 | 0.0081 | 0.0088 |
Markov Chain Monte Carlo Simulation
Gibbs Sampling
Suppose $\sigma^2$ is unknown. We know $$\sigma^2|y,\beta_0,\beta_1\sim IG(a^*,b^*),\qquad \beta|y,\sigma^2\sim N_p\left(\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}X^Ty,\ \left(X^TX+\frac{1}{\nu}I_p\right)^{-1}\sigma^2\right),$$ hence we can get the conditional distribution of each parameter: $$\beta_0|\beta_1,y,\sigma^2\sim N\left(\mu_0+\frac{\sigma_0}{\sigma_1}\rho(\beta_1-\mu_1),\ (1-\rho^2)\sigma_0^2\right),\qquad \beta_1|\beta_0,y,\sigma^2\sim N\left(\mu_1+\frac{\sigma_1}{\sigma_0}\rho(\beta_0-\mu_0),\ (1-\rho^2)\sigma_1^2\right),$$ where $\sigma_0^2=[1,0]\sigma^2A^{-1}[1,0]^T$, $\sigma_1^2=[0,1]\sigma^2A^{-1}[0,1]^T$, $\rho=[1,0]\sigma^2A^{-1}[0,1]^T/(\sigma_0\sigma_1)$, $\mu_0=[1,0]A^{-1}X^Ty$, and $\mu_1=[0,1]A^{-1}X^Ty$.
To generate $\beta^{(m)}$ from $\pi(\beta|y)$, we can use an MCMC method, Gibbs sampling. Gibbs sampling, one of the most popular MCMC methods, can be implemented as follows:

Set the initial value $(\beta_0^{(0)},\beta_1^{(0)},\sigma^{2(0)})$. Then iterate the following steps for $t=0,1,2,\dots$:

Step 1: Generate $\sigma^{2(t+1)}$ from $\pi(\sigma^2|y,\beta_0^{(t)},\beta_1^{(t)})$.
Step 2: Generate $\beta_0^{(t+1)}$ from $\pi(\beta_0|y,\beta_1^{(t)},\sigma^{2(t+1)})$.
Step 3: Generate $\beta_1^{(t+1)}$ from $\pi(\beta_1|y,\beta_0^{(t+1)},\sigma^{2(t+1)})$.
We set the total sample size to 5000, with a burn-in period of 2000. A minimal sketch of such a sampler is given below.
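This sketch draws the whole $\beta$ block at once from Eq. (2), which is an equivalent (blocked) Gibbs scheme to the coordinate-wise updates in Steps 2 and 3 above. It assumes `MASS` for `mvrnorm`; the function name and the zero initial value are ours.

```r
# Gibbs sampler alternating sigma2 | y, beta and beta | y, sigma2.
library(MASS)

gibbs_lm <- function(n_iter, burn, X, y, nu, a0, b0) {
  n <- nrow(X); p <- ncol(X)
  A_inv <- solve(t(X) %*% X + diag(p) / nu)
  mu    <- A_inv %*% t(X) %*% y
  beta  <- rep(0, p)                          # initial value beta^(0)
  draws <- matrix(NA_real_, n_iter, p + 1)
  for (t in seq_len(n_iter)) {
    b_star <- b0 + 0.5 * sum((y - X %*% beta)^2) + 0.5 * sum(beta^2) / nu
    sigma2 <- 1 / rgamma(1, shape = (n + p) / 2 + a0, rate = b_star)  # sigma2 | y, beta
    beta   <- mvrnorm(1, mu = mu, Sigma = sigma2 * A_inv)             # beta | y, sigma2
    draws[t, ] <- c(beta, sigma2)
  }
  draws[(burn + 1):n_iter, ]                  # discard the burn-in draws
}
```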
The final result is as follows:
| Size | $E(\beta_0)$ | $E(\beta_1)$ | $Var(\beta_0)$ | $Var(\beta_1)$ |
|---|---|---|---|---|
| 5000 | 1.1096 | 1.9955 | 0.0082 | 0.0090 |
The diagnostic plots used to assess the convergence of the sampler are shown in Figure 1.

Figure 1: $\beta_0$; $\beta_1$; the marginals and the joint distribution.
Bayesian vs. Frequentist Case Study
Multivariate Normal Conditional Distribution
- The conditional posterior of $\beta_j$ is obtained as follows: $$\begin{aligned}\pi(\beta_j|\beta_{-j},\sigma^2,y)&\propto\pi(\beta_j,\beta_{-j}|\sigma^2,y)=\pi(\beta|\sigma^2,y)\propto f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\\ &\propto\exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right)\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right)\\ &\propto\exp\left[-\frac{1}{2\sigma^2}\left((y-X_{-j}\beta_{-j}-x_j\beta_j)^T(y-X_{-j}\beta_{-j}-x_j\beta_j)+\frac{1}{\nu}\beta_j^2\right)\right]\\ &\propto\exp\left[-\frac{1}{2\sigma^2}\left(-2\beta_jx_j^Ty+2\beta_jx_j^TX_{-j}\beta_{-j}+\beta_j^2x_j^Tx_j+\frac{1}{\nu}\beta_j^2\right)\right]\\ &=\exp\left[-\frac{1}{2\sigma^2}\left(\beta_j^2\left(x_j^Tx_j+\frac{1}{\nu}\right)-2\beta_jx_j^T(y-X_{-j}\beta_{-j})\right)\right]\\ &=\exp\left[-\frac{1}{2\sigma^2}\left(x_j^Tx_j+\frac{1}{\nu}\right)\left(\beta_j^2-2\beta_j\left(x_j^Tx_j+\frac{1}{\nu}\right)^{-1}x_j^T(y-X_{-j}\beta_{-j})\right)\right]\\ &\propto\exp\left[-\frac{1}{2\sigma^2}\left(x_j^Tx_j+\frac{1}{\nu}\right)(\beta_j-\tilde\beta_j)^2\right],\end{aligned}$$ where $\tilde\beta_j=(x_j^Tx_j+\frac{1}{\nu})^{-1}x_j^T(y-X_{-j}\beta_{-j})$, $X_{-j}$ is the submatrix of $X$ without the $j$th column, and $\beta_{-j}$ is the subvector of $\beta$ without the $j$th element. Hence $$\beta_j|\beta_{-j},\sigma^2,y\sim N\left(\left(x_j^Tx_j+\frac{1}{\nu}\right)^{-1}x_j^T(y-X_{-j}\beta_{-j}),\ \sigma^2\left[x_j^Tx_j+\frac{1}{\nu}\right]^{-1}\right).$$
Setup
- Consider $$f(y|\beta,\sigma^2)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right),\qquad \pi(\beta|\sigma^2)=(2\pi\sigma^2\nu)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right).$$
- Generate 30 samples from $$y=1+1\times x_1+2\times x_2+0\times x_3+0\times x_4+10\times x_5+\epsilon,$$ where $\epsilon\sim N(0,2)$ and $x_i\sim N(0,1)$; that is, $\beta=(1,1,2,0,0,10)$.
- Use the frequentist (OLS) method to estimate $\beta$.
- Use Bayesian MC.
- Use Bayesian MCMC.
Computation Algorithm
In the frequentist way, that is, the OLS method, the estimate of $\beta$ is $\hat\beta=(X^TX)^{-1}X^Ty$. The mean squared error is $\|\hat\beta-\beta\|^2/6$.
In the Bayesian MC way, we assume that
- $\sigma^2\sim IG(a_0,b_0)$,
- $\beta|\sigma^2\sim N(0,\sigma^2\nu I_p)$.

Then we have proved that
- $\sigma^2|y\sim IG(a^*,b^*)$, where $a^*=\frac{n}{2}+a_0$ and $b^*=b_0+\frac{1}{2}y^T(I_n-XA^{-1}X^T)y$;
- $\beta|y,\sigma^2\sim N(A^{-1}X^Ty,A^{-1}\sigma^2)$, where $A=X^TX+\frac{1}{\nu}I_p$.
Hence the $\beta^{(m)}$ are generated as in the plug-in sampling method above. Finally, we estimate $\hat\beta$ by computing $E(\beta|y)\approx\sum_{m=1}^M\beta^{(m)}/M$. The mean squared error is $\|\hat\beta-\beta\|^2/6$.
In the Bayesian MCMC way, we have proved that
- $\sigma^2|y,\beta\sim IG(a^*,b^*)$, where $a^*=\frac{n+p}{2}+a_0$ and $b^*=b_0+\frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2$;
- $\beta|y,\sigma^2\sim N(A^{-1}X^Ty,A^{-1}\sigma^2)$, where $A=X^TX+\frac{1}{\nu}I_p$;
- $\beta_j|\beta_{-j},\sigma^2,y\sim N\left((x_j^Tx_j+\frac{1}{\nu})^{-1}x_j^T(y-X_{-j}\beta_{-j}),\ \sigma^2[x_j^Tx_j+\frac{1}{\nu}]^{-1}\right)$.
Then, we use the Gibbs sampling method to get $\beta^{(m)}=(\beta_0^{(m)},\beta_1^{(m)},\dots,\beta_5^{(m)})$. Set the initial values $(\beta_0^{(0)},\beta_1^{(0)},\dots,\beta_5^{(0)},\sigma^{2(0)})$, then iterate the following steps for $t=0,1,2,\dots$:
- Generate $\sigma^{2(t+1)}$ from $\pi(\sigma^2|y,\beta_0^{(t)},\beta_1^{(t)},\dots,\beta_5^{(t)})$.
- Generate $\beta_0^{(t+1)}$ from $\pi(\beta_0|y,\beta_{-0}^{(t)},\sigma^{2(t+1)})$.
- Update $\beta^{(t)}$ to $(\beta_0^{(t+1)},\beta_1^{(t)},\dots,\beta_5^{(t)})$.
- Generate $\beta_1^{(t+1)}$ from $\pi(\beta_1|y,\beta_{-1}^{(t)},\sigma^{2(t+1)})$.
- Update $\beta^{(t)}$ to $(\beta_0^{(t+1)},\beta_1^{(t+1)},\beta_2^{(t)},\dots,\beta_5^{(t)})$; similarly, generate $\beta_2^{(t+1)},\dots,\beta_5^{(t+1)}$, updating $\beta^{(t)}$ after generating each $\beta_i^{(t+1)}$,

where $\beta_{-i}^{(t)}$ is the subvector of $\beta$ without the $i$th element.
We set the total sample size to 5000, with a burn-in period of 2000.
Finally, we estimate $\hat\beta$ by computing $E(\beta|y)\approx\sum_{m=2001}^{M}\beta^{(m)}/(M-2000)$. The mean squared error is $\|\hat\beta-\beta\|^2/6$. A minimal sketch of the coordinate-wise update used inside each Gibbs iteration is given below.
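This helper implements the univariate conditional $\beta_j|\beta_{-j},\sigma^2,y$ derived above; the function name is ours, and it is meant to be called once per coordinate inside the loop.

```r
# One coordinate-wise Gibbs update: draw beta_j from its univariate
# normal conditional beta_j | beta_{-j}, sigma2, y.
draw_beta_j <- function(j, beta, X, y, sigma2, nu) {
  xj <- X[, j]
  r  <- y - X[, -j, drop = FALSE] %*% beta[-j]  # partial residual y - X_{-j} beta_{-j}
  prec <- sum(xj^2) + 1 / nu                    # x_j'x_j + 1/nu
  rnorm(1, mean = sum(xj * r) / prec, sd = sqrt(sigma2 / prec))
}
```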
Simulation Results
| Method | $\beta_0$ | $\beta_1$ | $\beta_2$ | $\beta_3$ | $\beta_4$ | $\beta_5$ | MSE |
|---|---|---|---|---|---|---|---|
| Frequentist | 1.194 | 0.991 | 2.080 | -0.082 | -0.051 | 9.608 | 0.035 |
| Bayesian MC | 1.195 | 0.972 | 2.084 | -0.090 | -0.068 | 9.570 | 0.041 |
| Bayesian MCMC | 1.187 | 0.986 | 2.071 | -0.085 | -0.058 | 9.581 | 0.038 |
Model Selection
Suppose we have $k$ candidate models, $\mathcal{M}=\{M_1,M_2,\dots,M_k\}$. A Bayesian way to select the best model is to find the model that maximizes the posterior model probability given the data, that is,
$$\hat M=\arg\max_{M\in\mathcal M}\pi(M|y),\quad\text{where}\quad \pi(M|y)=\frac{f(y|M)\pi(M)}{\sum_{\tilde M\in\mathcal M}f(y|\tilde M)\pi(\tilde M)},$$ in which the denominator is free of $M$. Note that the prior is $\pi(M)=1/k$. Then we have $$\pi(M|y)=\frac{f(y|M)}{\sum_{\tilde M\in\mathcal M}f(y|\tilde M)}.$$ This leads to $$\hat M=\arg\max_{M\in\mathcal M}f(y|M),\quad\text{where}\quad f(y|M)=\int f(y|\theta_M,M)\pi(\theta_M)d\theta_M,$$ and $\theta_M$ is the parameter under model $M$. We call $f(y|M)$ the marginal likelihood under model $M$, $f(y|\theta_M,M)$ the likelihood under model $M$, and $\pi(\theta_M)$ the prior under model $M$. Under $M$, $$\begin{aligned}f(y|M)&=\int f(y|\theta,M)\pi(\theta|M)d\theta=\iint f(y|\beta,\sigma^2,M)\pi(\beta|\sigma^2,M)\pi(\sigma^2|M)d\beta d\sigma^2\\ &=\iint(2\pi)^{-\frac{n}{2}}(\sigma^2)^{-\frac{n}{2}}\exp\left[-\frac{1}{2\sigma^2}(y-X\beta_M)^T(y-X\beta_M)\right]\\ &\qquad\times(2\pi)^{-\frac{p_M}{2}}(\sigma^2\nu)^{-\frac{p_M}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\beta_M^T\beta_M\right)\times\frac{b_1^{a_1}}{\Gamma(a_1)}(\sigma^2)^{-a_1-1}\exp\left(-\frac{b_1}{\sigma^2}\right)d\beta_Md\sigma^2\\ &=\frac{b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p_M}{2}}}(2\pi)^{-\frac{n+p_M}{2}}\iint(\sigma^2)^{-\frac{n+p_M+2a_1+2}{2}}\\ &\qquad\times\exp\left[-\frac{1}{2\sigma^2}\left(y^Ty-2\beta_M^TX^Ty+\beta_M^TX^TX\beta_M+\frac{1}{\nu}\beta_M^T\beta_M+2b_1\right)\right]d\beta_Md\sigma^2.\end{aligned}$$ Note that $$\begin{aligned}y^Ty-2\beta_M^TX^Ty+\beta_M^TX^TX\beta_M+\frac{1}{\nu}\beta_M^T\beta_M+2b_1&=\beta_M^TA\beta_M-2\beta_M^TA\tilde\beta_M+\tilde\beta_M^TA\tilde\beta_M-\tilde\beta_M^TA\tilde\beta_M+y^Ty+2b_1\\ &=(\beta_M-\tilde\beta_M)^TA(\beta_M-\tilde\beta_M)+y^T(I_n-XA^{-1}X^T)y+2b_1.\end{aligned}$$ Therefore, $$\begin{aligned}f(y|M)&=\frac{b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p_M}{2}}}(2\pi)^{-\frac{n+p_M}{2}}\int(\sigma^2)^{-\frac{n+p_M+2a_1+2}{2}}(2\pi)^{\frac{p_M}{2}}(\sigma^2)^{\frac{p_M}{2}}|A|^{-\frac{1}{2}}\exp\left[-\frac{1}{2\sigma^2}\left(y^T(I_n-XA^{-1}X^T)y+2b_1\right)\right]d\sigma^2\\ &=\frac{b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p_M}{2}}}(2\pi)^{-\frac{n}{2}}|A|^{-\frac{1}{2}}\int(\sigma^2)^{-a^*_M-1}\exp\left(-\frac{b^*_M}{\sigma^2}\right)d\sigma^2\\ &=(2\pi)^{-\frac{n}{2}}\frac{\Gamma(a^*_M)\,{b^*_M}^{-a^*_M}\,b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p_M}{2}}}|A|^{-\frac{1}{2}},\end{aligned}$$ where $$A=X^TX+\frac{1}{\nu}I_{p_M},\quad \tilde\beta_M=A^{-1}X^Ty,\quad a^*_M=\frac{n+p_M}{2}+a_1,\quad b^*_M=\frac{1}{2}y^T(I_n-XA^{-1}X^T)y+b_1.$$ After a Laplace approximation, maximizing $f(y|M)$ is equivalent to minimizing BIC for large $n$.
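The closed form above can be evaluated directly. Below is a minimal sketch on the log scale; the function name is ours, and `X` is the design matrix of the candidate model (its $p_M$ columns).

```r
# Closed-form log marginal likelihood log f(y | M) from the derivation above.
log_marginal <- function(X, y, nu, a1, b1) {
  n <- nrow(X); pM <- ncol(X)
  A      <- t(X) %*% X + diag(pM) / nu
  A_inv  <- solve(A)
  a_star <- (n + pM) / 2 + a1
  b_star <- b1 + 0.5 * drop(t(y) %*% (diag(n) - X %*% A_inv %*% t(X)) %*% y)
  as.numeric(-n / 2 * log(2 * pi) + lgamma(a_star) - a_star * log(b_star) +
             a1 * log(b1) - lgamma(a1) - pM / 2 * log(nu) -
             0.5 * determinant(A, logarithm = TRUE)$modulus)
}
```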
MCMC
Minghui Chen (IWDE)
The goal is to compute $m(y)=\int f(y|\theta)\pi(\theta)d\theta$.
$$\pi(\theta|y)=\frac{f(y|\theta)\pi(\theta)}{\int f(y|\theta)\pi(\theta)d\theta}.$$ Let $g(\theta)$ be a density function of $\theta$. Then, $$\begin{aligned}\int g(\theta)d\theta=1&\Leftrightarrow\int\frac{g(\theta)}{\pi(\theta|y)}\pi(\theta|y)d\theta=1\Leftrightarrow\int\frac{g(\theta)}{f(y|\theta)\pi(\theta)/m(y)}\pi(\theta|y)d\theta=1\\ &\Leftrightarrow m(y)\int\frac{g(\theta)}{f(y|\theta)\pi(\theta)}\pi(\theta|y)d\theta=1\\ &\Leftrightarrow m(y)=\left[\int\frac{g(\theta)}{f(y|\theta)\pi(\theta)}\pi(\theta|y)d\theta\right]^{-1}=E\left[\left.\frac{g(\theta)}{f(y|\theta)\pi(\theta)}\right|y\right]^{-1}\approx\left[\frac{1}{M}\sum_{m=1}^M\frac{g(\theta^{(m)})}{f(y|\theta^{(m)})\pi(\theta^{(m)})}\right]^{-1},\end{aligned}$$ where $\theta^{(m)}$ is drawn from $\pi(\theta|y)$.
Therefore, maximizing $m(y)$ over models is equivalent to maximizing $E\left[\left.\frac{g(\theta)}{f(y|\theta)\pi(\theta)}\right|y\right]^{-1}$.
Now let's use Minghui Chen's method to compute and maximize $f(y|M_i)$.
$$f(y|M_i)=\int f(y|\theta,M_i)\pi(\theta|M_i)d\theta,\quad\text{where }\theta=(\beta_0,\beta_1,\sigma^2),$$ $$\pi(\theta|y,M_i)=\frac{f(y|\theta,M_i)\pi(\theta|M_i)}{\int f(y|\theta,M_i)\pi(\theta|M_i)d\theta}=\frac{f(y|\theta,M_i)\pi(\theta|M_i)}{f(y|M_i)}.$$ Let $g(\theta)$ be a density function of $\theta$. Then $$\begin{aligned}\int g(\theta)d\theta=1&\Leftrightarrow\int\frac{g(\theta)}{\pi(\theta|y,M_i)}\pi(\theta|y,M_i)d\theta=1\Leftrightarrow\int\frac{g(\theta)f(y|M_i)}{f(y|\theta,M_i)\pi(\theta|M_i)}\pi(\theta|y,M_i)d\theta=1\\ &\Leftrightarrow f(y|M_i)=\left[\int\frac{g(\theta)}{f(y|\theta,M_i)\pi(\theta|M_i)}\pi(\theta|y,M_i)d\theta\right]^{-1}\\ &\qquad=E\left[\left.\frac{g(\theta)}{f(y|\theta,M_i)\pi(\theta|M_i)}\right|y,M_i\right]^{-1}\approx\left[\frac{1}{N}\sum_{n=1}^N\frac{g(\theta^{(n)})}{f(y|\theta^{(n)},M_i)\pi(\theta^{(n)}|M_i)}\right]^{-1},\end{aligned}$$ where $\theta^{(n)}$ is drawn from $\pi(\theta|y,M_i)$, $i=1,2,\dots,k$.
The remaining issue is the choice of $g(\theta)$ that minimizes the MC error.
Recall $\pi(\theta|y)\propto f(y|\theta)\pi(\theta)\approx g(\theta)$. Let $g(\theta)$ be a normal density such that $E_g(\theta)=$ posterior mean $\approx\sum_{m=1}^M\theta^{(m)}/M=\bar\theta$ and $V_g(\theta)=$ posterior covariance matrix $\approx\sum_{m=1}^M(\theta^{(m)}-\bar\theta)(\theta^{(m)}-\bar\theta)^T/M$, where $\theta^{(m)}$ is an MCMC sample from $\pi(\theta|y)$.
For a given model $M$, using MCMC, generate $\theta_M^{(k)}$ from $\pi(\theta_M|y,M)$, then compute $\bar\theta_M$ and $V_M$.
Define $g(\theta|M)=N(\theta;\bar\theta_M,V_M)$.
Compute $$f(y|M)\approx\left[\frac{1}{K}\sum_{k=1}^K\frac{g(\theta^{(k)}|M)}{f(y|\theta^{(k)},M)\pi(\theta^{(k)}|M)}\right]^{-1}.\tag{4}$$
Bayesian Simulation Case Study
Setup
- Consider $$f(y|\beta,\sigma^2)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right),\qquad \pi(\beta|\sigma^2)=(2\pi\sigma^2\nu)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right).$$
- Generate 30 samples from $$y=1+1\times x_1+2\times x_2+0\times x_3+0\times x_4+10\times x_5+\epsilon,$$ where $\epsilon\sim N(0,2)$ and $x_i\sim N(0,1)$; that is, $\beta=(1,1,2,0,0,10)$.
- Therefore, the true model, say $M_{true}$, is $y=1+1\times x_1+2\times x_2+10\times x_5+\epsilon$.
- Use the Bayesian MC method to find the true model.
- Use the frequentist method to find the true model.

Our candidate models are all $2^5=32$ subsets of the five covariates: $M_1: y=\beta_0+\epsilon$; $M_2: y=\beta_0+\beta_1x_1+\epsilon$; $\dots$; $M_{32}: y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5+\epsilon$.
- For the Bayesian way, compute $f(y|M_k)$ for $k=1,2,\dots,32$ and select the best model $\hat M$ such that $f(y|\hat M)\geq f(y|M_k)$ for all $k$. If $\hat M=M_{true}$, then assign $Z_i=1$; otherwise $Z_i=0$.
Repeat the above procedure $R=500$ times and compute $\frac{\sum_{i=1}^RZ_i}{R}\simeq P(\text{selecting the true model with the Bayes method})$.
- For the frequentist way, similarly, fit the 32 candidate linear models and compute AIC and BIC. If the model with the smallest AIC or BIC is the true model, then $W_i=1$; otherwise $W_i=0$. Repeat 500 times and compute $\frac{\sum_{i=1}^RW_i}{R}$.
We consider sample sizes of 20, 50, 100, and 300 so that we can see how the sample size impacts the probability of selecting the true model.
- For MCMC, sample with the MCMC method, then compute Eq. (4) for the 32 candidate models and select the best model $\hat M$ such that $f(y|\hat M)\geq f(y|M_k)$ for all $k$. If $\hat M=M_{true}$, then assign $Z_i=1$; otherwise $Z_i=0$.
Repeat the above procedure $R=500$ times and compute $\frac{\sum_{i=1}^RZ_i}{R}\simeq P(\text{selecting the true model with the MCMC method})$. A sketch of the frequentist enumeration step is given below.
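A minimal sketch of enumerating the 32 candidate models and picking the smallest BIC; the function name is ours, and `dat` is assumed to be a data frame with columns `y, x1, ..., x5`. Swapping `BIC` for `AIC` (or for `log_marginal` above) gives the other selection rules.

```r
# Enumerate all 2^5 = 32 candidate models; return the index of the BIC winner.
select_by_bic <- function(dat) {
  vars    <- paste0("x", 1:5)
  subsets <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), 5)))  # 32 inclusion patterns
  bics <- apply(subsets, 1, function(inc) {
    rhs <- if (any(inc)) paste(vars[inc], collapse = " + ") else "1"
    BIC(lm(as.formula(paste("y ~", rhs)), data = dat))
  })
  which.min(bics)
}
```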
Result
| Size | MC | AIC | BIC | MCMC | Error |
|---|---|---|---|---|---|
| 20 | 0.502 | 0.560 | 0.656 | 0.508 | 59.401 |
| 50 | 0.894 | 0.668 | 0.882 | 0.880 | 78.224 |
| 100 | 0.998 | 0.676 | 0.916 | 0.926 | 93.176 |
| 300 | 0.998 | 0.712 | 0.978 | NA | NA |