Caleb

Stats Ph.D.

Bayesian Linear Model

Caleb Jin / 2019-04-18


Linear model

Consider a multiple linear regression model as follows: \[{\bf y}={\bf X}{\boldsymbol \beta}+ {\boldsymbol \epsilon},\] where \({\bf y}=(y_1,y_2,\dots,y_n)^{{\bf T}}\) is the \(n\)-dimensional response vector, \({\bf X}=({\bf x}_1,{\bf x}_2, \dots, {\bf x}_n)^{{\bf T}}\) is the \(n\times p\) design matrix, and \({\boldsymbol \epsilon}\sim \mathcal{N}_n({\boldsymbol 0},\sigma^2{\bf I}_n)\). We assume that \(p<n\) and \({\bf X}\) is full rank.

Maximum likelihood estimation (MLE) approach

Since \({\bf y}\sim \mathcal{N}_n({\bf X}{\boldsymbol \beta}, \sigma^2{\bf I}_n)\), the likelihood function is given as \[\begin{eqnarray*} L({\boldsymbol \beta},\sigma^2)&=&f({\bf y}|{\boldsymbol \beta},\sigma^2)\\ &= &(2\pi)^{-\frac{n}{2}}\lvert {\boldsymbol \Sigma}\lvert^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}({\bf y}-{\bf X}{\boldsymbol \beta})^{{\bf T}}{\boldsymbol \Sigma}^{-1}({\bf y}-{\bf X}{\boldsymbol \beta})\right\} , \end{eqnarray*}\]

where \({\boldsymbol \Sigma}=\sigma^2 {\bf I}_n\). Then the log likelihood can be written as \[\begin{eqnarray*} l({\boldsymbol \beta},\sigma^2)&=&\log L({\boldsymbol \beta},\sigma^2)\\ &=&-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}({\bf y}-{\bf X}{\boldsymbol \beta})^{{\bf T}}({\bf y}-{\bf X}{\boldsymbol \beta}). \end{eqnarray*}\] Note that, for fixed \(\sigma^2\), \(l({\boldsymbol \beta},\sigma^2)\) is a concave function of \({\boldsymbol \beta}\). Hence, the maximum likelihood estimator (MLE) can be obtained by solving the following equations: \[\begin{eqnarray*} \frac{\partial l({\boldsymbol \beta},\sigma^2)}{\partial {\boldsymbol \beta}} &=&- \frac{1}{2\sigma^2}(-2{\bf X}^{{\bf T}} {\bf y}+ 2{\bf X}^{{\bf T}} {\bf X}{\boldsymbol \beta})=0; \\ \frac{\partial l({\boldsymbol \beta},\sigma^2)}{\partial \sigma^2} &=&-\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2}\frac{1}{(\sigma^2)^2} ({\bf y}-{\bf X}{\boldsymbol \beta})^{{\bf T}}({\bf y}-{\bf X}{\boldsymbol \beta})=0. \end{eqnarray*}\] Therefore, the MLEs of \({\boldsymbol \beta}\) and \(\sigma^2\) are given as \[\begin{eqnarray*} \hat{{\boldsymbol \beta}} &=& ({\bf X}^{{\bf T}} {\bf X})^{-1}{\bf X}^{{\bf T}} {\bf y};\\ \hat{\sigma}^2 &=& \frac{({\bf y}-{\bf X}\hat{{\boldsymbol \beta}})^{{\bf T}}({\bf y}-{\bf X}\hat{{\boldsymbol \beta}})}{n}=\frac{ \|{\bf y}-\hat{{\bf y}}\|^2}{n}, \end{eqnarray*}\] where \(\hat{{\bf y}}={\bf X}\hat{{\boldsymbol \beta}}\).
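To make the closed-form solution concrete, here is a minimal R sketch on simulated data; the design matrix, sample size, coefficients, and seed below are illustrative assumptions, not part of the derivation.

```r
# Closed-form MLEs on simulated data (illustrative example)
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)                # assumed design matrix
beta <- c(1, 2, -1)                            # assumed true coefficients
y <- X %*% beta + rnorm(n, sd = 1.5)

beta_hat   <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'y
sigma2_hat <- sum((y - X %*% beta_hat)^2) / n  # note the divisor n, not n - p

# cross-check the coefficient estimates against lm()
fit <- lm(y ~ X - 1)
cbind(beta_hat, coef(fit))
```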

Distribution of \(\hat{\boldsymbol \beta}\) and \(\hat\sigma^2\)

Note that if \({\bf z}\sim \mathcal{N}_k(\mu,\Sigma)\), then \(A{\bf z}\sim \mathcal{N}_m(A\mu,A\Sigma A^{{\bf T}})\), where \(A\) is an \(m\times k\) matrix. Since \({\bf y}\sim \mathcal{N}_n({\bf X}{\boldsymbol \beta}, \sigma^2{\bf I}_n)\) and \(\hat{{\boldsymbol \beta}} =({\bf X}^{{\bf T}} {\bf X})^{-1}{\bf X}^{{\bf T}}{\bf y}\), we have \[\begin{eqnarray*} \hat{{\boldsymbol \beta}} &\sim& \mathcal{N}_p\left( ({\bf X}^{{\bf T}} {\bf X})^{-1}{\bf X}^{{\bf T}}{\bf X}{\boldsymbol \beta}, \sigma^2 ({\bf X}^{{\bf T}} {\bf X})^{-1}{\bf X}^{{\bf T}}{\bf X}({\bf X}^{{\bf T}} {\bf X})^{-1}\right)\\ &=&\mathcal{N}_p\left({\boldsymbol \beta}, \sigma^2 ({\bf X}^{{\bf T}} {\bf X})^{-1}\right). \end{eqnarray*}\]

Note that \({\bf y}-\hat{{\bf y}}=({\bf I}_n-{\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}){\bf y}\), where \({\bf I}_n-{\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}\) is an idempotent matrix with rank \((n-p)\).

\(\color{red}{\text{Can you prove that ${\bf I}_n-{\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}$ is an idempotent matrix of rank $(n-p)$ ?}}\)

Proof. Let \({\bf H}= {\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}\).

\({\bf H}{\bf H}= {\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}{\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}={\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}={\bf H}\); thus \({\bf H}\) is an idempotent matrix.

Similarly, since \(({\bf I}_n-{\bf H})({\bf I}_n-{\bf H}) = {\bf I}_n-2{\bf H}+{\bf H}{\bf H}={\bf I}_n-{\bf H}\), \(({\bf I}_n-{\bf H})\) is also idempotent.

Hence we have \(r({\bf I}_n-{\bf H})=tr({\bf I}_n-{\bf H})=n-tr({\bf H})=n-tr(({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}{\bf X})=n-tr({\bf I}_p)=n-p\), where the cyclic property of the trace is used to evaluate \(tr({\bf H})\).

\(\color{red}{\text{How to prove $r({\bf I}_n-{\bf H})=tr({\bf I}_n-{\bf H})$? }}\)

The eigenvalues of an idempotent matrix are either 1 or 0; hence its rank equals the sum of its eigenvalues, which in turn equals the trace of the matrix.

\(\color{red}{\text{How to prove trace of matrix is the sum of eigenvalues?}}\)

This follows from the characteristic polynomial: expanding \(\det(\lambda{\bf I}-{\bf A})\) shows that the coefficient of \(\lambda^{n-1}\) is \(-tr({\bf A})\), while factoring the polynomial over the eigenvalues shows it is also \(-\sum_i\lambda_i\); hence the trace equals the sum of the eigenvalues.

Another way to prove this is in this link

From Lemma 1, we have that \[\begin{eqnarray} \tag{1} n\frac{\hat{\sigma}^2}{\sigma^2}\sim \chi^2(n-p), \end{eqnarray}\] where \(\chi^2(n-p)\) denotes the chi-squared distribution with degrees of freedom \(n-p\).

\({\color{red}{\text{Can you prove Eq.}}}\)(1)?

Proof. By Lemma 1, we have
\(n\frac{\hat{\sigma}^2}{\sigma^2}=\frac{\hat e^{{\bf T}}\hat e}{\sigma^2}={\bf y}^{{\bf T}}(\frac{{\bf I}_n-{\bf H}}{\sigma^2}) {\bf y}={\bf y}^{{\bf T}}{\bf A}{\bf y}\sim\chi^2(tr({\bf A}{\boldsymbol \Sigma}),\mu^{{\bf T}}{\bf A}\mu/2)\)

where \({\bf A}=\frac{{\bf I}_n-{\bf H}}{\sigma^2}\), \(\mu^{{\bf T}}{\bf A}\mu/2=({\bf X}{\boldsymbol \beta})^{{\bf T}}(\frac{{\bf I}_n-{\bf H}}{\sigma^2})({\bf X}{\boldsymbol \beta})/2=0\). \({\bf A}{\boldsymbol \Sigma}= (\frac{{\bf I}_n-{\bf H}}{\sigma^2})\sigma^2{\bf I}_n={\bf I}_n-{\bf H},\) hence \(r({\bf A}{\boldsymbol \Sigma}) = n-p\). Therefore \(n\frac{\hat{\sigma}^2}{\sigma^2}\sim \chi^2(n-p)\).
Lemma 1 Let \({\bf z}\sim \mathcal{N}_k(\mu,\Sigma)\) with \(r(\Sigma)=k\), where \(r(\Sigma)\) denotes the rank of \(\Sigma\). If \({\bf A}{\boldsymbol \Sigma}\) (or \({\boldsymbol \Sigma}{\bf A}\)) is an idempotent matrix of rank \(m\), then \({\bf z}^{{\bf T}}{\bf A}{\bf z}\) follows the non-central chi-squared distribution with degrees of freedom \(m\) and non-centrality parameter \(\lambda=\mu^{{\bf T}}{\bf A}\mu/2\).

Bayesian approach

\(\sigma^2\) is known

Suppose \(\sigma^2\) is known. We define the prior distribution of \({\boldsymbol \beta}\) by \({\boldsymbol \beta}\sim \mathcal{N}_p({\boldsymbol 0},\sigma^2\nu{\bf I}_p)\). Then the posterior density of \({\boldsymbol \beta}\) can be obtained by \[\begin{eqnarray*} \pi ({\boldsymbol \beta}|{\bf y}) &\propto& f({\bf y}|{\boldsymbol \beta})\pi ({\boldsymbol \beta}) \\ &\propto& \exp\left(-\frac{1}{2\sigma^2}\| {\bf y}-{\bf X}{\boldsymbol \beta}\|^2\right)\times \exp\left(-\frac{1}{2\sigma^2\nu}\|{\boldsymbol \beta}\|^2\right)\\ &=&\exp\left[-\frac{1}{2\sigma^2}\left\{ ({\bf y}-{\bf X}{\boldsymbol \beta})^{{\bf T}}({\bf y}-{\bf X}{\boldsymbol \beta}) + \frac{1}{\nu}{\boldsymbol \beta}^{{\bf T}} {\boldsymbol \beta}\right\}\right]\\ &\propto&\exp\left\{-\frac{1}{2\sigma^2}\left( - 2 {\boldsymbol \beta}^{{\bf T}} {\bf X}^{{\bf T}} {\bf y}+ {\boldsymbol \beta}^{{\bf T}} {\bf X}^{{\bf T}} {\bf X}{\boldsymbol \beta}+ \frac{1}{\nu}{\boldsymbol \beta}^{\bf T}{\boldsymbol \beta} \right)\right\}\\ &=&\exp\left\{-\frac{1}{2\sigma^2}\left( - 2 {\boldsymbol \beta}^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}I_p\right) \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}I_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y}+ {\boldsymbol \beta}^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}I_p\right){\boldsymbol \beta}\right)\right\}\\ &\propto&\exp\left\{-\frac{1}{2\sigma^2}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}})^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}I_p\right)({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}}) \right\} \end{eqnarray*}\] where \(\tilde{{\boldsymbol \beta}}=\left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}I_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y}\).

This implies that \[\begin{eqnarray} {\boldsymbol \beta}|{\bf y}\sim \mathcal{N}_p\left(\left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}I_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y},\sigma^{2} \left({\bf X}^{{\bf T}} {\bf X} +\frac{1}{\nu}I_p\right)^{-1}\right).\tag{2} \end{eqnarray}\] The Bayes estimate is \(\hat{\boldsymbol \beta}_{Bayes} = \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}I_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y}\). It is worth noting that as \(\nu\to \infty\), the posterior distribution converges to \(\mathcal{N}_p\left(\hat{{\boldsymbol \beta}}_{MLE}, \sigma^2 ({\bf X}^{{\bf T}} {\bf X})^{-1}\right)\), which mirrors the sampling distribution \(\hat{{\boldsymbol \beta}}_{MLE}\sim \mathcal{N}_p\left({\boldsymbol \beta}, \sigma^2 ({\bf X}^{{\bf T}} {\bf X})^{-1}\right)\).
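Continuing the R sketch above, the posterior in Eq. (2) is available in closed form; the values of \(\sigma^2\) and \(\nu\) below are illustrative assumptions.

```r
# Posterior mean and covariance of beta from Eq. (2), sigma^2 known (illustrative values)
sigma2 <- 1.5^2                              # assumed known error variance
nu     <- 10                                 # assumed prior scale
A <- t(X) %*% X + diag(ncol(X)) / nu
beta_bayes <- solve(A, t(X) %*% y)           # posterior mean = Bayes (ridge-type) estimate
beta_cov   <- sigma2 * solve(A)              # posterior covariance
# as nu grows, 1/nu -> 0 and beta_bayes approaches the MLE beta_hat
```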

\(\sigma^2\) is unknown

In general, \(\sigma^2\) is unknown. Then, we need to assign a reasonable prior distribution for \(\sigma^2\). We consider the inverse gamma distribution, which is a conjugate prior here, as follows: \(\sigma^2\sim \mathcal{IG}(a_0,b_0)\) with the density function \[\begin{eqnarray} \pi(\sigma^2)=\frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left(-\frac{b_0}{\sigma^2}\right), \end{eqnarray}\] where \(a_0>0\) and \(b_0>0\). In addition, we need to introduce a prior for \({\boldsymbol \beta}|\sigma^2\):

\[{\boldsymbol \beta}|\sigma^2 \sim \mathcal{N}_p({\boldsymbol 0},\sigma^2\nu{\bf I}_p).\]

\(\color{red}{\text{Today, we derive the joint posterior distribution } \pi({\boldsymbol \beta},\sigma^2|{\bf y})\propto f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta}|\sigma^2)\pi(\sigma^2).}\)

Show that:

    1. \(\pi(\sigma^2|{\boldsymbol \beta},{\bf y}) = \mathcal{IG}(a^*,b^*)\)
    2. \({\boldsymbol \beta}|{\bf y}\) follows a multivariate t-distribution
The posterior density function of \(\sigma^2\) given \({\boldsymbol \beta}\) and \({\bf y}\) can be obtained as follows.

Proof (1). \[\begin{eqnarray*} &&\pi(\sigma^2|{\boldsymbol \beta},{\bf y}) = \frac{f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta},\sigma^2)}{\int f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta},\sigma^2)d\sigma^2} = \frac{f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta}|\sigma^2)\pi(\sigma^2)}{\int f({\bf y},{\boldsymbol \beta},\sigma^2)d\sigma^2}\\ &\propto& f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta}|\sigma^2)\pi(\sigma^2)\\ &\propto& (\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2\right) \times (\sigma^2)^{-\frac{p}{2}}\exp\left( -\frac{1}{2\sigma^2\nu}\|{\boldsymbol \beta}\|^2\right)\times (\sigma^2)^{-a_0-1}\exp \left(-\frac{b_0}{\sigma^2}\right)\\ &=& (\sigma^2)^{-(\frac{n+p}{2}+a_0)-1}\exp \left[-\frac{1}{\sigma^2}\left\{\frac{1}{2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2+\frac{1}{2\nu}\|{\boldsymbol \beta}\|^2 + b_0 \right\}\right]\\ &=& (\sigma^2)^{-a^*-1} \exp(-\frac{b^*}{\sigma^2}) \end{eqnarray*}\] where \(a^* = \frac{n+p}{2}+a_0\), \(b^* = \frac{1}{2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2+\frac{1}{2\nu}\|{\boldsymbol \beta}\|^2 + b_0\).

This implies that \[\begin{eqnarray} \sigma^2|{\boldsymbol \beta},{\bf y}\sim \mathcal{IG}\left(\frac{n+p}{2}+a_0, \frac{1}{2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2+\frac{1}{2\nu}\|{\boldsymbol \beta}\|^2 + b_0\right). \end{eqnarray}\]
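A one-line way to draw from this inverse gamma in R is to draw from a gamma and invert it (if \(g\sim\) Gamma\((a,\text{rate}=b)\), then \(1/g\sim\mathcal{IG}(a,b)\)); the hyperparameters and the value plugged in for \({\boldsymbol \beta}\) below are assumptions for illustration, reusing `X`, `y`, `beta_hat`, and `nu` from the sketches above.

```r
# One draw of sigma^2 | beta, y ~ IG(a*, b*)
a0 <- 2; b0 <- 2                                       # assumed prior hyperparameters
a_star <- (nrow(X) + ncol(X)) / 2 + a0
b_star <- 0.5 * sum((y - X %*% beta_hat)^2) +
          sum(beta_hat^2) / (2 * nu) + b0              # beta_hat plugged in as an illustrative beta
sigma2_draw <- 1 / rgamma(1, shape = a_star, rate = b_star)
```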


Proof (2). \[\begin{eqnarray*} &&\pi({\boldsymbol \beta}|{\bf y}) \\ &=& \int\pi({\boldsymbol \beta},\sigma^2|{\bf y})d\sigma^2\\ &=& \int \frac{f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta},\sigma^2)}{\iint f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta},\sigma^2)d{\boldsymbol \beta}d\sigma^2}d\sigma^2\\ &\propto& \int f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta}|\sigma^2)\pi(\sigma^2) d\sigma^2\\ &\propto&\Gamma(a^*){b^*}^{-a^*}\\ &\propto&{b^*}^{-a^*}\\ &=&[\frac{1}{2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2+\frac{1}{2\nu}\|{\boldsymbol \beta}\|^2 + b_0]^{-a^*}\\ &\propto&\left[({\bf y}-{\bf X}{\boldsymbol \beta})^T({\bf y}-{\bf X}{\boldsymbol \beta})+\frac{1}{\nu}{\boldsymbol \beta}^T{\boldsymbol \beta}+ 2b_0 \right]^{-a^*}\\ &=&\left[{\bf y}^T{\bf y}- 2{\boldsymbol \beta}^T{\bf X}^T{\bf y}+ {\boldsymbol \beta}^T{\bf X}^T{\bf X}{\boldsymbol \beta}+ \frac{1}{\nu}{\boldsymbol \beta}^T{\boldsymbol \beta}+2b_0 \right]^{-a^*}\\ &\propto& \left[ {\boldsymbol \beta}^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right){\boldsymbol \beta}- 2 {\boldsymbol \beta}^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right) \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y}+ {\bf y}^T{\bf y}+2b_0 \right]^{-a^*}\\ &=& \left[ {\boldsymbol \beta}^{{\bf T}}{\bf A}{\boldsymbol \beta}- 2{\boldsymbol \beta}^{{\bf T}}{\bf A}\tilde{{\boldsymbol \beta}} + \tilde{{\boldsymbol \beta}}^{{\bf T}} {\bf A}\tilde{{\boldsymbol \beta}} - \tilde{{\boldsymbol \beta}}^{{\bf T}}{\bf A}\tilde{{\boldsymbol \beta}} + {\bf y}^{{\bf T}}{\bf y}+2b_0\right]^{-a^*}\\ &=& \left[ ({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^T{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}}) + {\bf y}^{{\bf T}}{\bf y}- {\bf y}^{{\bf T}}{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}{\bf y}+ 2b_0\right]^{-a^*}\\ &=& \left[ ({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^T{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}}) + {\bf y}^{{\bf T}}({\bf I}_n-{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}+ 2b_0\right]^{-a^*}\\ &\propto& \left[({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^T{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}}) + c^* \right]^{-a^*}\\ &\propto&\left[1+ \frac{1}{c^*}({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^T{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}})\right]^{-\frac{n+p+2a_0}{2}}\\ &=&\left[1+ \frac{\nu^*}{\nu^*c^*}({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^T{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}})\right]^{-\frac{p+\nu^*}{2}}\\ &=&\left[1+ \frac{1}{\nu^*}({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^T(\frac{c^*}{\nu^*}{\bf A}^{-1})^{-1}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}})\right]^{-\frac{p+\nu^*}{2}}\\ \end{eqnarray*}\] This implies that \[\begin{eqnarray} {\boldsymbol \beta}|{\bf y}\sim {\bf \mathcal{MT}}(\tilde{{\boldsymbol \beta}},\frac{c^*}{\nu^*}{\bf A}^{-1},\nu^*), \tag{3} \end{eqnarray}\]

where

\[\begin{eqnarray*} {\bf A}&=&{\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p,\\ \tilde{{\boldsymbol \beta}} &=& {\bf A}^{-1} {\bf X}^{{\bf T}} {\bf y},\\ c^* &=& {\bf y}^{{\bf T}}({\bf I}_n-{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}+ 2b_0,\\ \nu^* &=& n+2a_0. \end{eqnarray*}\]


Note: the density of a multivariate t-distribution \(\mathcal{MT}({\boldsymbol \mu},{\boldsymbol \Sigma},\nu)\), as used in (3), is \[\begin{eqnarray*} \frac{\Gamma[(\nu+p)/2]}{\Gamma(\nu/2)\nu^{p/2}\pi^{p/2}|{\boldsymbol \Sigma}|^{1/2}}\left[1+\frac{1}{\nu}({\bf x}- {\boldsymbol \mu})^{{\bf T}} {\boldsymbol \Sigma}^{-1}({\bf x}-{\boldsymbol \mu})\right]^{-(\nu+p)/2}. \end{eqnarray*}\]

Monte Carlo Simulation

Suppose \(\sigma^2\) is unknown. If \({\boldsymbol \beta}\sim \mathcal{N}_p({\boldsymbol 0},\sigma^2\nu{\bf I}_p)\), we know that the distribution of \({\boldsymbol \beta}|{\bf y}\) is given by Eq. (3). From this known distribution, we can easily compute \(E({\boldsymbol \beta}|{\bf y})\) and \(V({\boldsymbol \beta}|{\bf y})\). In practice, however, the density \(\pi({\boldsymbol \beta}|{\bf y})\) is often unknown or very complicated, making it impossible to evaluate these integrals directly. Hence we turn to numerical methods, such as Monte Carlo methods.

For example, suppose the posterior density \(\pi(\theta|{\bf y})\) has no tractable form, and we want \(E(\theta|{\bf y}) = \int \theta\pi (\theta|{\bf y})d\theta\). According to Lemma 2 (the strong law of large numbers), \(\bar X_{n} \rightarrow \mu = E(X)\) as \(n \rightarrow \infty\). Therefore, if we independently generate \(\theta^{(1)}, \theta^{(2)}, \ldots,\theta^{(M)}\) from \(\pi(\theta|{\bf y})\), then \(M^{-1} \sum_{k=1}^M\theta^{(k)} \rightarrow E(\theta|{\bf y})\) as \(M \rightarrow \infty\). However, there are two problems.

  1. What if we cannot generate samples from \(\pi(\theta|{\bf y})\)?

  2. What if the samples are not independently and identically distributed?

The solutions to these two questions are the Monte Carlo (MC) method and Markov chain Monte Carlo (MCMC), respectively.

For the first question, we can use the importance sampling method.

Lemma 2 (Strong Law of Large Numbers)

Let \(X_{1}, X_{2},\ldots\) be a sequence of i.i.d. random variables, each having finite mean \(m\). Then \[\Pr\left(\lim_{n\rightarrow\infty} \frac{1}{n}(X_{1}+X_{2}+\cdots+X_{n}) = m\right) = 1.\]

\(\sigma^2\) is known

Importance Sampling

To estimate the mean of the parameters, we would usually sample from \(\pi(\theta|{\bf y})\) and compute the sample mean. In practice, however, it is hard to sample directly from \(\pi(\theta|{\bf y})\), because we do not know which specific distribution it is. Hence we turn to the importance sampling method.

Note that \[\begin{eqnarray*} E(\theta|{\bf y}) &=& \int \theta\pi(\theta|{\bf y})d\theta\\ &=& \int \theta \frac{\pi(\theta|{\bf y})}{g(\theta)}g(\theta)d\theta\\ &=& \int h(\theta)g(\theta)d\theta \\ &=& E\{h(\theta)\} \end{eqnarray*}\] where \(h(\theta)=\theta \frac{\pi(\theta|{\bf y})}{g(\theta)}, g(\theta)\) is a known pdf.

To estimate the variance of parameters, similarly, \[\begin{eqnarray*} Var(\theta|{\bf y}) &=& E[\theta - E(\theta|{\bf y}) ]^2 = \int (\theta- E(\theta|{\bf y}))^2\pi(\theta|{\bf y}) d\theta\\ &=& \int (\theta- E(\theta|{\bf y}))^2 \frac{\pi(\theta|{\bf y})}{g(\theta)}g(\theta)d\theta\\ &=& \int h^{'}(\theta)g(\theta)d\theta \\ &=& E\{h^{'}(\theta)\} \end{eqnarray*}\] where \(h^{'}(\theta)=(\theta- E(\theta|{\bf y}))^2 \frac{\pi(\theta|{\bf y})}{g(\theta)},\) and \(g(\theta)\) is a known pdf.

The importance sampling method can be implemented as follows:

Note that \[ \sum_{i=1}^M \frac{h(\theta_{i})}{M} \rightarrow E(h(\theta)) = E(\theta|{\bf y})\quad \text{as} \quad M \rightarrow\infty, \] where \(\theta_{1},\ldots,\theta_{M}\) are generated independently from \(g(\theta)\). Estimating the variance by importance sampling is similar.
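A minimal R sketch of this estimator; `target_pdf`, `g_sample`, and `g_pdf` are generic placeholders (assumptions), and the toy check at the end uses a normal posterior so the answer is known.

```r
# Importance-sampling estimate of E(theta | y)
importance_mean <- function(M, target_pdf, g_sample, g_pdf) {
  theta <- g_sample(M)                      # theta^(1), ..., theta^(M) drawn from g
  w <- target_pdf(theta) / g_pdf(theta)     # weights pi(theta | y) / g(theta)
  mean(theta * w)                           # average of h(theta) = theta * w
}

# toy check: "posterior" N(1, 0.5^2), proposal N(0, 1); the estimate should be close to 1
importance_mean(1e5,
                target_pdf = function(t) dnorm(t, 1, 0.5),
                g_sample   = function(M) rnorm(M),
                g_pdf      = function(t) dnorm(t))
```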

Importance Sampling Simulation

Setup

Consider a simple linear regression model as follows:

\[y_i=\beta_0 +\beta_1 x_i+ \varepsilon_i\] where \(\varepsilon_i \sim N(0,1), x_i\sim N(0,1), \beta_0 = 1, \beta_1 = 2,\) for \(i=1,\ldots,100.\)

We already know that if \({\boldsymbol \beta}\sim \mathcal{N}_p({\boldsymbol 0},\sigma^2\nu{\bf I}_p)\), then the distribution of \({\boldsymbol \beta}|{\bf y}\) is given by Eq. (2). Assume \(\sigma = 1\) is known and take the prior \(\pi({\boldsymbol \beta}) = \phi({\boldsymbol \beta};(0,0)^{{\bf T}}, 10{\bf I}_2)\), so in this case \(\nu = 10\). Our simulation can be broken down into the following 3 steps:

  1. Find the exact values of \(E(\beta_{0}|{\bf y}), E(\beta_{1}|{\bf y}), V(\beta_{0}|{\bf y})\) and \(V(\beta_{1}|{\bf y})\).

  2. Use the MC method to simulate the results above with sample sizes of 100, 1000 and 5000.

  3. Use the importance sampling method to obtain the same results with the same sample sizes as in (2), taking the proposal densities to be \(\phi(\beta_0; 1, 0.5)\) and \(\phi(\beta_1; 2, 0.5)\) (see the sketch below).
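Below is a rough R sketch of the three steps under this setup; it treats the 0.5 in \(\phi(\cdot|1, 0.5)\) as a variance, uses an arbitrary seed, and relies on the MASS and mvtnorm packages, all of which are assumptions.

```r
# Sketch of steps 1-3 for the simple regression simulation (n = 100, beta = (1, 2), sigma = 1, nu = 10)
set.seed(2019)
n <- 100
x <- rnorm(n)
X <- cbind(1, x)
y <- 1 + 2 * x + rnorm(n)
nu <- 10; sigma2 <- 1
A <- t(X) %*% X + diag(2) / nu

## Step 1: exact posterior moments from Eq. (2)
post_mean <- as.vector(solve(A, t(X) %*% y))
post_cov  <- sigma2 * solve(A)

## Step 2: plain MC -- draw directly from the normal posterior in Eq. (2)
M <- 5000
draws <- MASS::mvrnorm(M, mu = post_mean, Sigma = post_cov)
colMeans(draws); apply(draws, 2, var)

## Step 3: importance sampling with proposals phi(beta0; 1, 0.5) and phi(beta1; 2, 0.5)
b0 <- rnorm(M, 1, sqrt(0.5)); b1 <- rnorm(M, 2, sqrt(0.5))
log_post <- mvtnorm::dmvnorm(cbind(b0, b1), mean = post_mean, sigma = post_cov, log = TRUE)
log_g    <- dnorm(b0, 1, sqrt(0.5), log = TRUE) + dnorm(b1, 2, sqrt(0.5), log = TRUE)
w <- exp(log_post - log_g)                    # importance weights
c(E_beta0 = mean(b0 * w), E_beta1 = mean(b1 * w))
```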

Simulation Results

We use R software to do this simulation.

  1. According to Eq. (2), compute \(E({\boldsymbol \beta}|{\bf y})=\left({\bf X}^{{\bf T}} {\bf X}+ \frac{1}{\nu}I_p \right)^{-1} {\bf X}^{{\bf T}} {\bf y}\) and \(Var({\boldsymbol \beta}|{\bf y})=\sigma^2 \left({\bf X}^{{\bf T}} {\bf X}+ \frac{1}{\nu}I_p \right)^{-1}\), where \(\sigma =1\) and \(\nu = 10\), then we get the following results:
    \(E(\beta_0)\) \(E(\beta_1)\) \(Var(\beta_0)\) \(Var(\beta_1)\)
    1.108 1.997 0.01 0.011
  2. Sample directly from the normal distribution of Eq. (2) with sample sizes of 100, 1000 and 5000, then compute the mean and variance of the samples. Hence we get the following results:
Sample size \(E(\beta_0)\) \(E(\beta_1)\) \(Var(\beta_0)\) \(Var(\beta_1)\)
100 1.104 2.002 0.009 0.012
1000 1.107 1.993 0.011 0.012
5000 1.108 1.996 0.010 0.011
  3. Sample \(\beta_0\) and \(\beta_1\) directly from \(\phi(\beta_0; 1, 0.5)\) and \(\phi(\beta_1; 2, 0.5)\), respectively, with sample sizes of 100, 1000 and 5000. Then compute \(h(\theta_i)\) and \(h^{'}(\theta_i)\) and average them to obtain estimates of the posterior expectation and variance. The final results are as follows:
    Sample size \(E(\beta_0)\) \(E(\beta_1)\) \(Var(\beta_0)\) \(Var(\beta_1)\)
    100 1.0500 2.0623 0.0175 0.0127
    1000 1.1323 1.9521 0.0106 0.0127
    5000 1.1269 2.0478 0.0107 0.0137

\(\sigma^2\) is unknown

Suppose \(\sigma^2\) is unknown; then we cannot use \(\pi({\boldsymbol \beta}|{\bf y},\sigma^2)\) directly.

But we know that \(\pi({\boldsymbol \beta},\sigma^2|{\bf y}) = \pi({\boldsymbol \beta}|{\bf y},\sigma^2)\pi(\sigma^2|{\bf y})\), and we have already shown that

\({\boldsymbol \beta}|{\bf y},\sigma^2 \sim \mathcal{N}_p\left(({\bf X}^{{\bf T}}{\bf X}+\frac{1}{\nu}I_p)^{-1}{\bf X}^{{\bf T}}{\bf y}, ({\bf X}^{{\bf T}}{\bf X}+\frac{1}{\nu}I_p)^{-1}\sigma^2\right)\)

So \[\begin{eqnarray*} &&\pi(\sigma^2|{\bf y}) = \int \pi(\sigma^2, {\boldsymbol \beta}| {\bf y}) d{\boldsymbol \beta} \propto \int f({\bf y}| \sigma^2, {\boldsymbol \beta})\pi({\boldsymbol \beta}|\sigma^2)\pi(\sigma^2)d{\boldsymbol \beta} =\int f({\bf y}|\sigma^2,{\boldsymbol \beta})\pi({\boldsymbol \beta}|\sigma^2)d{\boldsymbol \beta}\times \pi(\sigma^2)\\ &\propto& \int (\sigma^2)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2}({\bf y}- {\bf X}{\boldsymbol \beta})^{{\bf T}} ({\bf y}- {\bf X}{\boldsymbol \beta})\right](\sigma^2)^{-\frac{p}{2}}\exp(-\frac{1}{2\sigma^2\nu}{\boldsymbol \beta}^{{\bf T}}{\boldsymbol \beta}) d{\boldsymbol \beta}\times \pi(\sigma^2)\\ &\propto& \int \exp\left[-\frac{1}{2\sigma^2}({\bf y}^{{\bf T}}{\bf y}-2 {\boldsymbol \beta}^{{\bf T}}{\bf X}^{{\bf T}}{\bf y}+ {\boldsymbol \beta}^{{\bf T}}{\bf X}^{{\bf T}}{\bf X}{\boldsymbol \beta}+ \frac{1}{\nu}{\boldsymbol \beta}^{{\bf T}}{\boldsymbol \beta})\right]d{\boldsymbol \beta}\left((\sigma^2 )^{-\frac{1}{2}(n + p)}\pi(\sigma^2)\right)\\ &=& \int \exp \left[-\frac{1}{2\sigma^2}\left({\boldsymbol \beta}^{{\bf T}}({\bf X}^{{\bf T}}{\bf X}+\frac{1}{\nu}I_p) {\boldsymbol \beta}-2{\boldsymbol \beta}^{{\bf T}}({\bf X}^{{\bf T}}{\bf X}+\frac{1}{\nu}I_p)({\bf X}^{{\bf T}}{\bf X}+\frac{1}{\nu}I_p)^{-1}{\bf X}^{{\bf T}} {\bf y}+ {\bf y}^{{\bf T}}{\bf y}\right)\right]d{\boldsymbol \beta}\\ &&\left((\sigma^2)^{-\frac{1}{2}(n + p)}\pi(\sigma^2)\right)\\ &=&\int\exp \left[-\frac{1}{2\sigma^2}({\boldsymbol \beta}- \tilde {\boldsymbol \beta})^{{\bf T}}{\bf A}({\boldsymbol \beta}- \tilde {\boldsymbol \beta}) + {\bf y}^{{\bf T}}{\bf y}-\tilde {\boldsymbol \beta}^{{\bf T}}{\bf A}\tilde {\boldsymbol \beta}\right]d{\boldsymbol \beta} \left((\sigma^2)^{-\frac{1}{2}(n + p)}\pi(\sigma^2)\right)\\ &=&\int\exp \left[-\frac{1}{2\sigma^2}({\boldsymbol \beta}- \tilde {\boldsymbol \beta})^{{\bf T}}{\bf A}({\boldsymbol \beta}- \tilde {\boldsymbol \beta}) \right]d{\boldsymbol \beta}\times \exp\left[-\frac{1}{2\sigma^2}\left({\bf y}^{{\bf T}}{\bf y}-{\bf y}^{{\bf T}} {\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}{\bf y}\right)\right]\left((\sigma^2)^{-\frac{1}{2}(n + p)}\pi(\sigma^2)\right)\\ &=& (2\pi)^{\frac{p}{2}} \lvert\sigma^2{\bf A}^{-1}\rvert^{\frac{1}{2}}\exp\left[-\frac{1}{2\sigma^2} \left({\bf y}^{{\bf T}}(I_n - {\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}\right)\right]\left((\sigma^2)^{-\frac{1}{2}(n + p)} \pi(\sigma^2)\right)\\ &\propto&(\sigma^2)^{\frac{p}{2} - \frac{1}{2}(n+p)}(\sigma^2)^{-a_0 -1} \exp \left[-\frac{1}{\sigma^2}\left(b_0 + \frac{1}{2} {\bf y}^{{\bf T}}(I_n - {\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}\right) \right] \\ &=& (\sigma^2)^{-(\frac{1}{2}n + a_0)-1}\exp \left[-\frac{1}{\sigma^2} \left(b_0 + \frac{1}{2} {\bf y}^{{\bf T}}(I_n - {\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}\right)\right] \\ \\ &=& (\sigma^2)^{-a^*-1}\exp(-\frac{b^*}{\sigma^2}) \end{eqnarray*}\] where \({\bf A}= ({\bf X}^{{\bf T}}{\bf X}+\frac{1}{\nu}I_p),\quad \tilde {\boldsymbol \beta}= {\bf A}^{-1}{\bf x}^{{\bf T}}{\bf y}, \quad a^* =\frac{1}{2}n+ a_0,\quad b^*=b_0 + \frac{1}{2} {\bf y}^{{\bf T}}(I_n - {\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}\).

This is proportional to the pdf of \(\mathcal{IG}(a^*, b^*)\). Hence we have \[\begin{eqnarray*} E({\boldsymbol \beta}|{\bf y}) &=& \int{\boldsymbol \beta}\pi({\boldsymbol \beta}|{\bf y})d{\boldsymbol \beta}= \int{\boldsymbol \beta}\left[ \int\pi({\boldsymbol \beta},\sigma^2|{\bf y})d\sigma^2 \right]d{\boldsymbol \beta}\\ &=& \int\int{\boldsymbol \beta}\,\pi({\boldsymbol \beta}|{\bf y},\sigma^2)\pi(\sigma^2|{\bf y})d\sigma^2 d{\boldsymbol \beta}. \end{eqnarray*}\] Similarly, \[\begin{eqnarray*} Var({\boldsymbol \beta}|{\bf y}) &=& \int({\boldsymbol \beta}-E({\boldsymbol \beta}|{\bf y}))({\boldsymbol \beta}-E({\boldsymbol \beta}|{\bf y}))^{{\bf T}}\pi({\boldsymbol \beta}|{\bf y})d{\boldsymbol \beta}\\ &=& \int\int({\boldsymbol \beta}-E({\boldsymbol \beta}|{\bf y}))({\boldsymbol \beta}-E({\boldsymbol \beta}|{\bf y}))^{{\bf T}}\pi({\boldsymbol \beta}|{\bf y},\sigma^2)\pi(\sigma^2|{\bf y})d\sigma^2 d{\boldsymbol \beta}. \end{eqnarray*}\]

Plug-in Sampling Method

We can estimate the mean of \({\boldsymbol \beta}\) by computing \(E({\boldsymbol \beta}|{\bf y})\approx\sum_{m=1}^{M} {\boldsymbol \beta}^{(m)} /M\), where the \({\boldsymbol \beta}^{(m)}\) are generated as follows (see the sketch after the list):

    1. Generate \(\sigma^{2(m)}\) from \(\pi(\sigma^2|{\bf y})\), where \(\sigma^2|{\bf y}\sim\mathcal{IG}(a^*,b^*)\).
    2. For given \(\sigma^{2(m)}\), generate \({\boldsymbol \beta}^{(m)}\) from \(\pi({\boldsymbol \beta}|{\bf y}, \sigma^{2(m)})\), which is the normal distribution in Eq. (2).
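A rough R sketch of this two-step (plug-in / composition) sampler, reusing `X`, `y`, `A`, and `nu` from the sketch above; the hyperparameters \(a_0=b_0=2\) and the number of draws are assumptions.

```r
# Plug-in (composition) sampling of beta | y when sigma^2 is unknown
a0 <- 2; b0 <- 2; M <- 5000
n <- nrow(X); p <- ncol(X)
a_star <- n / 2 + a0
b_star <- b0 + 0.5 * drop(t(y) %*% (diag(n) - X %*% solve(A, t(X))) %*% y)
post_mean <- as.vector(solve(A, t(X) %*% y))
Ainv <- solve(A)
beta_draws <- matrix(NA, M, p)
for (m in 1:M) {
  sigma2_m <- 1 / rgamma(1, shape = a_star, rate = b_star)                       # step 1: sigma^2 | y
  beta_draws[m, ] <- MASS::mvrnorm(1, mu = post_mean, Sigma = sigma2_m * Ainv)   # step 2: beta | y, sigma^2
}
colMeans(beta_draws); apply(beta_draws, 2, var)
```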

Simulation Results

We use the same setup as in the section Importance Sampling Simulation. The final result is as follows:
Sample size \(E(\beta_0)\) \(E(\beta_1)\) \(Var(\beta_0)\) \(Var(\beta_1)\)
100 1.1099 1.9960 0.0073 0.0107
1000 1.1049 2.0004 0.0079 0.0084
5000 1.1081 1.9966 0.0081 0.0088

Markov Chain Monte Carlo Simulation

Gibbs Sampling

Suppose \(\sigma^2\) is unknown. We know \[ \sigma^2|{\bf y}, \beta_0, \beta_1 \sim \mathcal{IG}(a^*, b^*), \] \[ {\boldsymbol \beta}|{\bf y},\sigma^2\sim\mathcal{N}_p\left(({\bf X}^{{\bf T}}{\bf X}+\frac{1}{\nu}I_p)^{-1}{\bf X}^{{\bf T}}{\bf y},({\bf X}^{{\bf T}}{\bf X}+\frac{1}{\nu}I_p)^{-1}\sigma^2\right). \] Hence we can obtain the conditional distribution of each parameter from this bivariate normal, \[ \beta_0|\beta_1,{\bf y},\sigma^2 \sim\mathcal{N}\left(\mu_0 + \frac{\sigma_0}{\sigma_1}\rho(\beta_1-\mu_1),\ (1-\rho^2)\sigma_0^2\right), \] \[ \beta_1|\beta_0,{\bf y},\sigma^2 \sim\mathcal{N}\left(\mu_1 + \frac{\sigma_1}{\sigma_0}\rho(\beta_0-\mu_0),\ (1-\rho^2)\sigma_1^2\right), \] where \(\sigma_0^2=[1,0]\,\sigma^2{\bf A}^{-1}[1,0]^{{\bf T}}\), \(\sigma_1^2=[0,1]\,\sigma^2{\bf A}^{-1}[0,1]^{{\bf T}}\), \(\rho\sigma_0\sigma_1=[1,0]\,\sigma^2{\bf A}^{-1}[0,1]^{{\bf T}}\), \(\mu_0=[1,0]{\bf A}^{-1}{\bf X}^{{\bf T}}{\bf y}\), and \(\mu_1=[0,1]{\bf A}^{-1}{\bf X}^{{\bf T}}{\bf y}\).

To generate \({\boldsymbol \beta}^{(m)}\) from \(\pi({\boldsymbol \beta}|{\bf y})\), we can use an MCMC method, Gibbs sampling. Gibbs sampling, one of the most popular MCMC methods, can be implemented as follows:

Set the initial value \((\beta_0^{(0)},\beta_1^{(0)},\sigma^{2(0)})\). Then iterate the following steps for \(t=0,1,2,\ldots\).

Step 1: Generate \(\sigma^{2(t+1)}\) from \(\pi(\sigma^2|{\bf y},\beta_0^{(t)},\beta_1^{(t)})\).

Step 2: Generate \(\beta_0^{(t+1)}\) from \(\pi(\beta_0|{\bf y},\beta_1^{(t)},\sigma^{2(t+1)})\).

Step 3: Generate \(\beta_1^{(t+1)}\) from \(\pi(\beta_1|{\bf y},\beta_0^{(t+1)},\sigma^{2(t+1)})\).

We set the total sample size to 5000, with a burn-in period of 2000.
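A rough R sketch of this Gibbs sampler, reusing `X`, `y`, and `nu` from the earlier sketches; the hyperparameters, seed, and initial values are assumptions. The draws of \(\beta_0\) and \(\beta_1\) use the bivariate normal conditioning formulas stated above.

```r
# Gibbs sampler for (beta0, beta1, sigma^2)
set.seed(1)
a0 <- 2; b0 <- 2
n <- nrow(X); p <- ncol(X)
M <- 5000; burn <- 2000
A    <- t(X) %*% X + diag(p) / nu
Ainv <- solve(A)
mu   <- as.vector(solve(A, t(X) %*% y))        # mean of beta | y, sigma^2
beta <- c(0, 0); sigma2 <- 1                   # initial values (assumed)
out <- matrix(NA, M, 3, dimnames = list(NULL, c("beta0", "beta1", "sigma2")))
for (t in 1:M) {
  ## Step 1: sigma^2 | y, beta ~ IG(a*, b*)
  a_star <- (n + p) / 2 + a0
  b_star <- b0 + 0.5 * sum((y - X %*% beta)^2) + sum(beta^2) / (2 * nu)
  sigma2 <- 1 / rgamma(1, shape = a_star, rate = b_star)
  ## Steps 2-3: beta_j | beta_{-j}, y, sigma^2 from the bivariate normal conditionals
  V <- sigma2 * Ainv
  for (j in 1:2) {
    k <- setdiff(1:2, j)
    cond_mean <- mu[j] + V[j, k] / V[k, k] * (beta[k] - mu[k])
    cond_var  <- V[j, j] - V[j, k]^2 / V[k, k]
    beta[j] <- rnorm(1, cond_mean, sqrt(cond_var))
  }
  out[t, ] <- c(beta, sigma2)
}
colMeans(out[(burn + 1):M, ])                  # posterior means after burn-in
```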

The final result is as follows:

Sample size \(E(\beta_0)\) \(E(\beta_1)\) \(Var(\beta_0)\) \(Var(\beta_1)\)
5000 1.1096 1.9955 0.0082 0.009

The diagnostic tool used to assess the convergence of the sampler is as follows:


Figure 1: \(\beta_0\); \(\beta_1\); the marginals and the joint distribution

Bayesian vs. Frequentist Case Study

Multivariate Normal Conditional Distribution

  • The conditional posterior of \(\beta_j\) is obtained as follows: \[\begin{eqnarray*} \pi(\beta_j|{\boldsymbol \beta}_{-j},\sigma^2,{\bf y}) &\propto& \pi(\beta_j,{\boldsymbol \beta}_{-j}|\sigma^2,{\bf y})\\ &=& \pi({\boldsymbol \beta}|\sigma^2,{\bf y})\\ &\propto&f({\boldsymbol \beta},{\bf y}|\sigma^2)\\ &=& f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta}|\sigma^2)\\ &\propto&\exp \left(-\frac{1}{2\sigma^2}\parallel {\bf y}-{\bf X}{\boldsymbol \beta}\parallel^2 \right) \exp\left(-\frac{1}{2\sigma^2\nu}\parallel{\boldsymbol \beta}\parallel^2\right)\\ &\propto&\exp\left[-\frac{1}{2\sigma^2}\left(({\bf y}-{\bf X}_{-j}{\boldsymbol \beta}_{-j}-x_j\beta_j)^{{\bf T}} ({\bf y}-{\bf X}_{-j}{\boldsymbol \beta}_{-j}-x_j\beta_j)+\frac{1}{\nu}\beta_j^{{\bf T}}\beta_j \right) \right]\\ &\propto&\exp \left[-\frac{1}{2\sigma^2} (-{\bf y}^{{\bf T}}x_j \beta_j + {\boldsymbol \beta}_{-j}^{{\bf T}}{\bf X}_{-j}^{{\bf T}}x_j \beta_j- \beta_j^{{\bf T}}x_j^{{\bf T}}{\bf y}+ \beta_j^{{\bf T}}x_j^{{\bf T}} {\bf X}_{-j}{\boldsymbol \beta}_{-j}+ \beta_j^{{\bf T}}x_j^{{\bf T}}x_j\beta_j + \frac{1}{\nu}\beta_j^{{\bf T}}\beta_j)\right]\\ &\propto&\exp \left[-\frac{1}{2\sigma^2} (-2\beta_j^{{\bf T}}x_j^{{\bf T}}{\bf y}+ 2\beta_j^{{\bf T}}x_j^{{\bf T}} {\bf X}_{-j}{\boldsymbol \beta}_{-j}+\beta_j^{{\bf T}}x_j^{{\bf T}}x_j \beta_j + \frac{1}{\nu}\beta_j^{{\bf T}}\beta_j)\right]\\ &=&\exp \left[-\frac{1}{2\sigma^2} \left(\beta_j^{{\bf T}}(x_j^{{\bf T}}x_j +\frac{1}{\nu})\beta_j -2\beta_j^{{\bf T}}x_j^{{\bf T}}({\bf y}-{\bf X}_{-j}{\boldsymbol \beta}_{-j})\right)\right]\\ &=&\exp \left[-\frac{1}{2\sigma^2} (x_j^{{\bf T}}x_j +\frac{1}{\nu})\left(\beta_j^{{\bf T}}\beta_j -2\beta_j^{{\bf T}}(x_j^{{\bf T}}x_j +\frac{1}{\nu})^{-1}x_j^{{\bf T}}({\bf y}-{\bf X}_{-j}{\boldsymbol \beta}_{-j})\right)\right]\\ &=&\exp \left[-\frac{1}{2\sigma^2}(x_j^{{\bf T}}x_j +\frac{1}{\nu})(\beta_j-\tilde \beta_j)^2 \right] \end{eqnarray*}\] where \(\tilde \beta_j=(x_j^{{\bf T}}x_j +\frac{1}{\nu})^{-1}x_j^{{\bf T}}({\bf y}-{\bf X}_{-j}{\boldsymbol \beta}_{-j})\), \({\bf X}_{-j}\) is a submatrix of \({\bf X}\) without the \(j^{th}\) column, and \({\boldsymbol \beta}_{-j}\) is a subvector of \({\boldsymbol \beta}\) without the \(j^{th}\) element. Hence

\[ \beta_j|{\boldsymbol \beta}_{-j},\sigma^2,{\bf y}\sim \mathcal{N}\left((x_j^{{\bf T}}x_j +\frac{1}{\nu})^{-1} x_j^{{\bf T}}({\bf y}-{\bf X}_{-j}{\boldsymbol \beta}_{-j}),\sigma^2\left[x_j^{{\bf T}}x_j +\frac{1}{\nu}\right]^{-1}\right) \]
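In R, one draw from this conditional can be written as a small helper; `X`, `y`, `beta`, `sigma2`, and `nu` are assumed to already exist (for example from the sketches above), and the function name is illustrative.

```r
# One draw of beta_j | beta_{-j}, sigma^2, y
draw_beta_j <- function(j, beta, X, y, sigma2, nu) {
  xj <- X[, j]
  r  <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual y - X_{-j} beta_{-j}
  prec <- sum(xj^2) + 1 / nu                     # x_j' x_j + 1/nu
  rnorm(1, mean = sum(xj * r) / prec, sd = sqrt(sigma2 / prec))
}
```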

Setup

  • Consider

\[ f({\bf y}|{\boldsymbol \beta},\sigma^2) = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2} \parallel{\bf y}-{\bf X}{\boldsymbol \beta}\parallel^2\right), \] \[ \pi({\boldsymbol \beta}|\sigma^2) = (2\pi\sigma^2\nu)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu} \parallel{\boldsymbol \beta}\parallel^2\right). \]

Generate 30 samples from \[ y = 1 + 1\times x_1 + 2 \times x_2 + 0 \times x_3 + 0 \times x_4 + 10\times x_5 + \epsilon \] where \(\epsilon \sim \mathcal{N}(0,2)\), \(x_i \sim \mathcal{N}(0,1)\), and \({\boldsymbol \beta}=(1,1,2,0,0,10)^{{\bf T}}\).
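A minimal R sketch of this data-generating step, reading \(\mathcal{N}(0,2)\) as variance 2 and using an arbitrary seed (both assumptions):

```r
# Generate 30 samples from the stated model
set.seed(3)
n <- 30
Xmat <- matrix(rnorm(n * 5), n, 5)                 # x1, ..., x5
beta_true <- c(1, 1, 2, 0, 0, 10)                  # (intercept, beta1, ..., beta5)
X <- cbind(1, Xmat)
y <- X %*% beta_true + rnorm(n, sd = sqrt(2))      # error variance 2
```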

  1. Use the frequentist (OLS) method to estimate \(\beta\).

  2. Use Bayesian MC.

  3. Use Bayesian MCMC.

Computation Algorithm

  • In the frequentist way, that is, the OLS method, we know the estimate of \({\boldsymbol \beta}\) is \(\hat {\boldsymbol \beta}= ({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}{\bf y}\). The mean squared error is \(\parallel\hat{\boldsymbol \beta}-{\boldsymbol \beta}\parallel^2/6\).

  • In Bayesian MC way, we assume that

\(\sigma^2 \sim \mathcal{IG}(a_0, b_0)\)

\({\boldsymbol \beta}|\sigma^2 \sim \mathcal{N}(0, \sigma^2\nu I_p)\)

Then we have proved that

\(\sigma^2|{\bf y}\sim\mathcal{IG}(a^*, b^*)\), where \(a^* = \frac{n}{2}+ a_0,b^*=b_0+\frac{1}{2}{\bf y}^{{\bf T}} (I_n-{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}\)

\({\boldsymbol \beta}|{\bf y},\sigma^2 \sim\mathcal{N}(A^{-1}{\bf X}^{{\bf T}}{\bf y}, {\bf A}^{-1}\sigma^2)\), where \({\bf A}=({\bf X}^{{\bf T}}{\bf X}+ \frac{1}{\nu}I_p)\)

Hence the \({\boldsymbol \beta}^{(m)}\) are generated as in the plug-in sampling method: first draw \(\sigma^{2(m)}\) from \(\mathcal{IG}(a^*, b^*)\), then draw \({\boldsymbol \beta}^{(m)}\) from \(\mathcal{N}({\bf A}^{-1}{\bf X}^{{\bf T}}{\bf y}, {\bf A}^{-1}\sigma^{2(m)})\).

Finally, we estimate \(\hat{\boldsymbol \beta}\) by computing \(E({\boldsymbol \beta}|{\bf y})\approx\sum_{m=1}^{M} {\boldsymbol \beta}^{(m)} /M\). The mean squared error is \(\parallel\hat{\boldsymbol \beta}-{\boldsymbol \beta}\parallel^2/6\).

  • In the Bayesian MCMC way, we have shown that

    1. \(\sigma^2|{\bf y},{\boldsymbol \beta}\sim\mathcal{IG}(a^*, b^*)\), where \(a^* = \frac{n+p}{2}+ a_0,\ b^*=b_0+\frac{1}{2}\parallel {\bf y}- {\bf X}{\boldsymbol \beta}\parallel^2+ \frac{1}{2\nu}\parallel {\boldsymbol \beta}\parallel^2\)
    2. \({\boldsymbol \beta}|{\bf y},\sigma^2 \sim\mathcal{N}({\bf A}^{-1}{\bf X}^{{\bf T}}{\bf y}, {\bf A}^{-1}\sigma^2)\), where \({\bf A}=({\bf X}^{{\bf T}}{\bf X}+ \frac{1}{\nu}I_p)\)
    3. \(\beta_j|{\boldsymbol \beta}_{-j},\sigma^2,{\bf y}\sim \mathcal{N}\left((x_j^{{\bf T}}x_j +\frac{1}{\nu})^{-1}x_j^{{\bf T}}({\bf y}-{\bf X}_{-j}{\boldsymbol \beta}_{-j}),\sigma^2\left[x_j^{{\bf T}}x_j +\frac{1}{\nu}\right]^{-1}\right)\)

Then, we use the Gibbs sampling method to get \({\boldsymbol \beta}^{(m)}=(\beta_0^{(m)},\beta_1^{(m)},\ldots,\beta_5^{(m)})\). Set the initial values \((\beta_0^{(0)},\beta_1^{(0)},\ldots,\beta_5^{(0)},\sigma^{2(0)})\). Then iterate the following steps for \(t=0,1,2,\ldots\):

    1. Generate \(\sigma^{2(t+1)}\) from \(\pi(\sigma^2|{\bf y},\beta_0^{(t)},\beta_1^{(t)},\ldots,\beta_5^{(t)})\).
    2. Generate \(\beta_0^{(t+1)}\) from \(\pi(\beta_0|{\bf y},{\boldsymbol \beta}_{-0}^{(t)},\sigma^{2(t+1)})\).
    3. Update \({\boldsymbol \beta}^{(t)}\) with \({\boldsymbol \beta}^{(t)}=(\beta_0^{(t+1)},\beta_1^{(t)},\ldots,\beta_5^{(t)})\).
    4. Generate \(\beta_1^{(t+1)}\) from \(\pi(\beta_1|{\bf y},{\boldsymbol \beta}_{-1}^{(t)},\sigma^{2(t+1)})\).
    5. Update \({\boldsymbol \beta}^{(t)}\) with \({\boldsymbol \beta}^{(t)}=(\beta_0^{(t+1)},\beta_1^{(t+1)},\beta_2^{(t)},\ldots,\beta_5^{(t)})\), and similarly generate \(\beta_2^{(t+1)},\ldots,\beta_5^{(t+1)}\), updating \({\boldsymbol \beta}^{(t)}\) after generating each \(\beta_i^{(t+1)}\),

where \({\boldsymbol \beta}_{-i}^{(t)}\) is the subvector of \({\boldsymbol \beta}\) without the \(i^{th}\) element.

We set the total sample size to 5000, with a burn-in period of 2000.

Finally, we estimate \(\hat{\boldsymbol \beta}\) by computing \(E({\boldsymbol \beta}|{\bf y})\approx\sum_{m=2001}^{M} {\boldsymbol \beta}^{(m)} /(M-2000)\). The mean squared error is \(\parallel\hat{\boldsymbol \beta}-{\boldsymbol \beta}\parallel^2/6\).

Simulation Results

Bayesian vs. Frequentist
Method \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\beta_3\) \(\beta_4\) \(\beta_5\) MSE
Frequentist 1.194 0.991 2.080 -0.082 -0.051 9.608 0.035
Bayesian MC 1.195 0.972 2.084 -0.090 -0.068 9.570 0.041
Bayesian MCMC 1.187 0.986 2.071 -0.085 -0.058 9.581 0.038

Model Selection


Suppose we have \(k\) candidate models, \(\mathcal{M}=\{M_1, M_2, \ldots, M_k\}\). A Bayesian way to select the best model is to find the model which maximizes the posterior model probability given the data, that is

\[\hat M = \arg \max_{M\in\mathcal{M}}\pi(M|{\bf y})\] where \[\pi(M|{\bf y}) = \frac{f({\bf y}|M)\pi(M)}{\sum_{\tilde M\in\mathcal{M}}f({\bf y}|\tilde M)\pi(\tilde M)}\] in which, the denominator is free of \(M\). note that the prior \(\pi(M)=1/k\). Then we have \[\pi(M|{\bf y}) = \frac{f({\bf y}|M)}{\sum_{w\in\mathcal{M}}f({\bf y}|\tilde M)}\] This leads to \[\hat M = \arg\max_{M\in\mathcal{M}} f({\bf y}|M)\] where \(f({\bf y}|M) = \int f({\bf y}|{\boldsymbol \theta}_M,M)\pi({\boldsymbol \theta}_M)d{\boldsymbol \theta}_M\), where \({\boldsymbol \theta}_M\) is the parameter under model \(M\). We call \(f({\bf y}|M)\) as marginal likelihood under model \(M\),\(f({\bf y}|{\boldsymbol \theta}_M,M)\) likelihood under model \(M\) and \(\pi({\boldsymbol \theta}_M)\) prior under model M. Under \(M\) \[\begin{eqnarray*} f({\bf y}|M) &=& \int f({\bf y}|{\boldsymbol \theta},M)\pi({\boldsymbol \theta}|M)d{\boldsymbol \theta}\\ &=& \int\int f({\bf y}|{\boldsymbol \beta},\sigma^2,M)\pi({\boldsymbol \beta}|\sigma^2,M)\pi(\sigma^2|M)d{\boldsymbol \beta}d\sigma^2\\ &=& \int \int (2\pi)^{-\frac{n}{2}}(\sigma^2)^{-\frac{n}{2}}\exp\left[-\frac{1}{2\sigma^2}({\bf y}- {\bf X}{\boldsymbol \beta}_M)^{{\bf T}}({\bf y}-{\bf X}{\boldsymbol \beta}_M)\right]\times\\ &&\quad\quad (2\pi)^{-\frac{p_M}{2}}(\sigma^2\nu)^{-\frac{p_M}{2}}\exp(-\frac{1}{2\sigma^2\nu}{\boldsymbol \beta}_M^{{\bf T}}{\boldsymbol \beta}_M)\times\\ &&\quad\quad \frac{b_1^{a_1}}{\Gamma(a_1)}(\sigma^2)^{(-a_1-1)}\exp(-\frac{b_1}{\sigma^2})d{\boldsymbol \beta}_M d\sigma^2\\ &=& \frac{b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p_M}{2}}}(2\pi)^{(-\frac{n+p_M}{2})}\int \int \sigma^{2(\frac{-(n+p_M+2a_1+2)}{2})}\times\\ &&\exp\left[-\frac{1}{2\sigma^2} \left({\bf y}^T{\bf y}- 2{\boldsymbol \beta}_M^T{\bf X}^T{\bf y}+ {\boldsymbol \beta}_M^T{\bf X}^T{\bf X}{\boldsymbol \beta}_M + \frac{1}{\nu}{\boldsymbol \beta}_M^T{\boldsymbol \beta}_M +2b_1 \right)\right]d{\boldsymbol \beta}_M d\sigma^2\\ \end{eqnarray*}\] Note that \[\begin{eqnarray*} &&{\bf y}^T{\bf y}- 2{\boldsymbol \beta}_M^T{\bf X}^T{\bf y}+ {\boldsymbol \beta}_M^T{\bf X}^T{\bf X}{\boldsymbol \beta}_M + \frac{1}{\nu}{\boldsymbol \beta}_M^T{\boldsymbol \beta}_M +2b_1\\ &=&{\boldsymbol \beta}_M^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_{p_M}\right){\boldsymbol \beta}_M - 2 {\boldsymbol \beta}_M^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+ \frac{1}{\nu}{\bf I}_{p_M}\right)\left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_{p_M}\right)^{-1} {\bf X}^{{\bf T}} {\bf y}+ {\bf y}^T{\bf y}+2b_1\\ &=&{\boldsymbol \beta}_M^{{\bf T}}{\bf A}{\boldsymbol \beta}_M - 2{\boldsymbol \beta}_M^{{\bf T}}{\bf A}\tilde{{\boldsymbol \beta}_M} + \tilde{{\boldsymbol \beta}_M}^{{\bf T}} {\bf A}\tilde{{\boldsymbol \beta}_M} - \tilde{{\boldsymbol \beta}_M}^{{\bf T}}{\bf A}\tilde{{\boldsymbol \beta}_M}+ {\bf y}^{{\bf T}}{\bf y}+2b_1\\ &=& ({\boldsymbol \beta}_M- \tilde{{\boldsymbol \beta}_M})^T{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}_M}) + {\bf y}^{{\bf T}}{\bf y}- {\bf y}^{{\bf T}}{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}{\bf y}+ 2b_1\\ &=& ({\boldsymbol \beta}_M- \tilde{{\boldsymbol \beta}_M})^T{\bf A}({\boldsymbol \beta}_M-\tilde{{\boldsymbol \beta}_M}) + {\bf y}^{{\bf T}}({\bf I}_n-{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}+ 2b_1 \end{eqnarray*}\] Therefore, \[\begin{eqnarray*} f({\bf y}|M) &=&\frac{b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p{_M}}{2}}}(2\pi)^{\left(-\frac{n+p{_M}}{2}\right)}\int \int \sigma^{2\left(\frac{-(n+p_M+2a_1+2)}{2}\right)}\times \\ &&\exp\left[-\frac{1}{2\sigma^2}\left(({\boldsymbol 
\beta}_M- \tilde{{\boldsymbol \beta}_M})^T{\bf A}({\boldsymbol \beta}_M-\tilde{{\boldsymbol \beta}_M}) + {\bf y}^{{\bf T}}({\bf I}_n-{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}+ 2b_1\right)\right]d{\boldsymbol \beta}_M d\sigma^2\\ &=& \frac{b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p_M}{2}}}(2\pi)^{\left(-\frac{n+p_M}{2}\right)}\int\sigma^{2\left(\frac{-(n+p_M+2a_1+2)}{2}\right)}\times\\ &&(2\pi)^{\frac{p_M}{2}}\lvert{\bf A}\rvert^{-\frac{1}{2}}\exp\left[-\frac{1}{2\sigma^2}\left( {\bf y}^{{\bf T}}({\bf I}_n-{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}+ 2b_1\right)\right]d\sigma^2\\ &=& \frac{b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p_M}{2}}}(2\pi)^{(-\frac{n}{2})}\lvert{\bf A}\rvert^{-\frac{1}{2}}\int \sigma^{2(-a_M^*-1)}\exp(-\frac{b_M^*}{\sigma^2})d\sigma^2\\ &=& \frac{b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p_M}{2}}}(2\pi)^{(-\frac{n}{2})}\lvert{\bf A}\rvert^{-\frac{1}{2}} \Gamma(a_M^*)b_M^{*(-a_M^*)}\\ &=&(2\pi)^{(-\frac{n}{2})}\frac{\Gamma(a_M^*)b_M^{*(-a_M^*)}b_1^{a_1}}{\Gamma(a_1)\nu^{\frac{p_M}{2}}}\lvert{\bf A}\rvert^{-\frac{1}{2}} \end{eqnarray*}\] where \[\begin{eqnarray*} {\bf A}&=&{\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_{p_M}, \quad\tilde{{\boldsymbol \beta}_M} = {\bf A}^{-1} {\bf X}^{{\bf T}} {\bf y}\\ a_M^* &=& \frac{n+p_M}{2}+a_1,\quad b_M^*=\frac{1}{2}{\bf y}^{{\bf T}}({\bf I}_n-{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}+ b_1 \end{eqnarray*}\] After Laplace approximation, \(f({\bf y}|M)\) is equivalent to BIC for large n.
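The closed form above is easy to evaluate numerically; here is a rough R sketch of the log marginal likelihood \(\log f({\bf y}|M)\), with the hyperparameters \(a_1, b_1, \nu\) passed as arguments whose default values are assumptions.

```r
# Closed-form log marginal likelihood log f(y | M) for the design matrix XM of model M
log_marg_lik <- function(y, XM, a1 = 2, b1 = 2, nu = 10) {
  n <- length(y); pM <- ncol(XM)
  A <- t(XM) %*% XM + diag(pM) / nu
  b_star <- b1 + 0.5 * drop(t(y) %*% y - t(y) %*% XM %*% solve(A, t(XM) %*% y))
  a_star <- (n + pM) / 2 + a1
  -n / 2 * log(2 * pi) + a1 * log(b1) - lgamma(a1) - pM / 2 * log(nu) +
    lgamma(a_star) - a_star * log(b_star) -
    0.5 * as.numeric(determinant(A)$modulus)       # -0.5 * log|A|
}
```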

MCMC

Minghui Chen (IWDE)

The goal is to compute \[ m({\bf y}) = \int f({\bf y}|{\boldsymbol \theta})\pi({\boldsymbol \theta})d{\boldsymbol \theta} \]

\[\begin{eqnarray} \pi({\boldsymbol \theta}|{\bf y})=\frac{f({\bf y}|{\boldsymbol \theta})\pi({\boldsymbol \theta})}{\int f({\bf y}|{\boldsymbol \theta})\pi({\boldsymbol \theta})d{\boldsymbol \theta}} \end{eqnarray}\] Let \(g({\boldsymbol \theta})\) be a density function of \({\boldsymbol \theta}\)

Then,

\[\begin{eqnarray*} &&\int g({\boldsymbol \theta})d{\boldsymbol \theta}= 1 \\ \Leftrightarrow && \int g({\boldsymbol \theta})\frac{\pi({\boldsymbol \theta}|{\bf y})}{\pi({\boldsymbol \theta}|{\bf y})}d{\boldsymbol \theta}=1\\ \Leftrightarrow && \int g({\boldsymbol \theta})\frac{\pi({\boldsymbol \theta}|{\bf y})}{\frac{f({\bf y}|{\boldsymbol \theta}) \pi({\boldsymbol \theta})}{m({\bf y})}}d{\boldsymbol \theta}=1\\ \Leftrightarrow && m({\bf y})\int \frac{g({\boldsymbol \theta})}{f({\bf y}|{\boldsymbol \theta})\pi({\boldsymbol \theta})}\pi({\boldsymbol \theta}|{\bf y}) d{\boldsymbol \theta}=1\\ \Leftrightarrow && m({\bf y}) = \left[\int \frac{g({\boldsymbol \theta})}{f({\bf y}|{\boldsymbol \theta})\pi({\boldsymbol \theta})}\pi({\boldsymbol \theta}|{\bf y}) d{\boldsymbol \theta}\right]^{-1}\\ &&= E\left[\frac{g({\boldsymbol \theta})}{f({\bf y}|{\boldsymbol \theta})\pi({\boldsymbol \theta})}|{\bf y}\right]^{-1}\\ &&\approx \left[\sum_{m=1}^{M}\frac{g({\boldsymbol \theta}^{(m)})}{f({\bf y}|{\boldsymbol \theta}^{(m)})\pi({\boldsymbol \theta}^{(m)})}/M\right]^{-1} \end{eqnarray*}\] where \({\boldsymbol \theta}^{(m)}\) is from \(\pi({\boldsymbol \theta}|{\bf y})\).

Therefore, maximizing \(m({\bf y})\) over candidate models is equivalent to maximizing \[ E\left[\frac{g({\boldsymbol \theta})}{f({\bf y}|{\boldsymbol \theta})\pi({\boldsymbol \theta})}|{\bf y}\right]^{-1} \]

Now let’s use Minghui Chen’s method to compute \(f({\bf y}|M_i)\) for each candidate model and select the model that maximizes it.

\(f({\bf y}|M_i)=\int f({\bf y}|{\boldsymbol \theta},M_i)\pi({\boldsymbol \theta}|M_i)d{\boldsymbol \theta}\), where \({\boldsymbol \theta}=(\beta_0,\beta_1,\sigma^2)\) \[\begin{eqnarray} \pi({\boldsymbol \theta}|{\bf y},M_i)&=&\frac{f({\bf y}|{\boldsymbol \theta},M_i)\pi({\boldsymbol \theta}|M_i)}{\int f({\bf y}|{\boldsymbol \theta},M_i)\pi({\boldsymbol \theta}|M_i)d{\boldsymbol \theta}}\\ &=&\frac{f({\bf y}|{\boldsymbol \theta},M_i)\pi({\boldsymbol \theta}|M_i)}{f({\bf y}|M_i)} \end{eqnarray}\] Let \(g({\boldsymbol \theta})\) be a density function of \({\boldsymbol \theta}\)

Then

\[\begin{eqnarray*} &&\int g({\boldsymbol \theta})d{\boldsymbol \theta}= 1 \\ \Leftrightarrow && \int g({\boldsymbol \theta})\frac{\pi({\boldsymbol \theta}|{\bf y},M_i)}{\pi({\boldsymbol \theta}|{\bf y},M_i)}d{\boldsymbol \theta}=1\\ \Leftrightarrow && \int g({\boldsymbol \theta})\frac{\pi({\boldsymbol \theta}|{\bf y},M_i)}{\frac{f({\bf y}|{\boldsymbol \theta},M_i) \pi({\boldsymbol \theta}|M_i)}{f({\bf y}|M_i)}}d{\boldsymbol \theta}=1\\ \Leftrightarrow && f({\bf y}|M_i)\int \frac{g({\boldsymbol \theta})}{f({\bf y}|{\boldsymbol \theta},M_i)\pi({\boldsymbol \theta}|M_i)}\pi({\boldsymbol \theta}|{\bf y},M_i) d{\boldsymbol \theta}=1\\ \Leftrightarrow && f({\bf y}|M_i) = \left[\int \frac{g({\boldsymbol \theta})}{f({\bf y}|{\boldsymbol \theta},M_i)\pi({\boldsymbol \theta}|M_i)}\pi({\boldsymbol \theta}|{\bf y},M_i) d{\boldsymbol \theta}\right]^{-1}\\ &&= E\left[\frac{g({\boldsymbol \theta})}{f({\bf y}|{\boldsymbol \theta},M_i)\pi({\boldsymbol \theta}|M_i)}|{\bf y},M_i\right]^{-1}\\ &&\approx \left[\sum_{n=1}^{N}\frac{g({\boldsymbol \theta}^{(n)})}{f({\bf y}|{\boldsymbol \theta}^{(n)},M_i)\pi({\boldsymbol \theta}^{(n)}|M_i)}/N\right]^{-1} \end{eqnarray*}\] where \({\boldsymbol \theta}^{(n)}\) is from \(\pi({\boldsymbol \theta}|{\bf y},M_i), i=1,2,\ldots, k\).

The issue here is the choice of \(g({\boldsymbol \theta})\) that minimizes the MC error.

Recall \[ \pi({\boldsymbol \theta}|{\bf y}) \propto f({\bf y}|{\boldsymbol \theta})\pi({\boldsymbol \theta})\approx g({\boldsymbol \theta}) \] Let \(g({\boldsymbol \theta})\) be a normal density such that \(E_g({\boldsymbol \theta})=\) posterior mean \(\approx\sum_{m=1}^{M}{\boldsymbol \theta}^{(m)}/M = \bar {\boldsymbol \theta}\), \(V_g({\boldsymbol \theta})=\) posterior variance matrix \(\approx \sum_{m=1}^{M}({\boldsymbol \theta}^{(m)}-\bar {\boldsymbol \theta})({\boldsymbol \theta}^{(m)}-\bar {\boldsymbol \theta})^{{\bf T}}/M\), where \({\boldsymbol \theta}^{(m)}\) is a MCMC sample from \(\pi({\boldsymbol \theta}|{\bf y})\).

For given model M, using MCMC, generate \({\boldsymbol \theta}_M^{(k)}\) from \(\pi({\boldsymbol \theta}_M|{\bf y}, M)\), then compute \(\bar {\boldsymbol \theta}_M\) and \(V_M\).

Define \(g({\boldsymbol \theta}|M) = \mathcal{N}({\boldsymbol \theta};\bar {\boldsymbol \theta}_M, V_M)\).

Compute

\[\begin{eqnarray} \tag{4}f({\bf y}|M)\approx \left[\sum_{k=1}^{K}\frac{g({\boldsymbol \theta}^{(k)}|M)}{f({\bf y}|{\boldsymbol \theta}^{(k)},M)\pi({\boldsymbol \theta}^{(k)}|M)}/K\right]^{-1} \end{eqnarray}\]
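A rough R sketch of Eq. (4); it assumes you already have posterior draws `theta` (one MCMC sample per row) together with functions `log_lik(theta_row)` and `log_prior(theta_row)` for the given model, and it requires the mvtnorm package. It works on the log scale for numerical stability.

```r
# Chen-type estimate of log f(y | M) from MCMC draws, following Eq. (4)
chen_log_marginal <- function(theta, log_lik, log_prior) {
  mu <- colMeans(theta)                           # fit the normal g() to the draws
  V  <- cov(theta)
  log_g <- mvtnorm::dmvnorm(theta, mean = mu, sigma = V, log = TRUE)
  log_ratio <- log_g - (apply(theta, 1, log_lik) + apply(theta, 1, log_prior))
  # f(y|M) ~ [ mean_k g / (f * pi) ]^{-1}; take logs with the log-sum-exp trick
  m <- max(log_ratio)
  -(m + log(mean(exp(log_ratio - m))))
}
```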

Bayesian Simulation Case Study

Setup

  • Consider

\[ f({\bf y}|{\boldsymbol \beta},\sigma^2) = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2} \parallel{\bf y}-{\bf X}{\boldsymbol \beta}\parallel^2\right), \] \[ \pi({\boldsymbol \beta}|\sigma^2) = (2\pi\sigma^2\nu)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu} \parallel{\boldsymbol \beta}\parallel^2\right). \]

Generate 30 samples from \[ y = 1 + 1\times x_1 + 2 \times x_2 + 0 \times x_3 + 0 \times x_4 + 10\times x_5 + \epsilon \] where \(\epsilon \sim \mathcal{N}(0,2)\), \(x_i \sim \mathcal{N}(0,1)\), and \({\boldsymbol \beta}=(1,1,2,0,0,10)^{{\bf T}}\).

Therefore, the true model, say \(M_{true}\), is \[ y = 1 + 1\times x_1 + 2 \times x_2 + 10\times x_5 + \epsilon \]

    1. Use the Bayesian MC method to find the true model.
    2. Use the frequentist method to find the true model.

Our candidate models are as follows: \[ M_1: y = \beta_0 + \epsilon; \] \[ M_2: y = \beta_0 + \beta_1x_1+ \epsilon; \] \[ \ldots \] \[ M_{32}: y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5 + \epsilon. \]

  • For the Bayesian way, compute

\(f({\bf y}|M_k)\) for \(k = 1,2,\ldots, 32\) and select the best model \(\hat M\), such that \[ f({\bf y}|\hat M) \geq f({\bf y}| M_k) \] If \(\hat M = M_{true}\), then assign \(Z_i = 1\), otherwise \(Z_i = 0\).

Repeat the above procedure for \(R=500\) times, and compute \(\sum_{i = 1}^{R} \frac{Z_i}{R} \simeq P(\) selecting true model with Bayes method \()\).

  • For the frequentist way, similarly, fit the linear model for each of the 32 candidate models and compute AIC and BIC. If the model with the smallest AIC or BIC is the true model, then \(W_i = 1\), otherwise \(W_i = 0\). Repeat this 500 times and compute \(\sum_{i = 1}^{R} \frac{W_i}{R}\).

We consider sample sizes of 20, 50, 100 and 300 so that we can see how the sample size impacts the probability of selecting the true model.

  • For MCMC, sample with the MCMC method, then compute Eq. (4) for the 32 candidate models and select the best model \(\hat M\), such that \[ f({\bf y}|\hat M) \geq f({\bf y}| M_k). \] If \(\hat M = M_{true}\), then assign \(Z_i = 1\); otherwise \(Z_i = 0\).

Repeat the above procedure for \(R=500\) times, and compute \(\sum_{i = 1}^{R} \frac{Z_i}{R} \simeq P(\) selecting true model with MCMC method \()\)
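As a rough R sketch of a single replication of this comparison, the code below enumerates the 32 candidate models (intercept plus every subset of \(x_1,\ldots,x_5\)), scores each with the closed-form marginal likelihood `log_marg_lik` from the model-selection section and with AIC/BIC from `lm()`, and records whether the true model is selected. It reuses `Xmat` and `y` from the data-generation sketch; the helper names are assumptions, and the MCMC-based scoring via Eq. (4) is not shown.

```r
# One replication: Bayes (marginal likelihood) vs. AIC/BIC over the 32 candidate models
subsets  <- expand.grid(rep(list(c(FALSE, TRUE)), 5))     # 2^5 = 32 inclusion patterns
true_set <- c(TRUE, TRUE, FALSE, FALSE, TRUE)             # true model uses x1, x2, x5

score_model <- function(sel) {
  XM  <- cbind(1, Xmat[, sel, drop = FALSE])              # intercept always included
  fit <- lm(as.vector(y) ~ XM - 1)                        # same model, frequentist fit
  c(logml = log_marg_lik(as.vector(y), XM), aic = AIC(fit), bic = BIC(fit))
}
scores <- t(apply(subsets, 1, function(s) score_model(as.logical(s))))

Z <- all(unlist(subsets[which.max(scores[, "logml"]), ]) == true_set)  # Bayes pick correct?
W <- all(unlist(subsets[which.min(scores[, "bic"]),   ]) == true_set)  # BIC pick correct?
# repeating this R = 500 times and averaging Z (or W) estimates the selection probability
```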

Result

Sample size MC AIC BIC MCMC Error
20 0.502 0.560 0.656 0.508 59.401
50 0.894 0.668 0.882 0.880 78.224
100 0.998 0.676 0.916 0.926 93.176
300 0.998 0.712 0.978 NA NA