Bayesian Linear Model

Caleb Jin / 2019-04-18


Linear model

Consider a multiple linear regression model as follows: $$y=X\beta+\epsilon,$$ where $y=(y_1,y_2,\dots,y_n)^T$ is the $n$-dimensional response vector, $X=(x_1,x_2,\dots,x_n)^T$ is the $n\times p$ design matrix, and $\epsilon\sim N_n(0,\sigma^2 I_n)$. We assume that $p<n$ and $X$ is of full rank.

Maximum likelihood estimation (MLE) approach

Since $y\sim N_n(X\beta,\sigma^2 I_n)$, the likelihood function is given as $$L(\beta,\sigma^2)=f(y|\beta,\sigma^2)=(2\pi)^{-\frac{n}{2}}|\Sigma|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(y-X\beta)^T\Sigma^{-1}(y-X\beta)\right\},$$

where $\Sigma=\sigma^2 I_n$. Then the log-likelihood can be written as $$\ell(\beta,\sigma^2)=\log L(\beta,\sigma^2)=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta).$$ Note that $\ell(\beta,\sigma^2)$ is a concave function in $(\beta,\sigma^2)$. Hence, the maximum likelihood estimator (MLE) can be obtained by solving the following equations: $$\frac{\partial \ell(\beta,\sigma^2)}{\partial\beta}=-\frac{1}{2\sigma^2}\left(-2X^Ty+2X^TX\beta\right)=0;\qquad \frac{\partial \ell(\beta,\sigma^2)}{\partial\sigma^2}=-\frac{n}{2}\frac{1}{\sigma^2}+\frac{1}{2}\frac{1}{(\sigma^2)^2}(y-X\beta)^T(y-X\beta)=0.$$ Therefore, the MLEs of $\beta$ and $\sigma^2$ are given as $$\hat\beta=(X^TX)^{-1}X^Ty;\qquad \hat\sigma^2=\frac{(y-X\hat\beta)^T(y-X\hat\beta)}{n}=\frac{\|y-\hat y\|^2}{n},$$ where $\hat y=X\hat\beta$.
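As a quick sanity check, here is a minimal R sketch of these MLE formulas; the simulated design, sample size, and true coefficients below are assumptions made only for illustration.

```r
# Minimal sketch: MLE of a linear model on simulated data (illustrative settings)
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(1, 2, -1)
y <- X %*% beta_true + rnorm(n, sd = 1)

beta_hat   <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
y_hat      <- X %*% beta_hat
sigma2_hat <- sum((y - y_hat)^2) / n         # MLE divides by n, not n - p

beta_hat; sigma2_hat
```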

Distribution of $\hat\beta$ and $\hat\sigma^2$

Note that if $z\sim N_k(\mu,\Sigma)$, then $Az\sim N_m(A\mu,A\Sigma A^T)$, where $A$ is an $m\times k$ matrix. Since $y\sim N_n(X\beta,\sigma^2 I_n)$ and $\hat\beta=(X^TX)^{-1}X^Ty$, we have $$\hat\beta\sim N_p\left((X^TX)^{-1}X^TX\beta,\ \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1}\right)=N_p\left(\beta,\ \sigma^2(X^TX)^{-1}\right).$$

Note that $y-\hat y=(I_n-X(X^TX)^{-1}X^T)y$, where $I_n-X(X^TX)^{-1}X^T$ is an idempotent matrix with rank $(n-p)$.

Can you prove that $I_n-X(X^TX)^{-1}X^T$ is an idempotent matrix of rank $(n-p)$?

Proof. Let $H=X(X^TX)^{-1}X^T$.

$$HH=X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T=X(X^TX)^{-1}X^T=H,$$ thus $H$ is an idempotent matrix.

Similarly, as $(I_n-H)(I_n-H)=I_n-2H+HH=I_n-H$, $I_n-H$ is also idempotent.

Hence we have $$r(I_n-H)=\mathrm{tr}(I_n-H)=n-\mathrm{tr}(H)=n-\mathrm{tr}\left((X^TX)^{-1}X^TX\right)=n-p.$$

How do we prove that $r(I_n-H)=\mathrm{tr}(I_n-H)$?

The eigenvalues of an idempotent matrix are either 1 or 0; hence its rank equals the sum of its eigenvalues, which in turn equals the trace of the matrix.

How do we prove that the trace of a matrix is the sum of its eigenvalues?

It follows from the characteristic polynomial: the coefficient of $\lambda^{n-1}$ in $\det(\lambda I_n-A)$ equals $-\mathrm{tr}(A)$, and factoring the polynomial over its roots shows that the same coefficient equals $-\sum_{i=1}^n\lambda_i$.

Another way to prove this is given in this link.

From Lemma 1, we have that $$\frac{n\hat\sigma^2}{\sigma^2}\sim\chi^2(n-p),\tag{1}$$ where $\chi^2(n-p)$ denotes the chi-squared distribution with $n-p$ degrees of freedom.

Can you prove Eq. (1)?

Proof. By Lemma 1, we have
$$\frac{n\hat\sigma^2}{\sigma^2}=\frac{\hat e^T\hat e}{\sigma^2}=y^T\left(\frac{I_n-H}{\sigma^2}\right)y=y^TAy\sim\chi^2\left(\mathrm{tr}(A\Sigma),\ \mu^TA\mu/2\right),$$

where $A=\frac{I_n-H}{\sigma^2}$ and $\mu^TA\mu/2=(X\beta)^T\left(\frac{I_n-H}{\sigma^2}\right)(X\beta)/2=0$. Since $A\Sigma=\left(\frac{I_n-H}{\sigma^2}\right)\sigma^2I_n=I_n-H$, we have $r(A\Sigma)=n-p$. Therefore $\frac{n\hat\sigma^2}{\sigma^2}\sim\chi^2(n-p)$.
Lemma 1. Let $z\sim N_k(\mu,\Sigma)$ with $r(\Sigma)=k$, where $r(\Sigma)$ denotes the rank of $\Sigma$. If $A\Sigma$ (or $\Sigma A$) is an idempotent matrix of rank $m$, then $z^TAz$ follows the non-central chi-squared distribution with $m$ degrees of freedom and non-centrality parameter $\lambda=\mu^TA\mu/2$.
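Eq. (1) can also be checked numerically. Below is a rough Monte Carlo sketch (the dimensions, true $\beta$, and $\sigma^2$ are illustrative assumptions) comparing the empirical mean and variance of $n\hat\sigma^2/\sigma^2$ with those of a $\chi^2(n-p)$ distribution.

```r
# Rough Monte Carlo check of Eq. (1) under illustrative settings
set.seed(2)
n <- 50; p <- 4; sigma2 <- 2
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)
H <- X %*% solve(t(X) %*% X) %*% t(X)

stat <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sqrt(sigma2))
  sum(((diag(n) - H) %*% y)^2) / sigma2   # n * sigma2_hat / sigma2
})
mean(stat); n - p        # empirical mean vs. chi-squared mean n - p
var(stat); 2 * (n - p)   # empirical variance vs. chi-squared variance 2(n - p)
```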

Bayesian approach

$\sigma^2$ is known

Suppose $\sigma^2$ is known. We define the prior distribution of $\beta$ by $\beta\sim N_p(0,\sigma^2\nu I_p)$. Then the posterior density of $\beta$ can be obtained by
$$\begin{aligned}
\pi(\beta|y)&\propto f(y|\beta)\pi(\beta)\\
&\propto \exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right)\times\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right)\\
&=\exp\left[-\frac{1}{2\sigma^2}\left\{(y-X\beta)^T(y-X\beta)+\frac{1}{\nu}\beta^T\beta\right\}\right]\\
&\propto\exp\left\{-\frac{1}{2\sigma^2}\left(-2\beta^TX^Ty+\beta^TX^TX\beta+\frac{1}{\nu}\beta^T\beta\right)\right\}\\
&=\exp\left\{-\frac{1}{2\sigma^2}\left(-2\beta^T\left(X^TX+\tfrac{1}{\nu}I_p\right)\left(X^TX+\tfrac{1}{\nu}I_p\right)^{-1}X^Ty+\beta^T\left(X^TX+\tfrac{1}{\nu}I_p\right)\beta\right)\right\}\\
&\propto\exp\left\{-\frac{1}{2\sigma^2}(\beta-\tilde\beta)^T\left(X^TX+\tfrac{1}{\nu}I_p\right)(\beta-\tilde\beta)\right\},
\end{aligned}$$
where $\tilde\beta=\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}X^Ty$.

This implies that
$$\beta|y\sim N_p\left(\left(X^TX+\tfrac{1}{\nu}I_p\right)^{-1}X^Ty,\ \sigma^2\left(X^TX+\tfrac{1}{\nu}I_p\right)^{-1}\right).\tag{2}$$
The Bayesian estimate is $\hat\beta_{\rm Bayes}=\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}X^Ty$. It is worth noting that as $\nu\to\infty$, the posterior distribution converges to $N_p(\hat\beta_{\rm MLE},\sigma^2(X^TX)^{-1})$, mirroring the sampling distribution $\hat\beta_{\rm MLE}\sim N_p(\beta,\sigma^2(X^TX)^{-1})$.
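A minimal R sketch of Eq. (2); the values of $\nu$, $\sigma^2$, and the simulated data are illustrative assumptions.

```r
# Posterior mean and covariance from Eq. (2) when sigma^2 is known (illustrative settings)
set.seed(3)
n <- 100; p <- 2; sigma2 <- 1; nu <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, 2) + rnorm(n, sd = sqrt(sigma2))

A <- t(X) %*% X + diag(p) / nu
post_mean <- solve(A, t(X) %*% y)   # (X'X + I_p/nu)^{-1} X'y
post_cov  <- sigma2 * solve(A)      # sigma^2 (X'X + I_p/nu)^{-1}
post_mean; post_cov
```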

$\sigma^2$ is unknown

In general, $\sigma^2$ is unknown. Then we need to assign a reasonable prior distribution to $\sigma^2$. We consider the inverse gamma distribution, which is the conjugate prior here: $\sigma^2\sim IG(a_0,b_0)$ with density function $$\pi(\sigma^2)=\frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left(-\frac{b_0}{\sigma^2}\right),$$ where $a_0>0$ and $b_0>0$. In addition, we need to introduce a prior for $\beta|\sigma^2$:

$$\beta|\sigma^2\sim N_p\left(0,\ \sigma^2\nu I_p\right).$$

Today, we derive the joint posterior distribution $$\pi(\beta,\sigma^2|y)\propto f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\pi(\sigma^2).$$

Show that

    1. $\pi(\sigma^2|\beta,y)=IG(a^*,b^*)$;
    2. $\pi(\beta|y)$ is a multivariate t-distribution.

The posterior density function of $\sigma^2$ given $\beta,y$ can be obtained as follows.

Proof (1).
$$\begin{aligned}
\pi(\sigma^2|\beta,y)&=\frac{f(y|\beta,\sigma^2)\pi(\beta,\sigma^2)}{\int f(y|\beta,\sigma^2)\pi(\beta,\sigma^2)d\sigma^2}=\frac{f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\pi(\sigma^2)}{\int f(y,\beta,\sigma^2)d\sigma^2}\\
&\propto f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\pi(\sigma^2)\\
&\propto(\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right)\times(\sigma^2)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right)\times(\sigma^2)^{-a_0-1}\exp\left(-\frac{b_0}{\sigma^2}\right)\\
&=(\sigma^2)^{-\left(\frac{n+p}{2}+a_0\right)-1}\exp\left[-\frac{1}{\sigma^2}\left\{\frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2+b_0\right\}\right]\\
&=(\sigma^2)^{-a^*-1}\exp\left(-\frac{b^*}{\sigma^2}\right),
\end{aligned}$$
where $a^*=\frac{n+p}{2}+a_0$ and $b^*=\frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2+b_0$.

This implies that $$\sigma^2|\beta,y\sim IG\left(\frac{n+p}{2}+a_0,\ \frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2+b_0\right).$$


Proof (2).
$$\begin{aligned}
\pi(\beta|y)&=\int\pi(\beta,\sigma^2|y)d\sigma^2=\int\frac{f(y|\beta,\sigma^2)\pi(\beta,\sigma^2)}{\int\int f(y|\beta,\sigma^2)\pi(\beta,\sigma^2)d\beta d\sigma^2}d\sigma^2\\
&\propto\int f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\pi(\sigma^2)d\sigma^2\propto\frac{\Gamma(a^*)}{(b^*)^{a^*}}\propto(b^*)^{-a^*}\\
&=\left[\frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2+b_0\right]^{-a^*}\\
&\propto\left[(y-X\beta)^T(y-X\beta)+\frac{1}{\nu}\beta^T\beta+2b_0\right]^{-a^*}\\
&=\left[y^Ty-2\beta^TX^Ty+\beta^TX^TX\beta+\frac{1}{\nu}\beta^T\beta+2b_0\right]^{-a^*}\\
&=\left[\beta^T\left(X^TX+\tfrac{1}{\nu}I_p\right)\beta-2\beta^T\left(X^TX+\tfrac{1}{\nu}I_p\right)\left(X^TX+\tfrac{1}{\nu}I_p\right)^{-1}X^Ty+y^Ty+2b_0\right]^{-a^*}\\
&=\left[\beta^TA\beta-2\beta^TA\tilde\beta+\tilde\beta^TA\tilde\beta-\tilde\beta^TA\tilde\beta+y^Ty+2b_0\right]^{-a^*}\\
&=\left[(\beta-\tilde\beta)^TA(\beta-\tilde\beta)+y^Ty-y^TXA^{-1}X^Ty+2b_0\right]^{-a^*}\\
&=\left[(\beta-\tilde\beta)^TA(\beta-\tilde\beta)+y^T(I_n-XA^{-1}X^T)y+2b_0\right]^{-a^*}\\
&=\left[(\beta-\tilde\beta)^TA(\beta-\tilde\beta)+c\right]^{-a^*}\\
&\propto\left[1+\frac{1}{c}(\beta-\tilde\beta)^TA(\beta-\tilde\beta)\right]^{-\frac{n+p+2a_0}{2}}\\
&=\left[1+\frac{\nu^*}{\nu^*c}(\beta-\tilde\beta)^TA(\beta-\tilde\beta)\right]^{-\frac{p+\nu^*}{2}}\\
&=\left[1+\frac{1}{\nu^*}(\beta-\tilde\beta)^T\left(\tfrac{c}{\nu^*}A^{-1}\right)^{-1}(\beta-\tilde\beta)\right]^{-\frac{p+\nu^*}{2}}.
\end{aligned}$$
This implies that $$\beta|y\sim MT\left(\tilde\beta,\ \tfrac{c}{\nu^*}A^{-1},\ \nu^*\right),\tag{3}$$

where
$$A=X^TX+\tfrac{1}{\nu}I_p,\quad\tilde\beta=A^{-1}X^Ty,\quad c=y^T(I_n-XA^{-1}X^T)y+2b_0,\quad\nu^*=n+2a_0.$$
(The degrees of freedom are written $\nu^*$ to avoid confusion with the prior scale $\nu$.)


Note: the density of a multivariate t-distribution $MT(\mu,\Sigma,\nu)$ is $$\frac{\Gamma[(\nu+p)/2]}{\Gamma(\nu/2)\nu^{p/2}\pi^{p/2}|\Sigma|^{1/2}}\left[1+\frac{1}{\nu}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]^{-(\nu+p)/2}.$$
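For later comparison, here is a small R sketch of the exact posterior mean and covariance implied by Eq. (3), assuming the standard multivariate t facts that the mean is $\tilde\beta$ and the covariance is $\frac{\nu^*}{\nu^*-2}$ times the scale matrix (for $\nu^*>2$); the function name and argument layout are my own.

```r
# Exact posterior moments implied by Eq. (3), under standard multivariate-t facts:
# mean = beta_tilde, covariance = c / (nu_star - 2) * A^{-1}  (requires nu_star > 2)
posterior_t_moments <- function(X, y, nu, a0, b0) {
  p <- ncol(X); n <- nrow(X)
  A <- t(X) %*% X + diag(p) / nu
  beta_tilde <- solve(A, t(X) %*% y)
  cc <- drop(t(y) %*% y - t(y) %*% X %*% solve(A, t(X) %*% y)) + 2 * b0
  nu_star <- n + 2 * a0
  list(mean = beta_tilde, cov = cc / (nu_star - 2) * solve(A))
}
```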

Monte Carlo Simulation

Suppose $\sigma^2$ is unknown. If $\beta\sim N_p(0,\sigma^2\nu I_p)$, we know the distribution of $\beta|y$ is Eq. (3). From this known distribution, we can easily compute $E(\beta|y)$ and $V(\beta|y)$. In practice, however, the density $\pi(\beta|y)$ is often unknown or very complicated, so these integrals cannot be solved directly. Hence we turn to numerical methods, such as Monte Carlo methods.

For example, suppose the density $\pi(\theta|y)$ has no tractable form and we want $E(\theta|y)=\int\theta\pi(\theta|y)d\theta$. According to Lemma 2, $\bar X_n\to m=E(X)$ as $n\to\infty$. Therefore, if we independently generate $\theta^{(1)},\theta^{(2)},\dots,\theta^{(M)}$ from $\pi(\theta|y)$, then $M^{-1}\sum_{k=1}^{M}\theta^{(k)}\to E(\theta|y)$ as $M\to\infty$. However, there are two problems.

  1. What if we cannot generate samples from $\pi(\theta|y)$?

  2. What if the samples are not independently and identically distributed?

The solutions to these two questions are the Monte Carlo (MC) method and Markov chain Monte Carlo (MCMC), respectively.

For the first question, we can use the importance sampling method.

Lemma 2 (Strong Law of Large Numbers)

Let $X_1,X_2,\dots$ be a sequence of i.i.d. random variables, each having finite mean $m$. Then $$\Pr\left(\lim_{n\to\infty}\frac{1}{n}(X_1+X_2+\dots+X_n)=m\right)=1.$$

$\sigma^2$ is known

Importance Sampling

To estimate the posterior mean of the parameters, we would usually sample from $\pi(\theta|y)$ and compute the sample mean. However, in practice it is often hard to generate samples directly from $\pi(\theta|y)$, because we do not know which specific distribution it is. Hence we resort to the importance sampling method.

Note that $$E(\theta|y)=\int\theta\pi(\theta|y)d\theta=\int\theta\frac{\pi(\theta|y)}{g(\theta)}g(\theta)d\theta=\int h(\theta)g(\theta)d\theta=E_g\{h(\theta)\},$$ where $h(\theta)=\theta\frac{\pi(\theta|y)}{g(\theta)}$ and $g(\theta)$ is a known pdf.

To estimate the variance of the parameters, similarly, $$\mathrm{Var}(\theta|y)=E[\theta-E(\theta|y)]^2=\int(\theta-E(\theta|y))^2\pi(\theta|y)d\theta=\int(\theta-E(\theta|y))^2\frac{\pi(\theta|y)}{g(\theta)}g(\theta)d\theta=\int h^*(\theta)g(\theta)d\theta=E_g\{h^*(\theta)\},$$ where $h^*(\theta)=(\theta-E(\theta|y))^2\frac{\pi(\theta|y)}{g(\theta)}$, and $g(\theta)$ is a known pdf.

The importance sampling method can be implemented as follows:

Draw $\theta_1,\dots,\theta_M$ independently from $g(\theta)$ and note that $$\frac{\sum_{i=1}^{M}h(\theta_i)}{M}\to E_g(h(\theta))=E(\theta|y)\quad\text{as }M\to\infty.$$ Estimating the variance by importance sampling is similar.
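A minimal one-dimensional sketch of this idea; the "posterior" and the proposal densities below are illustrative assumptions, not the posterior derived in this post.

```r
# Importance sampling sketch: we pretend pi(theta | y) is a N(1, 0.5^2) density that we
# can evaluate but not sample from, and use g = N(0, 1) as the known proposal.
set.seed(5)
M <- 5000
post_dens <- function(theta) dnorm(theta, mean = 1, sd = 0.5)  # stand-in for pi(theta | y)
g_dens    <- function(theta) dnorm(theta, mean = 0, sd = 1)

theta <- rnorm(M, mean = 0, sd = 1)       # draws from g
w <- post_dens(theta) / g_dens(theta)     # importance ratios pi / g

E_hat   <- mean(theta * w)                # estimate of E(theta | y)
Var_hat <- mean((theta - E_hat)^2 * w)    # estimate of Var(theta | y)
E_hat; Var_hat                            # should be close to 1 and 0.25
```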

Importance Sampling Simulation

Setup

Consider a simple linear regression model as follows:

$$y_i=\beta_0+\beta_1x_i+\varepsilon_i,$$ where $\varepsilon_i\sim N(0,1)$, $x_i\sim N(0,1)$, $\beta_0=1$, $\beta_1=2$, for $i=1,\dots,100$.

We already know that if $\beta\sim N_p(0,\sigma^2\nu I_p)$, then the distribution of $\beta|y$ is Eq. (2). Assume $\sigma=1$ is known and consider the known prior pdf $\pi(\beta)=\phi(\beta;(0,0)^T,10I_2)$, hence in this case $\nu=10$. Our simulation can be broken down into the following 3 steps:

  1. Find the exact values of $E(\beta_0|y)$, $E(\beta_1|y)$, $V(\beta_0|y)$ and $V(\beta_1|y)$.

  2. Use the MC method to approximate the results above with sample sizes of 100, 1000 and 5000.

  3. Use the importance sampling method to approximate the same quantities with the same sample sizes as in (2). The known proposal pdfs of the parameters are $\phi(\beta_0|1,0.5)$ and $\phi(\beta_1|2,0.5)$.

Simulation Results

We use R software to carry out this simulation.

  1. According to Eq. (2), compute $E(\beta|y)=\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}X^Ty$ and $\mathrm{Var}(\beta|y)=\sigma^2\left(X^TX+\frac{1}{\nu}I_p\right)^{-1}$, where $\sigma=1$ and $\nu=10$. We get the following results:

| $E(\beta_0)$ | $E(\beta_1)$ | $\mathrm{Var}(\beta_0)$ | $\mathrm{Var}(\beta_1)$ |
|---|---|---|---|
| 1.108 | 1.997 | 0.01 | 0.011 |
  2. Sample directly from the normal distribution of Eq. (2) with sample sizes of 100, 1000 and 5000, then compute the mean and variance of the samples. We get the following results:

| $M$ | $E(\beta_0)$ | $E(\beta_1)$ | $\mathrm{Var}(\beta_0)$ | $\mathrm{Var}(\beta_1)$ |
|---|---|---|---|---|
| 100 | 1.104 | 2.002 | 0.009 | 0.012 |
| 1000 | 1.107 | 1.993 | 0.011 | 0.012 |
| 5000 | 1.108 | 1.996 | 0.010 | 0.011 |
  3. Sample $\beta_0$ and $\beta_1$ directly from $\phi(\beta_0|1,0.5)$ and $\phi(\beta_1|2,0.5)$, respectively, with sample sizes of 100, 1000 and 5000. Then compute $h(\theta_i)$ and $h^*(\theta_i)$, and take their means to obtain estimates of the posterior expectation and variance. The final results are as follows:

| $M$ | $E(\beta_0)$ | $E(\beta_1)$ | $\mathrm{Var}(\beta_0)$ | $\mathrm{Var}(\beta_1)$ |
|---|---|---|---|---|
| 100 | 1.0500 | 2.0623 | 0.0175 | 0.0127 |
| 1000 | 1.1323 | 1.9521 | 0.0106 | 0.0127 |
| 5000 | 1.1269 | 2.0478 | 0.0107 | 0.0137 |

$\sigma^2$ is unknown

Suppose $\sigma^2$ is unknown; then we cannot use $\pi(\beta|y,\sigma^2)$ directly.

But we know $\pi(\beta,\sigma^2|y)=\pi(\beta|y,\sigma^2)\pi(\sigma^2|y)$, and we already know

$$\beta|y,\sigma^2\sim N_p\left(\left(X^TX+\tfrac{1}{\nu}I_p\right)^{-1}X^Ty,\ \left(X^TX+\tfrac{1}{\nu}I_p\right)^{-1}\sigma^2\right).$$

So
$$\begin{aligned}
\pi(\sigma^2|y)&=\int\pi(\sigma^2,\beta|y)d\beta\propto\int f(y|\sigma^2,\beta)\pi(\beta|\sigma^2)\pi(\sigma^2)d\beta=\int f(y|\sigma^2,\beta)\pi(\beta|\sigma^2)d\beta\times\pi(\sigma^2)\\
&\propto\int(\sigma^2)^{-\frac{n}{2}}\exp\left[-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right](\sigma^2)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\beta^T\beta\right)d\beta\times\pi(\sigma^2)\\
&\propto\int\exp\left[-\frac{1}{2\sigma^2}\left(y^Ty-2\beta^TX^Ty+\beta^TX^TX\beta+\frac{1}{\nu}\beta^T\beta\right)\right]d\beta\ \left((\sigma^2)^{-\frac{1}{2}(n+p)}\pi(\sigma^2)\right)\\
&=\int\exp\left[-\frac{1}{2\sigma^2}\left(\beta^T\left(X^TX+\tfrac{1}{\nu}I_p\right)\beta-2\beta^T\left(X^TX+\tfrac{1}{\nu}I_p\right)\left(X^TX+\tfrac{1}{\nu}I_p\right)^{-1}X^Ty+y^Ty\right)\right]d\beta\ \left((\sigma^2)^{-\frac{1}{2}(n+p)}\pi(\sigma^2)\right)\\
&=\int\exp\left[-\frac{1}{2\sigma^2}\left((\beta-\tilde\beta)^TA(\beta-\tilde\beta)+y^Ty-\tilde\beta^TA\tilde\beta\right)\right]d\beta\ \left((\sigma^2)^{-\frac{1}{2}(n+p)}\pi(\sigma^2)\right)\\
&=\int\exp\left[-\frac{1}{2\sigma^2}(\beta-\tilde\beta)^TA(\beta-\tilde\beta)\right]d\beta\times\exp\left[-\frac{1}{2\sigma^2}\left(y^Ty-y^TXA^{-1}X^Ty\right)\right]\left((\sigma^2)^{-\frac{1}{2}(n+p)}\pi(\sigma^2)\right)\\
&=(2\pi)^{\frac{p}{2}}|\sigma^2A^{-1}|^{\frac{1}{2}}\exp\left[-\frac{1}{2\sigma^2}y^T(I_n-XA^{-1}X^T)y\right]\left((\sigma^2)^{-\frac{1}{2}(n+p)}\pi(\sigma^2)\right)\\
&\propto(\sigma^2)^{\frac{p}{2}-\frac{1}{2}(n+p)}(\sigma^2)^{-a_0-1}\exp\left[-\frac{1}{\sigma^2}\left(b_0+\frac{1}{2}y^T(I_n-XA^{-1}X^T)y\right)\right]\\
&=(\sigma^2)^{-\left(\frac{n}{2}+a_0\right)-1}\exp\left[-\frac{1}{\sigma^2}\left(b_0+\frac{1}{2}y^T(I_n-XA^{-1}X^T)y\right)\right]=(\sigma^2)^{-a-1}\exp\left(-\frac{b}{\sigma^2}\right),
\end{aligned}$$
where $A=X^TX+\frac{1}{\nu}I_p$, $\tilde\beta=A^{-1}X^Ty$, $a=\frac{1}{2}n+a_0$, and $b=b_0+\frac{1}{2}y^T(I_n-XA^{-1}X^T)y$.

This is proportional to the pdf of $IG(a,b)$. Hence we have $$E(\beta|y)=\int\beta\pi(\beta|y)d\beta=\int\beta\left[\int\pi(\beta,\sigma^2|y)d\sigma^2\right]d\beta=\int\beta\left[\int\pi(\beta|y,\sigma^2)\pi(\sigma^2|y)d\sigma^2\right]d\beta.$$ Similarly, $$\mathrm{Var}(\beta|y)=\int(\beta-E(\beta|y))(\beta-E(\beta|y))^T\pi(\beta|y)d\beta=\int(\beta-E(\beta|y))(\beta-E(\beta|y))^T\left[\int\pi(\beta|y,\sigma^2)\pi(\sigma^2|y)d\sigma^2\right]d\beta.$$

Plug-in Sampling Method

We can estimate the mean of $\beta$ by computing $E(\beta|y)\approx\sum_{m=1}^{M}\beta^{(m)}/M$, where the $\beta^{(m)}$ are generated as follows (a sketch in R is given after this list):

    1. Generate $\sigma^{2(m)}$ from $\pi(\sigma^2|y)$, where $\sigma^2|y\sim IG(a,b)$.
    2. Given $\sigma^{2(m)}$, generate $\beta^{(m)}$ from $\pi(\beta|y,\sigma^{2(m)})$, which is the normal distribution in Eq. (2) with $\sigma^2$ replaced by $\sigma^{2(m)}$.

Simulation Results

We use the same setup as in the Section Importance Sampling Simulation. The final results are as follows:
| $M$ | $E(\beta_0)$ | $E(\beta_1)$ | $\mathrm{Var}(\beta_0)$ | $\mathrm{Var}(\beta_1)$ |
|---|---|---|---|---|
| 100 | 1.1099 | 1.9960 | 0.0073 | 0.0107 |
| 1000 | 1.1049 | 2.0004 | 0.0079 | 0.0084 |
| 5000 | 1.1081 | 1.9966 | 0.0081 | 0.0088 |

Markov Chain Monte Carlo Simulation

Gibbs Sampling

Suppose $\sigma^2$ is unknown. We know $$\sigma^2|y,\beta_0,\beta_1\sim IG(a^*,b^*)\ \text{(with $a^*,b^*$ as derived above)},\qquad \beta|y,\sigma^2\sim N_p\left(\left(X^TX+\tfrac{1}{\nu}I_p\right)^{-1}X^Ty,\ \left(X^TX+\tfrac{1}{\nu}I_p\right)^{-1}\sigma^2\right).$$ Hence we can get the conditional distribution of each parameter: $$\beta_0|\beta_1,y,\sigma^2\sim N\left(\mu_0+\frac{\sigma_0}{\sigma_1}\rho(\beta_1-\mu_1),\ (1-\rho^2)\sigma_0^2\right),$$ $$\beta_1|\beta_0,y,\sigma^2\sim N\left(\mu_1+\frac{\sigma_1}{\sigma_0}\rho(\beta_0-\mu_0),\ (1-\rho^2)\sigma_1^2\right),$$ where, writing $A=X^TX+\frac{1}{\nu}I_2$, $\sigma_0^2=[1,0]\,\sigma^2A^{-1}[1,0]^T$, $\sigma_1^2=[0,1]\,\sigma^2A^{-1}[0,1]^T$, $\rho=[1,0]\,\sigma^2A^{-1}[0,1]^T/(\sigma_0\sigma_1)$, $\mu_0=[1,0]A^{-1}X^Ty$ and $\mu_1=[0,1]A^{-1}X^Ty$.

To generate $\beta^{(m)}$ from $\pi(\beta|y)$, we can use an MCMC method: Gibbs sampling. Gibbs sampling, one of the most popular MCMC methods, can be implemented as follows:

Set the initial value $(\beta_0^{(0)},\beta_1^{(0)},\sigma^{2(0)})$. Then iterate the following steps for $t=0,1,2,\dots$

Step 1: Generate $\sigma^{2(t+1)}$ from $\pi(\sigma^2|y,\beta_0^{(t)},\beta_1^{(t)})$.

Step 2: Generate $\beta_0^{(t+1)}$ from $\pi(\beta_0|y,\beta_1^{(t)},\sigma^{2(t+1)})$.

Step 3: Generate $\beta_1^{(t+1)}$ from $\pi(\beta_1|y,\beta_0^{(t+1)},\sigma^{2(t+1)})$.

We set the total sample size to 5000, with a burn-in period of 2000. A sketch of this sampler in R is given below.
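This sketch follows the steps above for the simple linear regression example; the hyperparameters and the simulated data are illustrative assumptions.

```r
# Gibbs sampler for (beta0, beta1, sigma^2) under illustrative settings
set.seed(7)
n <- 100; x <- rnorm(n); X <- cbind(1, x); p <- 2
y <- X %*% c(1, 2) + rnorm(n)
a0 <- 1; b0 <- 1; nu <- 10
n_iter <- 5000; burn <- 2000

A_inv <- solve(t(X) %*% X + diag(p) / nu)
mu <- drop(A_inv %*% t(X) %*% y)          # posterior mean of beta given sigma^2

beta <- c(0, 0); draws <- matrix(NA, n_iter, 3)
for (it in seq_len(n_iter)) {
  # sigma^2 | y, beta ~ IG((n + p)/2 + a0, b0 + ||y - X beta||^2/2 + ||beta||^2/(2 nu))
  b <- b0 + 0.5 * sum((y - X %*% beta)^2) + sum(beta^2) / (2 * nu)
  sigma2 <- 1 / rgamma(1, shape = (n + p) / 2 + a0, rate = b)
  S <- sigma2 * A_inv                     # covariance of beta | y, sigma^2
  # beta0 | beta1, y, sigma^2 (bivariate-normal conditional)
  beta[1] <- rnorm(1, mu[1] + S[1, 2] / S[2, 2] * (beta[2] - mu[2]),
                   sqrt(S[1, 1] - S[1, 2]^2 / S[2, 2]))
  # beta1 | beta0, y, sigma^2
  beta[2] <- rnorm(1, mu[2] + S[1, 2] / S[1, 1] * (beta[1] - mu[1]),
                   sqrt(S[2, 2] - S[1, 2]^2 / S[1, 1]))
  draws[it, ] <- c(beta, sigma2)
}
colMeans(draws[(burn + 1):n_iter, 1:2])   # posterior mean estimates of beta0, beta1
```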

The final result is as follows:

| $M$ | $E(\beta_0)$ | $E(\beta_1)$ | $\mathrm{Var}(\beta_0)$ | $\mathrm{Var}(\beta_1)$ |
|---|---|---|---|---|
| 5000 | 1.1096 | 1.9955 | 0.0082 | 0.009 |

The diagnostic plots used to assess the convergence of the sampler are as follows:


Figure 1: $\beta_0$; $\beta_1$; the marginals and the joint distribution

Bayesian vs. Frequentist Case Study

Multivariate Normal Conditional Distribution

  • The conditional posterior of $\beta_j$ is obtained as follows:
$$\begin{aligned}
\pi(\beta_j|\beta_{-j},\sigma^2,y)&\propto\pi(\beta_j,\beta_{-j}|\sigma^2,y)=\pi(\beta|\sigma^2,y)\propto f(y|\beta,\sigma^2)\pi(\beta|\sigma^2)\\
&\propto\exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right)\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right)\\
&\propto\exp\left[-\frac{1}{2\sigma^2}\left((y-X_{-j}\beta_{-j}-x_j\beta_j)^T(y-X_{-j}\beta_{-j}-x_j\beta_j)+\frac{1}{\nu}\beta_j^2\right)\right]\\
&\propto\exp\left[-\frac{1}{2\sigma^2}\left(-2\beta_jx_j^Ty+2\beta_jx_j^TX_{-j}\beta_{-j}+\beta_j^2x_j^Tx_j+\frac{1}{\nu}\beta_j^2\right)\right]\\
&=\exp\left[-\frac{1}{2\sigma^2}\left(\beta_j^2\left(x_j^Tx_j+\tfrac{1}{\nu}\right)-2\beta_jx_j^T(y-X_{-j}\beta_{-j})\right)\right]\\
&\propto\exp\left[-\frac{1}{2\sigma^2}\left(x_j^Tx_j+\tfrac{1}{\nu}\right)\left(\beta_j-\tilde\beta_j\right)^2\right],
\end{aligned}$$
where $\tilde\beta_j=\left(x_j^Tx_j+\frac{1}{\nu}\right)^{-1}x_j^T(y-X_{-j}\beta_{-j})$, $X_{-j}$ is the submatrix of $X$ without the $j$th column, and $\beta_{-j}$ is the subvector of $\beta$ without the $j$th element. Hence

$$\beta_j|\beta_{-j},\sigma^2,y\sim N\left(\left(x_j^Tx_j+\tfrac{1}{\nu}\right)^{-1}x_j^T(y-X_{-j}\beta_{-j}),\ \sigma^2\left[x_j^Tx_j+\tfrac{1}{\nu}\right]^{-1}\right).$$
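A sketch of a single componentwise update based on this conditional; the function name is my own, and $X$, $y$, $\nu$, and the current $\sigma^2$ are assumed to be available.

```r
# One componentwise Gibbs update: draw beta_j | beta_{-j}, sigma^2, y
update_beta_j <- function(j, beta, X, y, sigma2, nu) {
  xj <- X[, j, drop = FALSE]
  r  <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual y - X_{-j} beta_{-j}
  prec   <- drop(t(xj) %*% xj) + 1 / nu          # x_j'x_j + 1/nu
  mean_j <- drop(t(xj) %*% r) / prec
  rnorm(1, mean = mean_j, sd = sqrt(sigma2 / prec))
}
```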

Setup

  • Consider

$$f(y|\beta,\sigma^2)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right),\qquad \pi(\beta|\sigma^2)=(2\pi\sigma^2\nu)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right).$$

Generate 30 samples from $$y=1+1\times x_1+2\times x_2+0\times x_3+0\times x_4+10\times x_5+\epsilon,$$ where $\epsilon\sim N(0,2)$, $x_i\sim N(0,1)$, and $\beta=(1,1,2,0,0,10)^T$.

  1. use the frequentist method to estimate $\beta$

  2. use Bayesian MC

  3. use Bayesian MCMC

Computation Algorithm

  • In the frequentist way, that is, the OLS method, the estimate of $\beta$ is $\hat\beta=(X^TX)^{-1}X^Ty$. The mean squared error is $\|\hat\beta-\beta\|^2/6$.

  • In Bayesian MC way, we assume that

$$\sigma^2\sim IG(a_0,b_0),$$

$$\beta|\sigma^2\sim N(0,\sigma^2\nu I_p).$$

Then we have proved that

$\sigma^2|y\sim IG(a,b)$, where $a=\frac{n}{2}+a_0$ and $b=b_0+\frac{1}{2}y^T(I_n-XA^{-1}X^T)y$;

$\beta|y,\sigma^2\sim N(A^{-1}X^Ty,A^{-1}\sigma^2)$, where $A=X^TX+\frac{1}{\nu}I_p$.

Hence the $\beta^{(m)}$ are generated by the plug-in sampling method described above.

Finally, we estimate $\hat\beta$ by computing $E(\beta|y)\approx\sum_{m=1}^{M}\beta^{(m)}/M$. The mean squared error is $\|\hat\beta-\beta\|^2/6$.

  • In the Bayesian MCMC way, we have proved that

    1. $\sigma^2|y,\beta\sim IG(a,b)$, where $a=\frac{n+p}{2}+a_0$ and $b=b_0+\frac{1}{2}\|y-X\beta\|^2+\frac{1}{2\nu}\|\beta\|^2$;
    2. $\beta|y,\sigma^2\sim N(A^{-1}X^Ty,A^{-1}\sigma^2)$, where $A=X^TX+\frac{1}{\nu}I_p$;
    3. $\beta_j|\beta_{-j},\sigma^2,y\sim N\left(\left(x_j^Tx_j+\frac{1}{\nu}\right)^{-1}x_j^T(y-X_{-j}\beta_{-j}),\ \sigma^2\left[x_j^Tx_j+\frac{1}{\nu}\right]^{-1}\right)$.

Then, we use the Gibbs sampling method to get $\beta^{(m)}=(\beta_0^{(m)},\beta_1^{(m)},\dots,\beta_5^{(m)})$. Set the initial values $(\beta_0^{(0)},\beta_1^{(0)},\dots,\beta_5^{(0)},\sigma^{2(0)})$. Then iterate the following steps for $t=0,1,2,\dots$:

    1. Generate $\sigma^{2(t+1)}$ from $\pi(\sigma^2|y,\beta_0^{(t)},\beta_1^{(t)},\dots,\beta_5^{(t)})$.
    2. Generate $\beta_0^{(t+1)}$ from $\pi(\beta_0|y,\beta_{-0}^{(t)},\sigma^{2(t+1)})$.
    3. Update $\beta^{(t)}$ with $\beta^{(t)}=(\beta_0^{(t+1)},\beta_1^{(t)},\dots,\beta_5^{(t)})$.
    4. Generate $\beta_1^{(t+1)}$ from $\pi(\beta_1|y,\beta_{-1}^{(t)},\sigma^{2(t+1)})$.
    5. Update $\beta^{(t)}$ with $\beta^{(t)}=(\beta_0^{(t+1)},\beta_1^{(t+1)},\beta_2^{(t)},\dots,\beta_5^{(t)})$ and, similarly, generate $\beta_2^{(t+1)},\dots,\beta_5^{(t+1)}$, updating $\beta^{(t)}$ after generating each $\beta_i^{(t+1)}$,

where $\beta_{-i}^{(t)}$ is the subvector of $\beta^{(t)}$ without the $i$th element.

We set the total sample size to 5000, with a burn-in period of 2000.

Finally, we estimate $\hat\beta$ by computing $E(\beta|y)\approx\sum_{m=2001}^{M}\beta^{(m)}/(M-2000)$. The mean squared error is $\|\hat\beta-\beta\|^2/6$.

Simulation Results

Bayesian vs. Frequentist

| Method | $\beta_0$ | $\beta_1$ | $\beta_2$ | $\beta_3$ | $\beta_4$ | $\beta_5$ | MSE |
|---|---|---|---|---|---|---|---|
| Frequentist | 1.194 | 0.991 | 2.080 | -0.082 | -0.051 | 9.608 | 0.035 |
| Bayesian MC | 1.195 | 0.972 | 2.084 | -0.090 | -0.068 | 9.570 | 0.041 |
| Bayesian MCMC | 1.187 | 0.986 | 2.071 | -0.085 | -0.058 | 9.581 | 0.038 |

Model Selection


Suppose we have $k$ candidate models, $\mathcal{M}=\{M_1,M_2,\dots,M_k\}$. A Bayesian way to select the best model is to find the model that maximizes the posterior model probability given the data, that is,

$$\hat M=\arg\max_{M\in\mathcal{M}}\pi(M|y),\quad\text{where}\quad\pi(M|y)=\frac{f(y|M)\pi(M)}{\sum_{\tilde M\in\mathcal{M}}f(y|\tilde M)\pi(\tilde M)},$$ in which the denominator is free of $M$. Note that the prior is $\pi(M)=1/k$. Then we have $$\pi(M|y)=\frac{f(y|M)}{\sum_{\tilde M\in\mathcal{M}}f(y|\tilde M)}.$$ This leads to $$\hat M=\arg\max_{M\in\mathcal{M}}f(y|M),\quad\text{where}\quad f(y|M)=\int f(y|\theta_M,M)\pi(\theta_M)d\theta_M,$$ and $\theta_M$ is the parameter under model $M$. We call $f(y|M)$ the marginal likelihood under model $M$, $f(y|\theta_M,M)$ the likelihood under model $M$, and $\pi(\theta_M)$ the prior under model $M$. Under $M$,
$$\begin{aligned}
f(y|M)&=\int f(y|\theta,M)\pi(\theta|M)d\theta=\int\int f(y|\beta,\sigma^2,M)\pi(\beta|\sigma^2,M)\pi(\sigma^2|M)d\beta d\sigma^2\\
&=\int\int(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left[-\frac{1}{2\sigma^2}(y-X\beta_M)^T(y-X\beta_M)\right]\times(2\pi\sigma^2\nu)^{-\frac{p_M}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\beta_M^T\beta_M\right)\\
&\qquad\times\frac{b_1^{a_1}}{\Gamma(a_1)}(\sigma^2)^{-(a_1+1)}\exp\left(-\frac{b_1}{\sigma^2}\right)d\beta_Md\sigma^2\\
&=\frac{b_1^{a_1}}{\Gamma(a_1)}\nu^{-\frac{p_M}{2}}(2\pi)^{-\frac{n+p_M}{2}}\int\int(\sigma^2)^{-\frac{n+p_M+2a_1+2}{2}}\exp\left[-\frac{1}{2\sigma^2}\left(y^Ty-2\beta_M^TX^Ty+\beta_M^TX^TX\beta_M+\frac{1}{\nu}\beta_M^T\beta_M+2b_1\right)\right]d\beta_Md\sigma^2.
\end{aligned}$$
Note that
$$\begin{aligned}
&y^Ty-2\beta_M^TX^Ty+\beta_M^TX^TX\beta_M+\frac{1}{\nu}\beta_M^T\beta_M+2b_1\\
&=\beta_M^T\left(X^TX+\tfrac{1}{\nu}I_{p_M}\right)\beta_M-2\beta_M^T\left(X^TX+\tfrac{1}{\nu}I_{p_M}\right)\left(X^TX+\tfrac{1}{\nu}I_{p_M}\right)^{-1}X^Ty+y^Ty+2b_1\\
&=\beta_M^TA\beta_M-2\beta_M^TA\tilde\beta_M+\tilde\beta_M^TA\tilde\beta_M-\tilde\beta_M^TA\tilde\beta_M+y^Ty+2b_1\\
&=(\beta_M-\tilde\beta_M)^TA(\beta_M-\tilde\beta_M)+y^Ty-y^TXA^{-1}X^Ty+2b_1\\
&=(\beta_M-\tilde\beta_M)^TA(\beta_M-\tilde\beta_M)+y^T(I_n-XA^{-1}X^T)y+2b_1.
\end{aligned}$$
Therefore,
$$\begin{aligned}
f(y|M)&=\frac{b_1^{a_1}}{\Gamma(a_1)}\nu^{-\frac{p_M}{2}}(2\pi)^{-\frac{n+p_M}{2}}\int(\sigma^2)^{-\frac{n+p_M+2a_1+2}{2}}(2\pi\sigma^2)^{\frac{p_M}{2}}|A|^{-\frac{1}{2}}\exp\left[-\frac{1}{2\sigma^2}\left(y^T(I_n-XA^{-1}X^T)y+2b_1\right)\right]d\sigma^2\\
&=\frac{b_1^{a_1}}{\Gamma(a_1)}\nu^{-\frac{p_M}{2}}(2\pi)^{-\frac{n}{2}}|A|^{-\frac{1}{2}}\int(\sigma^2)^{-(a_M+1)}\exp\left(-\frac{b_M}{\sigma^2}\right)d\sigma^2\\
&=(2\pi)^{-\frac{n}{2}}\frac{\Gamma(a_M)b_M^{-a_M}b_1^{a_1}}{\Gamma(a_1)}\nu^{-\frac{p_M}{2}}|A|^{-\frac{1}{2}},
\end{aligned}$$
where $$A=X^TX+\tfrac{1}{\nu}I_{p_M},\quad\tilde\beta_M=A^{-1}X^Ty,\quad a_M=\frac{n}{2}+a_1,\quad b_M=\frac{1}{2}y^T(I_n-XA^{-1}X^T)y+b_1.$$ (Integrating over $\beta_M$ contributes the factor $(2\pi\sigma^2)^{p_M/2}|A|^{-1/2}$, which is why the shape parameter reduces to $a_M=\frac{n}{2}+a_1$.) By a Laplace approximation, $-2\log f(y|M)$ is equivalent to BIC for large $n$.
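A sketch of the closed-form log marginal likelihood derived above, computed on the log scale for numerical stability; the function name is my own, the hyperparameters $a_1$, $b_1$, $\nu$ follow the notation in the text, and this is one possible implementation rather than the author's code.

```r
# log f(y | M) for the conjugate normal / inverse-gamma prior derived above
log_marginal <- function(X, y, nu, a1, b1) {
  n <- length(y); pM <- ncol(X)
  A  <- t(X) %*% X + diag(pM) / nu
  bM <- b1 + 0.5 * drop(t(y) %*% y - t(y) %*% X %*% solve(A, t(X) %*% y))
  aM <- n / 2 + a1
  -n / 2 * log(2 * pi) + lgamma(aM) - aM * log(bM) +
    a1 * log(b1) - lgamma(a1) - pM / 2 * log(nu) -
    0.5 * as.numeric(determinant(A)$modulus)   # -0.5 * log|A|
}
```

Model selection then amounts to calling this function on the design matrix of each candidate model and picking the largest value.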

MCMC

Minghui Chen (IWDE)

The goal is to compute $m(y)=\int f(y|\theta)\pi(\theta)d\theta$.

$$\pi(\theta|y)=\frac{f(y|\theta)\pi(\theta)}{\int f(y|\theta)\pi(\theta)d\theta}.$$ Let $g(\theta)$ be a density function of $\theta$.

Then,

$$\begin{aligned}
\int g(\theta)d\theta&=1\\
\int\frac{g(\theta)}{\pi(\theta|y)}\pi(\theta|y)d\theta&=1\\
\int\frac{g(\theta)\,m(y)}{f(y|\theta)\pi(\theta)}\pi(\theta|y)d\theta&=1\\
m(y)\int\frac{g(\theta)}{f(y|\theta)\pi(\theta)}\pi(\theta|y)d\theta&=1\\
m(y)&=\left[\int\frac{g(\theta)}{f(y|\theta)\pi(\theta)}\pi(\theta|y)d\theta\right]^{-1}=E\left[\frac{g(\theta)}{f(y|\theta)\pi(\theta)}\Big|y\right]^{-1}\approx\left[\frac{1}{M}\sum_{m=1}^{M}\frac{g(\theta^{(m)})}{f(y|\theta^{(m)})\pi(\theta^{(m)})}\right]^{-1},
\end{aligned}$$
where $\theta^{(m)}$ is drawn from $\pi(\theta|y)$.

Maximizing $m(y)$ over candidate models is thus equivalent to maximizing $E\left[\frac{g(\theta)}{f(y|\theta)\pi(\theta)}\Big|y\right]^{-1}$.

Now let's use Minghui Chen's method to compute and maximize $f(y|M_i)$.

$$f(y|M_i)=\int f(y|\theta,M_i)\pi(\theta|M_i)d\theta,\quad\text{where }\theta=(\beta_0,\beta_1,\sigma^2).$$ $$\pi(\theta|y,M_i)=\frac{f(y|\theta,M_i)\pi(\theta|M_i)}{\int f(y|\theta,M_i)\pi(\theta|M_i)d\theta}=\frac{f(y|\theta,M_i)\pi(\theta|M_i)}{f(y|M_i)}.$$ Let $g(\theta)$ be a density function of $\theta$.

Then

$$\begin{aligned}
\int g(\theta)d\theta&=1\\
\int\frac{g(\theta)}{\pi(\theta|y,M_i)}\pi(\theta|y,M_i)d\theta&=1\\
\int\frac{g(\theta)\,f(y|M_i)}{f(y|\theta,M_i)\pi(\theta|M_i)}\pi(\theta|y,M_i)d\theta&=1\\
f(y|M_i)&=\left[\int\frac{g(\theta)}{f(y|\theta,M_i)\pi(\theta|M_i)}\pi(\theta|y,M_i)d\theta\right]^{-1}\\
&=E\left[\frac{g(\theta)}{f(y|\theta,M_i)\pi(\theta|M_i)}\Big|y,M_i\right]^{-1}\approx\left[\frac{1}{N}\sum_{n=1}^{N}\frac{g(\theta^{(n)})}{f(y|\theta^{(n)},M_i)\pi(\theta^{(n)}|M_i)}\right]^{-1},
\end{aligned}$$
where $\theta^{(n)}$ is drawn from $\pi(\theta|y,M_i)$, $i=1,2,\dots,k$.

The issue here is the choice of $g(\theta)$ that minimizes the MC error.

Recall $\pi(\theta|y)\propto f(y|\theta)\pi(\theta)$. Let $g(\theta)$ be a normal density such that $E_g(\theta)=$ the posterior mean $\sum_{m=1}^{M}\theta^{(m)}/M=\bar\theta$ and $V_g(\theta)=$ the posterior variance matrix $\sum_{m=1}^{M}(\theta^{(m)}-\bar\theta)(\theta^{(m)}-\bar\theta)^T/M$, where $\theta^{(m)}$ is an MCMC sample from $\pi(\theta|y)$.

For a given model $M$, using MCMC, generate $\theta_M^{(k)}$ from $\pi(\theta_M|y,M)$, then compute $\bar\theta_M$ and $V_M$.

Define $g(\theta|M)=N(\theta;\bar\theta_M,V_M)$.

Compute

$$f(y|M)\approx\left[\frac{1}{K}\sum_{k=1}^{K}\frac{g(\theta^{(k)}|M)}{f(y|\theta^{(k)},M)\pi(\theta^{(k)}|M)}\right]^{-1}.\tag{4}$$
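A sketch of how Eq. (4) could be evaluated on the log scale; the function name and the convention that posterior draws are stored one per row are my own assumptions, and the arguments are log-density functions supplied by the user.

```r
# Estimate log f(y | M) via Eq. (4) from posterior draws (one draw per row),
# using a log-sum-exp trick for numerical stability.
log_marginal_chen <- function(theta_draws, log_g, log_lik, log_prior) {
  # log of each term g(theta) / (f(y | theta) * pi(theta))
  lt <- apply(theta_draws, 1, function(th) log_g(th) - log_lik(th) - log_prior(th))
  # log of the average of those terms, then negate (inverse on the original scale)
  -(max(lt) + log(mean(exp(lt - max(lt)))))
}
```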

Bayesian Simulation Case Study

Setup

  • Consider

$$f(y|\beta,\sigma^2)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|y-X\beta\|^2\right),\qquad \pi(\beta|\sigma^2)=(2\pi\sigma^2\nu)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2\nu}\|\beta\|^2\right).$$

Generate 30 samples from $$y=1+1\times x_1+2\times x_2+0\times x_3+0\times x_4+10\times x_5+\epsilon,$$ where $\epsilon\sim N(0,2)$, $x_i\sim N(0,1)$, and $\beta=(1,1,2,0,0,10)^T$.

Therefore, the true model, say $M_{\rm true}$, is $$y=1+1\times x_1+2\times x_2+10\times x_5+\epsilon.$$

    1. use the Bayesian MC method to find the true model;
    2. use the frequentist method to find the true model.

Our candidate models are the $2^5=32$ subsets of the five covariates (each with an intercept): $M_1: y=\beta_0+\epsilon$; $M_2: y=\beta_0+\beta_1x_1+\epsilon$; $\dots$; $M_{32}: y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5+\epsilon$.

  • For the Bayesian way, compute

$f(y|M_k)$ for $k=1,2,\dots,32$ and select the best model $\hat M$ such that $f(y|\hat M)\geq f(y|M_k)$ for all $k$. If $\hat M=M_{\rm true}$, then assign $Z_i=1$; otherwise $Z_i=0$.

Repeat the above procedure $R=500$ times, and compute $\frac{\sum_{i=1}^{R}Z_i}{R}\approx P(\text{selecting the true model with the Bayes method})$.

  • For the frequentist way, similarly, fit the 32 candidate linear models and compute AIC and BIC. If the model with the smallest AIC (or BIC) is the true model, then $W_i=1$; otherwise $W_i=0$. Repeat this 500 times and compute $\frac{\sum_{i=1}^{R}W_i}{R}$.

Consider sample sizes of 20, 50, 100 and 300 so that we can see how the sample size affects the probability of selecting the true model.

  • For MCMC, draw samples with the MCMC method, then compute Eq. (4) for the 32 candidate models and select the best model $\hat M$ such that $f(y|\hat M)\geq f(y|M_k)$. If $\hat M=M_{\rm true}$, then assign $Z_i=1$; otherwise $Z_i=0$.

Repeat the above procedure $R=500$ times, and compute $\frac{\sum_{i=1}^{R}Z_i}{R}\approx P(\text{selecting the true model with the MCMC method})$.

Result

| $n$ | MC | AIC | BIC | MCMC | Error |
|---|---|---|---|---|---|
| 20 | 0.502 | 0.560 | 0.656 | 0.508 | 59.401 |
| 50 | 0.894 | 0.668 | 0.882 | 0.880 | 78.224 |
| 100 | 0.998 | 0.676 | 0.916 | 0.926 | 93.176 |
| 300 | 0.998 | 0.712 | 0.978 | NA | NA |