EMVS: The EM Approach to Bayesian Variable Selection

Caleb Jin / 2017-12-15


Introduction

The EMVS (Rockova and George 2014) method is anchored by the EM algorithm and the original stochastic search variable selection (SSVS) procedure. It is a deterministic alternative to MCMC stochastic search and is ideally suited for high-dimensional $p>n$ settings. Furthermore, EMVS is able to effectively identify sparse high-probability models and find candidate models within a fraction of the time needed by SSVS. The algorithm also makes it possible to carry out dynamic posterior exploration for the identification of posterior modes.

The outline of this course project is as follows. In Section 2, I set up the original SSVS method, including the prior and posterior distribution for each parameter. In Section 3, I derive the EMVS method in detail, filling in steps that the authors omit. In Section 4, I perform several simulation case studies, taken from the paper and of my own design, and compare the results of SSVS and EMVS.

Stochastic Search Variable Selection (SSVS)

Model Settings

Consider the multiple linear regression model
$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon},\tag{1}$$
where $\mathbf{y}=(y_1,\ldots,y_n)'$ is the $n$-dimensional response vector, $\mathbf{X}=[\mathbf{x}_1,\ldots,\mathbf{x}_p]$ is the $n\times p$ design matrix with columns $\mathbf{x}_j$, $\boldsymbol{\beta}=(\beta_1,\beta_2,\ldots,\beta_p)'$, $\sigma^2$ is a scalar, and $\boldsymbol{\epsilon}\sim N_n(\mathbf{0},\sigma^2\mathbf{I}_n)$. Both $\sigma^2$ and $\boldsymbol{\beta}$ are considered to be unknown. We assume that $p>n$. From (1), the likelihood is given by $\mathbf{y}|\boldsymbol{\beta},\sigma^2\sim N(\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I}_n)$.

The cornerstone of SSVS is the "spike-and-slab" Gaussian mixture prior on $\boldsymbol{\beta}$,
$$\beta_j|\sigma^2,\gamma_j \overset{ind}{\sim} (1-\gamma_j)N(0,\sigma^2\nu_0)+\gamma_j N(0,\sigma^2\nu_1),$$
where
$$\gamma_j=\begin{cases}1, & \beta_j\neq 0\\ 0, & \beta_j= 0\end{cases},\qquad j=1,2,\ldots,p,\tag{2}$$
and $0\le\nu_0<\nu_1$.

Hence, $\boldsymbol{\beta}|\sigma^2,\boldsymbol{\gamma}\sim N_p(\mathbf{0},\mathbf{D}_{\sigma^2,\gamma})$, where $\mathbf{D}_{\sigma^2,\gamma}=\sigma^2\,\mathrm{diag}(\nu_{\gamma_1},\nu_{\gamma_2},\ldots,\nu_{\gamma_p})$.
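To see how the two mixture components behave, the short R sketch below (my own illustration, not from the paper) evaluates the spike and slab densities on a grid of $\beta_j$ values, using $\sigma^2=1$ and the values $\nu_0=0.5$, $\nu_1=1000$ that are used in the simulation study later in this post.

# Illustration only: spike vs. slab density for one coefficient (sigma^2 = 1).
sig2 <- 1
nu0  <- 0.5     # spike variance
nu1  <- 1000    # slab variance
beta.grid <- seq(-5, 5, length.out = 200)

spike <- dnorm(beta.grid, mean = 0, sd = sqrt(sig2 * nu0))  # N(0, sigma^2 * nu0)
slab  <- dnorm(beta.grid, mean = 0, sd = sqrt(sig2 * nu1))  # N(0, sigma^2 * nu1)

# The spike dominates near zero and the slab dominates in the tails, which is
# what lets gamma_j separate negligible coefficients from important ones.
plot(beta.grid, spike, type = "l", xlab = expression(beta[j]), ylab = "density")
lines(beta.grid, slab, lty = 2)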

For the prior on $\sigma^2$, I follow George and McCulloch (1997) and use an inverse gamma prior, but with different notation: $\sigma^2\sim IG\!\left(\frac{a}{2},\frac{b}{2}\right)$, where we will use $a=1$ and $b=1$. We also take $\sigma^2$ to be independent of $\boldsymbol{\gamma}$.

For the prior on $\boldsymbol{\gamma}$, I again follow George and McCulloch (1997) and use the Bernoulli form, but with different notation: $\gamma_j|\theta\overset{iid}{\sim}\mathrm{Ber}(\theta)$, $\theta\in[0,1]$. Hence,
$$\pi(\boldsymbol{\gamma}|\theta)=\theta^{\sum_{j=1}^p\gamma_j}(1-\theta)^{p-\sum_{j=1}^p\gamma_j}.\tag{3}$$
We assume $\theta\sim\mathrm{Beta}(c_1,c_2)$, where we will use $c_1=c_2=1$, which leads to the uniform distribution.

It is easy to show that the full conditionals are as follows:

    1. $\beta_j|\boldsymbol{\beta}_{-j},\sigma^2,\boldsymbol{\gamma},\mathbf{y}\sim N(\tilde\beta_j,\sigma^2\mu_j^{-1})$, where $\mu_j=\mathbf{x}_j'\mathbf{x}_j+\frac{1}{\nu_{\gamma_j}}$ and $\tilde\beta_j=\mu_j^{-1}\mathbf{x}_j'(\mathbf{y}-\mathbf{X}_{-j}\boldsymbol{\beta}_{-j})$.
    2. $\gamma_j|\boldsymbol{\beta},\sigma^2,\theta,\mathbf{y}\sim\mathrm{Ber}\!\left(\dfrac{\nu_1^{-\frac12}\exp\!\left(-\frac{\beta_j^2}{2\sigma^2\nu_1}\right)\theta}{\nu_0^{-\frac12}\exp\!\left(-\frac{\beta_j^2}{2\sigma^2\nu_0}\right)(1-\theta)+\nu_1^{-\frac12}\exp\!\left(-\frac{\beta_j^2}{2\sigma^2\nu_1}\right)\theta}\right)$.
    3. $\sigma^2|\boldsymbol{\beta},\boldsymbol{\gamma},\mathbf{y}\sim IG(a^*,b^*)$, where $a^*=\frac12(n+p+a)$ and $b^*=\frac12\left(\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2+\sum_{j=1}^p\frac{\beta_j^2}{\nu_{\gamma_j}}+b\right)$.
    4. For $\theta|\boldsymbol{\gamma}$, we have

$$\pi(\theta|\boldsymbol{\gamma})\propto\pi(\boldsymbol{\gamma}|\theta)\pi(\theta)\propto\left[\prod_{j=1}^p\theta^{\gamma_j}(1-\theta)^{1-\gamma_j}\right]\theta^{c_1-1}(1-\theta)^{c_2-1}\propto\theta^{\sum_{j=1}^p\gamma_j+c_1-1}(1-\theta)^{p-\sum_{j=1}^p\gamma_j+c_2-1}.$$
We therefore have $\theta|\boldsymbol{\gamma}\sim\mathrm{Beta}\!\left(\sum_{j=1}^p\gamma_j+c_1,\;p-\sum_{j=1}^p\gamma_j+c_2\right)$. A Gibbs-sampler sketch built from these four full conditionals is given below.
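These full conditionals immediately give a Gibbs sampler for SSVS. The R function below is a minimal sketch of such a sampler, written by me from the conditionals above rather than taken from either paper; the function name and interface are my own, it assumes a design matrix x, response y, and the hyperparameters used in this post, and it is not optimized for large $p$.

# Minimal SSVS Gibbs sampler sketch based on the four full conditionals above.
# Written as an illustration; not code from Rockova and George (2014).
ssvs.gibbs <- function(x, y, nu0, nu1, a = 1, b = 1, c1 = 1, c2 = 1, n.iter = 1000) {
  n <- nrow(x); p <- ncol(x)
  beta  <- rep(0, p)
  gamma <- rep(1, p)
  sig2  <- 1
  theta <- 0.5
  beta.draws  <- matrix(NA, n.iter, p)
  gamma.draws <- matrix(NA, n.iter, p)
  xtx <- colSums(x^2)                                  # x_j' x_j for each column
  for (iter in 1:n.iter) {
    ## 1. beta_j | beta_{-j}, sigma^2, gamma, y
    for (j in 1:p) {
      nu.j    <- ifelse(gamma[j] == 1, nu1, nu0)
      mu.j    <- xtx[j] + 1/nu.j
      resid   <- y - x[, -j, drop = FALSE] %*% beta[-j]
      b.tilde <- sum(x[, j] * resid)/mu.j
      beta[j] <- rnorm(1, mean = b.tilde, sd = sqrt(sig2/mu.j))
    }
    ## 2. gamma_j | beta, sigma^2, theta, y  (common normalizing constants cancel)
    a.j   <- dnorm(beta, 0, sqrt(sig2*nu1)) * theta
    b.j   <- dnorm(beta, 0, sqrt(sig2*nu0)) * (1 - theta)
    gamma <- rbinom(p, 1, a.j/(a.j + b.j))
    ## 3. sigma^2 | beta, gamma, y  ~  IG(a*, b*)
    nu.gamma <- ifelse(gamma == 1, nu1, nu0)
    a.star <- (n + p + a)/2
    b.star <- (sum((y - x %*% beta)^2) + sum(beta^2/nu.gamma) + b)/2
    sig2   <- 1/rgamma(1, shape = a.star, rate = b.star)
    ## 4. theta | gamma  ~  Beta(sum gamma + c1, p - sum gamma + c2)
    theta <- rbeta(1, sum(gamma) + c1, p - sum(gamma) + c2)
    beta.draws[iter, ]  <- beta
    gamma.draws[iter, ] <- gamma
  }
  list(beta = beta.draws, gamma = gamma.draws)
}

Even this simple sketch makes the computational issue clear: every sweep updates all $p$ coefficients one at a time, which is exactly the cost that motivates the deterministic EMVS alternative in the next section.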

EMVS

Geared toward finding posterior modes of the parameter posterior $\pi(\boldsymbol{\beta},\sigma^2,\theta|\mathbf{y})$ rather than simulating from the entire model posterior $\pi(\boldsymbol{\gamma}|\mathbf{y})$, the EM algorithm derived here offers potentially enormous computational savings over stochastic search alternatives, especially in problems with a large number $p$ of potential predictors.

EMVS alternates between two steps: an E-step, which computes a conditional expectation given the observed data and the current parameter estimates, and an M-step, which maximizes the expected complete-data log posterior with respect to $(\boldsymbol{\beta},\sigma^2,\theta)$.

Define $\boldsymbol{\phi}=(\boldsymbol{\beta},\sigma^2,\theta)$. The EM algorithm maximizes $\pi(\boldsymbol{\phi}|\mathbf{y})$ by iteratively maximizing the objective function $Q(\boldsymbol{\phi}|\boldsymbol{\phi}^{(t-1)},\mathbf{y})=\mathrm{E}_{\boldsymbol{\gamma}|\cdot}\left[\log\pi(\boldsymbol{\phi},\boldsymbol{\gamma}|\mathbf{y})\right]$, where $\mathrm{E}_{\boldsymbol{\gamma}|\cdot}$ denotes $\mathrm{E}_{\boldsymbol{\gamma}|\boldsymbol{\phi}^{(t-1)},\mathbf{y}}$.

Following the class notes, this can be written (up to an additive constant that does not involve $\boldsymbol{\phi}$) as
$$Q(\boldsymbol{\phi}|\boldsymbol{\phi}^{(t-1)},\mathbf{y})=\mathrm{E}_{\boldsymbol{\gamma}|\cdot}\left[\log\pi(\boldsymbol{\phi}|\boldsymbol{\gamma},\mathbf{y})\right]+C.\tag{4}$$
At the $t$th iteration, given $\boldsymbol{\phi}^{(t-1)}$, an E-step is first applied, which computes the expectation on the right-hand side of (4) to obtain $Q$. This is followed by an M-step, which maximizes $Q$ over $(\boldsymbol{\beta},\sigma^2,\theta)$ to obtain $(\boldsymbol{\beta}^{(t)},\sigma^{2(t)},\theta^{(t)})$.

$$\begin{aligned}
Q(\boldsymbol{\beta},\sigma^2,\theta|\boldsymbol{\phi}^{(t-1)},\mathbf{y})&=C+Q_1(\boldsymbol{\beta},\sigma^2|\boldsymbol{\phi}^{(t-1)},\mathbf{y})+Q_2(\theta|\boldsymbol{\phi}^{(t-1)},\mathbf{y})\\
&=C+\mathrm{E}_{\boldsymbol{\gamma}|\cdot}\left[\log\pi(\boldsymbol{\beta},\sigma^2|\boldsymbol{\gamma},\mathbf{y})\right]+\mathrm{E}_{\boldsymbol{\gamma}|\cdot}\left[\log\pi(\theta|\boldsymbol{\gamma},\mathbf{y})\right]
\end{aligned}$$

$$\begin{aligned}
Q_1(\boldsymbol{\beta},\sigma^2|\boldsymbol{\phi}^{(t-1)},\mathbf{y})&=\mathrm{E}_{\boldsymbol{\gamma}|\cdot}\left[\log\pi(\boldsymbol{\beta},\sigma^2|\boldsymbol{\gamma},\mathbf{y})\right]\\
&=-\frac12(n+p+a+2)\log(\sigma^2)-\frac{1}{2\sigma^2}\left(\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2+b\right)-\frac{1}{2\sigma^2}\sum_{j=1}^p\beta_j^2\,\mathrm{E}_{\boldsymbol{\gamma}|\cdot}\!\left[\frac{1}{\nu_{\gamma_j}}\right]+C
\end{aligned}$$

$$\begin{aligned}
\log\pi(\theta|\boldsymbol{\gamma},\mathbf{y})&=C+\left(\sum_{j=1}^p\gamma_j+c_1-1\right)\log\theta+\left(p-\sum_{j=1}^p\gamma_j+c_2-1\right)\log(1-\theta)\\
&=C+\log\frac{\theta}{1-\theta}\sum_{j=1}^p\gamma_j+(c_1-1)\log\theta+(p+c_2-1)\log(1-\theta)
\end{aligned}$$

$$Q_2(\theta|\boldsymbol{\phi}^{(t-1)},\mathbf{y})=\log\frac{\theta}{1-\theta}\sum_{j=1}^p\mathrm{E}_{\boldsymbol{\gamma}|\cdot}[\gamma_j]+(c_1-1)\log\theta+(p+c_2-1)\log(1-\theta)$$

The E-Step

The E-step proceeds by computing the conditional expectations $\mathrm{E}_{\boldsymbol{\gamma}|\cdot}\!\left[\frac{1}{\nu_{\gamma_j}}\right]$ and $\mathrm{E}_{\boldsymbol{\gamma}|\cdot}[\gamma_j]$, needed for $Q_1$ and $Q_2$, respectively.

    1. For $\mathrm{E}_{\boldsymbol{\gamma}|\cdot}[\gamma_j]$,

$$\pi(\gamma_j|\theta,\beta_j,\sigma^2,\mathbf{y})\propto\pi(\beta_j|\gamma_j,\sigma^2)\,p(\gamma_j|\theta)\propto\nu_{\gamma_j}^{-\frac12}\exp\!\left(-\frac{\beta_j^2}{2\sigma^2\nu_{\gamma_j}}\right)\theta^{\gamma_j}(1-\theta)^{1-\gamma_j}$$

Hence, $\pi(\gamma_j=1|\theta,\beta_j,\sigma^2,\mathbf{y})=C\,\nu_1^{-\frac12}\exp\!\left(-\frac{\beta_j^2}{2\sigma^2\nu_1}\right)\theta\equiv a_j$ and

$\pi(\gamma_j=0|\theta,\beta_j,\sigma^2,\mathbf{y})=C\,\nu_0^{-\frac12}\exp\!\left(-\frac{\beta_j^2}{2\sigma^2\nu_0}\right)(1-\theta)\equiv b_j$.

Hence,

$$\begin{aligned}
\mathrm{E}_{\boldsymbol{\gamma}|\cdot}[\gamma_j]&=\pi(\gamma_j=1|\theta,\beta_j,\sigma^2,\mathbf{y})=\frac{a_j}{a_j+b_j}\\
&=\frac{\nu_1^{-\frac12}\exp\!\left(-\frac{\beta_j^2}{2\sigma^2\nu_1}\right)\theta}{\nu_1^{-\frac12}\exp\!\left(-\frac{\beta_j^2}{2\sigma^2\nu_1}\right)\theta+\nu_0^{-\frac12}\exp\!\left(-\frac{\beta_j^2}{2\sigma^2\nu_0}\right)(1-\theta)}\equiv p_j^*,
\end{aligned}$$
where the right-hand side is evaluated at the current values $(\boldsymbol{\beta}^{(t-1)},\sigma^{2(t-1)},\theta^{(t-1)})$.

Hence,

$$Q_2(\theta|\boldsymbol{\phi}^{(t-1)},\mathbf{y})=\log\frac{\theta}{1-\theta}\sum_{j=1}^p p_j^*+(c_1-1)\log\theta+(p+c_2-1)\log(1-\theta)\tag{5}$$

    2. For $\mathrm{E}_{\boldsymbol{\gamma}|\cdot}\!\left[\frac{1}{\nu_{\gamma_j}}\right]$,
$$\mathrm{E}_{\boldsymbol{\gamma}|\cdot}\!\left[\frac{1}{\nu_{\gamma_j}}\right]=p(\gamma_j=1|\cdot)\frac{1}{\nu_1}+p(\gamma_j=0|\cdot)\frac{1}{\nu_0}=\frac{p_j^*}{\nu_1}+\frac{1-p_j^*}{\nu_0}\equiv d_j^*.$$
Hence,
$$\begin{aligned}
Q_1(\boldsymbol{\beta},\sigma^2|\boldsymbol{\phi}^{(t-1)},\mathbf{y})&=-\frac12(n+p+a+2)\log(\sigma^2)-\frac{1}{2\sigma^2}\left(\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2+b\right)-\frac{1}{2\sigma^2}\sum_{j=1}^p\beta_j^2 d_j^*\\
&=-\frac12(n+p+a+2)\log(\sigma^2)-\frac{1}{2\sigma^2}\left(\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2+b\right)-\frac{1}{2\sigma^2}\|\mathbf{D}^{*1/2}\boldsymbol{\beta}\|^2,
\end{aligned}\tag{6}$$
where $\mathbf{D}^*=\mathrm{diag}\{d_j^*\}_{j=1}^p$. A vectorized R sketch of these two expectations is given right after this list.
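In code, the E-step is just two vectorized lines. The helper below is a small sketch of it (the function name and interface are mine, not the paper's): it returns $p_j^*$ and $d_j^*$ given the current parameter values, and it is reused in the grid-search sketch of Section 4.

# E-step sketch: p.star_j = E[gamma_j | .] and d.star_j = E[1/nu_{gamma_j} | .]
# given the current (beta, sigma^2, theta). Helper name is my own.
emvs.estep <- function(beta, sig2, theta, nu0, nu1) {
  a.j <- dnorm(beta, 0, sd = sqrt(sig2 * nu1)) * theta        # slab component weight
  b.j <- dnorm(beta, 0, sd = sqrt(sig2 * nu0)) * (1 - theta)  # spike component weight
  p.star <- a.j / (a.j + b.j)                 # conditional inclusion probabilities
  d.star <- p.star / nu1 + (1 - p.star) / nu0 # E[1/nu_{gamma_j}]
  list(p.star = p.star, d.star = d.star)
}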

The M-Step

    1. Maximize $Q_1$

      The value of $\boldsymbol{\beta}$ that maximizes $Q_1$ in Eq. (6) is
$$\boldsymbol{\beta}^{(t)}=\arg\min_{\boldsymbol{\beta}\in\mathbb{R}^p}\left\{\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2+\|\mathbf{D}^{*1/2}\boldsymbol{\beta}\|^2\right\},$$
which is quickly obtained from the well-known solution to the ridge regression problem:
$$\boldsymbol{\beta}^{(t)}=(\mathbf{X}'\mathbf{X}+\mathbf{D}^*)^{-1}\mathbf{X}'\mathbf{y}.$$

      Given $\boldsymbol{\beta}^{(t)}$, setting $\frac{\partial Q_1}{\partial\sigma^2}=0$ and solving, we easily obtain
$$\sigma^{2(t)}=\frac{\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}^{(t)}\|^2+\|\mathbf{D}^{*1/2}\boldsymbol{\beta}^{(t)}\|^2+b}{n+p+a+2}.$$

    2. Maximize $Q_2$

$$\begin{aligned}
\frac{\partial Q_2}{\partial\theta}&=\frac{1}{\theta(1-\theta)}\sum_{j=1}^p p_j^*+\frac{c_1-1}{\theta}-\frac{p+c_2-1}{1-\theta}\\
&=\frac{\sum_{j=1}^p p_j^*-(c_1+c_2+p-2)\theta+c_1-1}{\theta(1-\theta)}\overset{\text{set}}{=}0.
\end{aligned}$$
Hence, we have
$$\theta^{(t)}=\frac{\sum_{j=1}^p p_j^*+c_1-1}{c_1+c_2+p-2}.$$
A matching R sketch of the M-step updates follows this list.
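The M-step is therefore available in closed form as well. The helper below is a sketch of these three updates (again my own function, with the hyperparameters $a=b=c_1=c_2=1$ of this post hard-coded); it consumes the output of emvs.estep() above.

# M-step sketch: closed-form updates for beta, sigma^2, and theta,
# with a = b = c1 = c2 = 1 as used throughout this post. Helper name is my own.
emvs.mstep <- function(x, y, p.star, d.star) {
  n <- nrow(x); p <- ncol(x)
  D.star <- diag(d.star)                              # D* = diag(d_j*)
  beta.t <- solve(t(x) %*% x + D.star, t(x) %*% y)    # ridge-type update (X'X + D*)^{-1} X'y
  rss    <- sum((y - x %*% beta.t)^2)                 # ||y - X beta||^2
  pen    <- sum(d.star * beta.t^2)                    # ||D*^{1/2} beta||^2
  sig2.t <- (rss + pen + 1) / (n + p + 1 + 2)         # (RSS + pen + b) / (n + p + a + 2)
  theta.t <- sum(p.star) / p                          # (sum p_j* + c1 - 1) / (c1 + c2 + p - 2)
  list(beta = as.vector(beta.t), sig2 = sig2.t, theta = theta.t)
}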

Simulation Case Study

Following the paper, I simulate a dataset consisting of $n=100$ observations and $p=1000$ predictors. Predictor values for each observation were sampled from $N_p(\mathbf{0},\boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}=(\rho_{ij})_{i,j=1}^p$ with $\rho_{ij}=0.6^{|i-j|}$. Response values were then generated according to the linear model $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$, where $\boldsymbol{\beta}=(3,2,1,0,0,\ldots,0)'$ and $\boldsymbol{\epsilon}\sim N_n(\mathbf{0},\sigma^2\mathbf{I}_n)$ with $\sigma^2=3$.

The R code is shown below:

library(mvtnorm)
library(ggplot2)
library(reshape2)
n <- 100
p <- 1000
beta <-  c(3,2,1,rep(0,p-3))
Sigma <- matrix(NA,p,p)
for (i in 1:p) {
  for (j in 1:p) {
    Sigma[i,j] = 0.6^(abs(i-j))
  }
}
set.seed(2144)
x <- matrix(rmvnorm(n,mean = rep(0,p), sigma = Sigma),nrow = n)
y <- x%*%beta + rnorm(n,0,sqrt(3))
## prior hyperparameters and initial values
MC <- 20               # number of EM iterations
nu0 <- 0.5             # spike variance
nu1 <- 1000            # slab variance
hat.theta <- rep(1, MC)
hat.theta[1] <- 0.5
hat.gamma <- rep(1, p)
hat.beta <- matrix(NA, nrow = MC, ncol = p)
hat.beta[1,] <- rep(1, p)    # beta^(0) = 1_p
hat.sig2 <- rep(NA, MC)
hat.sig2[1] <- 1             # sigma^2(0) = 1; alternative: mean((y - x %*% hat.beta[1,])^2)

for (i in 2:MC) {
  ## E-step: a and b are the unnormalized slab and spike weights a_j, b_j;
  ## p.star = E[gamma_j | .] and D.star = diag(E[1/nu_{gamma_j}]).
  a <- dnorm(hat.beta[i-1,],0,sd = sqrt(hat.sig2[i-1]*nu1))*(hat.theta[i-1])
  b <- dnorm(hat.beta[i-1,],0,sd = sqrt(hat.sig2[i-1]*nu0))*(1-hat.theta[i-1])
  p.star = a/(a+b)
  D.star <- diag((1-p.star)/nu0+p.star/nu1)
  D.star.5 <- diag(sqrt((1-p.star)/nu0+p.star/nu1))
  ## M-step: ridge-type update for beta, then closed-form sigma^2 and theta
  hat.beta[i,] <- solve(t(x)%*%x+D.star)%*%t(x)%*%y
  hat.sig2[i] <- (t(y-x%*%hat.beta[i,])%*%(y-x%*%hat.beta[i,]) +
                    t(D.star.5%*%hat.beta[i,])%*%(D.star.5%*%hat.beta[i,]) + 1)/(n+p+1+1)
  hat.theta[i] <- sum(p.star)/(p)
  # hat.theta[i] = 0.5   # uncomment to fix theta (non-adaptive Bernoulli prior, Figure 2)

  ##################################################
  ## uncomment if you want to see the convergence ##

  # plot(beta,hat.beta[i,], ylim = c(-0.5,3))
  # abline(h = 0, col = 4)
  # abline(a=0,b=1,lty=2)
  # print(i)
}
data1 <- data.frame(beta=beta,hat.beta=hat.beta[MC,])
fig1 <- ggplot(data1,aes(beta,hat.beta))+
  geom_point(color = 'red',size = 2,alpha = 0.7)+
  ylab(expression(hat(beta)))+
  xlab(expression(beta))+
  ylim(c(-0.3,3)) +
  ggtitle(label = 'MAP Estimate versus True Coefficients')+
  geom_abline(intercept = 0, slope = 1,lty = 2)+
  theme(axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12))
# print hat.theta and hat.sigma
print(round(c(hat.theta[MC],sqrt(hat.sig2[MC])),4))
## [1] 0.0031 0.0404

Beginning with an illustration of the EM algorithm from Section 3, I apply it to the simulated data using the spike-and-slab prior (2) with a single value $\nu_0=0.5$, $\nu_1=1000$, and $\theta\sim U(0,1)$. The starting values for the EM algorithm were set to $\sigma^{2(0)}=1$ and $\boldsymbol{\beta}^{(0)}=\mathbf{1}_p$. After only four iterations, the algorithm obtained the modal coefficient estimates $\hat{\boldsymbol{\beta}}$ depicted in Figure 1, which plots the estimated $\boldsymbol{\beta}$ against the true $\boldsymbol{\beta}$. In my run, the associated modal estimates $\hat\theta$ and $\hat\sigma$ were 0.003 and 0.040, respectively, which are very similar to the results reported in the paper.

For comparison, I applied the same formulation except with the Bernoulli prior (3) under fixed $\theta=0.5$ (Figure 2); in the code above this amounts to replacing the adaptive update of hat.theta inside the EM loop with the commented-out line hat.theta[i] = 0.5. Note the inferiority of the estimates near zero, due to the lack of adaptivity of the Bernoulli prior in determining the degree of underlying sparsity.


Figure 1: Modal estimates of the regression coefficients ($\theta\sim U(0,1)$)


Figure 2: Modal estimates of the regression coefficients (fixed $\theta=0.5$)

The effect of different $\nu_0$ and $\boldsymbol{\beta}^{(0)}$ values on variable selection

Rather than fixing $\nu_0$ at 0.5, I vary it to see its effect on variable selection. Imitating the author's approach, I consider the grid of $\nu_0$ values $V=\{0.01+k\times0.01:k=0,\ldots,50\}$, again with $\nu_1=1000$ fixed and $\boldsymbol{\beta}^{(0)}=\mathbf{1}_p$. Figure 3 shows the modal estimates of the regression coefficients obtained for each $\nu_0\in V$; a sketch of this grid search is given below. We find that only when $\nu_0>0.15$ do we obtain a good estimate of $\boldsymbol{\beta}$.
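The grid search can be written compactly by reusing the data x, y simulated above together with the emvs.estep() and emvs.mstep() helpers sketched in Section 3; the wrapper loop below is my own, not the author's original script.

# Sensitivity of the modal estimates to nu0 (sketch; assumes x, y, p from the
# simulation above and the emvs.estep()/emvs.mstep() helpers from Section 3).
nu0.grid <- 0.01 + 0.01 * (0:50)
nu1 <- 1000
beta.by.nu0 <- matrix(NA, length(nu0.grid), p)

for (k in seq_along(nu0.grid)) {
  beta.t <- rep(1, p); sig2.t <- 1; theta.t <- 0.5    # same starting values as before
  for (iter in 1:20) {                                # 20 EM iterations per nu0
    e <- emvs.estep(beta.t, sig2.t, theta.t, nu0.grid[k], nu1)
    m <- emvs.mstep(x, y, e$p.star, e$d.star)
    beta.t <- m$beta; sig2.t <- m$sig2; theta.t <- m$theta
  }
  beta.by.nu0[k, ] <- beta.t                          # row k: modal estimates for nu0.grid[k]
}

The same loop, with $\nu_0$ fixed at 0.5 and the starting value rep(c, p) varied over a grid of constants c, produces the initial-value study in Figure 4.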

I also consider the effect of different initial values of $\boldsymbol{\beta}$ on variable selection. Fixing $\nu_0=0.5$ and $\nu_1=1000$, I set $\boldsymbol{\beta}^{(0)}=\mathbf{c}$, where $\mathbf{c}$ is a $p$-dimensional vector with each entry $c_i\in\{-5+0.2\times k:k=0,\ldots,50\}$, $i=1,2,\ldots,p$. Figure 4 depicts the modal estimates of the regression coefficients for the different initial values $\boldsymbol{\beta}^{(0)}$. We find that only when $|\boldsymbol{\beta}^{(0)}|<2$ do we obtain a good estimate of $\boldsymbol{\beta}$.


Figure 3: Modal estimates of the regression coefficients for each $\nu_0\in V$


Figure 4: Modal estimates of the regression coefficients for different initial values $\boldsymbol{\beta}^{(0)}$

Reference

George, Edward I., and Robert E. McCulloch. 1997. "Approaches for Bayesian Variable Selection." Statistica Sinica 7 (2): 339–73. http://www.jstor.org/stable/24306083.

Rockova, Veronika, and Edward I. George. 2014. "EMVS: The EM Approach to Bayesian Variable Selection." Journal of the American Statistical Association 109 (June). https://doi.org/10.1080/01621459.2013.869223.