Approaches for Bayesian Variable Selection (SSVS)

Caleb Jin / 2017-04-17


Foreword

After reading the papers Approaches for Bayesian Variable Selection (George and McCulloch 1997) and Variable Selection via Gibbs Sampling (George and McCulloch 1993), I write down notes on the key ideas and the R code to implement them.

Model setting

Consider a high-dimensional linear regression model as follows: \[\begin{eqnarray} {\bf y}={\bf X}{\boldsymbol \beta}+ {\boldsymbol \epsilon} \tag{1} \end{eqnarray}\] where \({\bf y}=(y_1,y_2,\ldots,y_{n})^{{\top}}\) is the \(n\)-dimensional response vector, \({\bf X}=[{\bf x}_1,\ldots,{\bf x}_{p}]\) is the \(n\times p\) design matrix, and \({\boldsymbol \epsilon}\sim \mathcal{N}_{n}({\bf 0},\sigma^2{\bf I}_{n})\). Note that when \(p>n\), \({\bf X}\) does not have full column rank.

From (1), the likelihood function is given as \[ {\bf y}|{\boldsymbol \beta}, \sigma^2 \sim \mathcal{N}({\bf X}{\boldsymbol \beta}, \sigma^2{\bf I}_n). \] We define the prior as follows: \[ \beta_j|\sigma^2,\gamma_j \stackrel{ind}{\sim} (1-\gamma_j) \mathcal{N}(0, \sigma^2\nu_0) + \gamma_j\mathcal{N}(0,\sigma^2\nu_1), \] where \(\nu_0\) and \(\nu_1\) will be chosen to be small and large, respectively. Note that, given \({\boldsymbol \beta}\), the likelihood does not depend on \({\boldsymbol \gamma}=(\gamma_1,\ldots,\gamma_p)\). Assume \[ \sigma^2 \sim \mathcal{IG}(\frac{a}{2}, \frac{b}{2}), \] which is also independent of \({\boldsymbol \gamma}\). We consider \[ \gamma_j \stackrel{iid}{\sim} Ber(\omega). \]
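To see why this mixture is called a spike-and-slab prior, the following sketch (with illustrative values \(\sigma^2=1\), \(\nu_0=0.001\), \(\nu_1=1000\), matching the settings used in the R code below) overlays the two components: the spike concentrates \(\beta_j\) near zero when \(\gamma_j=0\), while the slab is nearly flat and lets \(\beta_j\) range widely when \(\gamma_j=1\).

sig2 <- 1; v0 <- 0.001; v1 <- 1000 # illustrative values
# spike: beta_j | gamma_j = 0 ~ N(0, sig2 * v0), essentially a point mass at 0
curve(dnorm(x, 0, sqrt(sig2 * v0)), from = -3, to = 3, n = 1001,
      ylab = "density", main = "spike (gamma_j = 0) vs slab (gamma_j = 1)")
# slab: beta_j | gamma_j = 1 ~ N(0, sig2 * v1), nearly flat over this range
curve(dnorm(x, 0, sqrt(sig2 * v1)), add = TRUE, col = 2)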

To make our model robust to the choice of \(\omega\), we assign the following prior to \(\omega\): \[\omega\sim \mathcal{B}(c_1,c_2),\] where we use \(c_1=c_2=1\), which reduces to the uniform distribution. Recall that the beta density satisfies \(\pi(\omega)\propto \omega^{c_1-1}(1-\omega)^{c_2-1}\).

It is easy to show that the full conditionals are as follows (detailed derivations are given in the Appendix):

\[\begin{eqnarray*} \beta_j|{\boldsymbol \beta}_{-j}, \sigma^2, {\boldsymbol \gamma}, {\bf y}&\sim& \mathcal{N}\left(\tilde \beta_j, \frac{\sigma^2}{\mu_j}\right),\quad \tilde \beta_j = \mu_j^{-1}{\bf x}_j^{{\top}}{\bf y}^*,\quad \mu_j = {\bf x}_j^{{\top}}{\bf x}_j + \frac{1}{\nu_{\gamma_j}},\quad {\bf y}^* = {\bf y}- {\bf X}_{-j}{\boldsymbol \beta}_{-j};\\ \gamma_j|{\boldsymbol \beta}, \sigma^2, \omega, {\bf y}&\sim& Ber\left(\frac{\nu_1^{-\frac{1}{2}}\exp\left(-\frac{\beta_j^2}{2\sigma^2\nu_1}\right)\omega}{\nu_0^{-\frac{1}{2}}\exp\left(-\frac{\beta_j^2}{2\sigma^2\nu_0}\right)(1-\omega) + \nu_1^{-\frac{1}{2}}\exp\left(-\frac{\beta_j^2}{2\sigma^2\nu_1}\right)\omega}\right);\\ \sigma^2|{\boldsymbol \beta}, {\boldsymbol \gamma}, {\bf y}&\sim& \mathcal{IG}(a^*, b^*),\quad a^* = \frac{1}{2}(n+p+a),\quad b^* = \frac{1}{2}\left(\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2 + \sum_{j=1}^{p}\frac{\beta_j^2}{\nu_{\gamma_j}} + b\right);\\ \omega|{\boldsymbol \gamma}&\sim& \mathcal{B}\left(\sum_{j=1}^p \gamma_j + c_1,\ p - \sum_{j=1}^p \gamma_j + c_2\right). \end{eqnarray*}\]

To speed up the sampler, each \(\beta_j\) is updated from its univariate conditional, which avoids inverting a \(p\times p\) matrix at every iteration, and the \(\gamma_j\), being conditionally independent given \({\boldsymbol \beta}\), \(\sigma^2\), and \(\omega\), are drawn all at once as a vector.

R code

library(invgamma)

set.seed(2017) # for reproducibility
p <- 100
n <- 100
e <- rnorm(n, mean = 0, sd = sqrt(2)) # error
X <- matrix(data = rnorm(n * p, 0, 1), nrow = n, ncol = p)
Beta <- c(1, 2, rep(0, p - 3), 3) # true coefficients; no intercept beta0
y <- X %*% Beta + e # true model
true.gamma <- as.numeric(Beta != 0)
#setup for initial values#####
hat.beta <- as.numeric(solve(t(X) %*% X + diag(1, p)) %*% t(X) %*% y) # ridge start; p-dim vector
hat.gamma <- rep(1, p)
hat.sig2 <- mean((y - X %*% hat.beta)^2)
w <- 0.5
#setup for priors ############
v0 <- 0.001 # spike variance
v1 <- 1000  # slab variance
v01 <- c(v0, v1)
a0 <- 1
b0 <- 1
c1 <- 1 # Beta prior on w; c1 = c2 = 1 is the uniform
c2 <- 1
###############################
burn.in <- 2000
MC.size <- burn.in + 3000
hat.BETA <- matrix(0, MC.size, p) # to store beta for each iteration
hat.Gamma <- matrix(0, MC.size, p) # to store gamma for each iteration
hat.Sig2 <- rep(0, MC.size) # to store the variance for each iteration
hat.W <- rep(0, MC.size) # to store w for each iteration
for (goh in 1:MC.size) {
  # Gibbs sampling
  # 1) beta_j | others, one coordinate at a time
  for (j in 1:p) {
    mu_j <- sum(X[, j]^2) + 1/v01[hat.gamma[j] + 1]
    y.star <- y - X[, -j] %*% hat.beta[-j]
    tilde.beta.j <- as.numeric(t(X[, j]) %*% y.star)/mu_j
    var.beta <- hat.sig2/mu_j
    hat.beta[j] <- rnorm(1, tilde.beta.j, sqrt(var.beta)) # sampling from beta_j | others
  }
  # 2) gamma_j | others: conditionally independent, so draw the whole vector at once
  p.j <- dnorm(hat.beta, 0, sqrt(v1 * hat.sig2)) * w
  q.j <- dnorm(hat.beta, 0, sqrt(v0 * hat.sig2)) * (1 - w)
  prob.j <- p.j/(p.j + q.j)
  hat.gamma <- rbinom(p, 1, prob.j)
  # 3) sig2 | others
  a.star <- (n + p + a0)/2
  v.z_j <- hat.gamma * v1 + (1 - hat.gamma) * v0
  b.star <- (sum((y - X %*% hat.beta)^2) + sum(hat.beta^2/v.z_j) + b0)/2
  hat.sig2 <- rinvgamma(1, shape = a.star, rate = b.star)
  # 4) w | others
  w <- rbeta(1, sum(hat.gamma) + c1, p - sum(hat.gamma) + c2)
  # store the current draws
  hat.BETA[goh, ] <- hat.beta
  hat.Gamma[goh, ] <- hat.gamma
  hat.Sig2[goh] <- hat.sig2
  hat.W[goh] <- w
  # monitor the sampler occasionally instead of at every iteration
  if (goh %% 500 == 0) {
    plot(hat.gamma, main = paste("rep:", goh))
    points(true.gamma, col = 2, pch = "*")
  }
}
colMeans(hat.Gamma[-(1:burn.in), ]) > 0.5 # selected variables after burn-in
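With this simulated data, the posterior inclusion probabilities should typically flag variables 1, 2, and \(p\) (the truly nonzero coefficients), though results vary with the random seed. A short post-processing sketch, reusing the objects created above:

keep <- (burn.in + 1):MC.size                   # discard burn-in draws
pip <- colMeans(hat.Gamma[keep, ])              # estimated P(gamma_j = 1 | y)
which(pip > 0.5)                                # selected variables
which(true.gamma == 1)                          # truth: variables 1, 2, and p
round(colMeans(hat.BETA[keep, ])[pip > 0.5], 2) # posterior means of the selected betas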

Appendix

\[\begin{eqnarray*} \pi(\beta_j|{\boldsymbol \beta}_{-j}, \sigma^2, {\boldsymbol \gamma}, {\bf y})&\propto&f({\bf y}|{\boldsymbol \beta}, \sigma^2)\pi({\boldsymbol \beta}|{\boldsymbol \gamma}, \sigma^2)\\ &=&f({\bf y}|{\boldsymbol \beta}, \sigma^2) \prod_{k=1}^{p}\pi(\beta_k|\gamma_k, \sigma^2)\\ &\propto&f({\bf y}|{\boldsymbol \beta}, \sigma^2)\pi(\beta_j|\gamma_j, \sigma^2)\\ &\propto& \exp\left(-\frac{1}{2\sigma^2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2 \right) \exp\left(-\frac{1}{2\sigma^2\nu_{\gamma_j}}\beta_j^2\right)\\ &=& \exp\left(-\frac{1}{2\sigma^2}\|{\bf y}- {\bf X}_{-j}{\boldsymbol \beta}_{-j} - {\bf x}_j \beta_j\|^2\right) \exp\left(-\frac{1}{2\sigma^2\nu_{\gamma_j}}\beta_j^2\right)\\ &=& \exp\left(-\frac{1}{2\sigma^2}\|{\bf y}^* - {\bf x}_j \beta_j\|^2\right) \exp\left(-\frac{1}{2\sigma^2\nu_{\gamma_j}}\beta_j^2\right)\\ &=&\exp\left[-\frac{1}{2\sigma^2}\left({{\bf y}^*}^{{\top}}{\bf y}^* - 2\beta_j {\bf x}_j^{{\top}}{\bf y}^* + \beta_j^2 {\bf x}_j^{{\top}}{\bf x}_j + \frac{1}{\nu_{\gamma_j}}\beta_j^2\right)\right]\\ &\propto&\exp\left[-\frac{1}{2\sigma^2}\left(\beta_j^2\left({\bf x}_j^{{\top}}{\bf x}_j + \frac{1}{\nu_{\gamma_j}}\right) - 2\beta_j {\bf x}_j^{{\top}}{\bf y}^* \right)\right]\\ &=&\exp\left[-\frac{\mu_j}{2\sigma^2}\left(\beta_j^2 - 2\beta_j \mu_j^{-1}{\bf x}_j^{{\top}}{\bf y}^* \right)\right]\\ &=& \exp\left[-\frac{\mu_j}{2\sigma^2}\left(\beta_j^2 - 2 \beta_j \tilde \beta_j\right)\right]\\ &\propto& \exp\left[-\frac{\mu_j}{2\sigma^2}\left(\beta_j -\tilde \beta_j\right)^2\right], \end{eqnarray*}\]

where \({\bf y}^* = {\bf y}- {\bf X}_{-j}{\boldsymbol \beta}_{-j}, \quad \mu_j= {\bf x}_j^{{\top}}{\bf x}_j + \frac{1}{\nu_{\gamma_j}},\quad \tilde \beta_j = \mu_j^{-1}{\bf x}_j^{{\top}}{\bf y}^*\). Hence, \[\beta_j|{\boldsymbol \beta}_{-j}, \sigma^2, {\boldsymbol \gamma}, {\bf y}\sim \mathcal{N}\left(\tilde \beta_j, \frac{\sigma^2}{\mu_j}\right).\]
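As a quick numerical check of this derivation (a sketch with made-up values, not part of the original code), the log of the unnormalized conditional should differ from the \(\mathcal{N}(\tilde \beta_j, \sigma^2/\mu_j)\) log density by a constant:

set.seed(1)
n <- 50; sig2 <- 2; v.gj <- 1000            # made-up values
xj <- rnorm(n); y.star <- rnorm(n)          # stand-ins for x_j and y*
mu.j <- sum(xj^2) + 1/v.gj
tilde.beta <- sum(xj * y.star)/mu.j
grid <- seq(-2, 2, length.out = 5)          # a few values of beta_j
log.target <- -colSums((y.star - outer(xj, grid))^2)/(2 * sig2) -
  grid^2/(2 * sig2 * v.gj)                  # log(likelihood x prior)
log.normal <- dnorm(grid, tilde.beta, sqrt(sig2/mu.j), log = TRUE)
round(log.target - log.normal, 10)          # constant across the grid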

\[\begin{eqnarray*} \pi(\gamma_j|{\boldsymbol \gamma}_{-j}, {\boldsymbol \beta}, \sigma^2, {\bf y})&\propto& \pi({\boldsymbol \gamma},{\boldsymbol \beta}|\sigma^2)\\ &=&\pi({\boldsymbol \beta}|\sigma^2, {\boldsymbol \gamma})\pi({\boldsymbol \gamma})\\ &=& \prod_{k=1}^{p}\pi(\beta_{k}|\sigma^2, \gamma_{k})\prod_{i=1}^{p}\pi(\gamma_i)\\ &\propto&\pi(\beta_{j}|\sigma^2, \gamma_{j})\pi(\gamma_j)\\ &\propto&\nu_{\gamma_j}^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_{\gamma_j}}\beta_j^2\right)\omega^{\gamma_j} (1-\omega)^{1-\gamma_j}. \end{eqnarray*}\]

Note that

\[\begin{eqnarray*} \pi(\gamma_j = 0|{\boldsymbol \gamma}_{-j}, {\boldsymbol \beta}, \sigma^2, {\bf y})&=& C \nu_0^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_0}\beta_j^2\right)\left(1-\omega\right);\\ \pi(\gamma_j = 1|{\boldsymbol \gamma}_{-j}, {\boldsymbol \beta}, \sigma^2, {\bf y})&=& C \nu_1^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_1}\beta_j^2\right)\omega. \end{eqnarray*}\]

This implies that, writing \(\Omega\) for all the remaining conditioning variables,

\[\begin{eqnarray*} \pi(\gamma_j = 1|{\boldsymbol \gamma}_{-j}, {\boldsymbol \beta}, \sigma^2, {\bf y}) &=&\frac{P(\gamma_j = 1,\Omega)}{\sum_{\gamma_j}P(\gamma_j,\Omega)}\\ &=& \frac{C \nu_1^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_1}\beta_j^2\right)\omega} { C \nu_0^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_0}\beta_j^2\right)\left(1-\omega\right)+ C \nu_1^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_1}\beta_j^2\right)\omega}\\ &=& \frac{ \nu_1^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_1}\beta_j^2\right)\omega} { \nu_0^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_0}\beta_j^2\right)\left(1-\omega\right)+ \nu_1^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_1}\beta_j^2\right)\omega}. \end{eqnarray*}\]

Hence, \[ \gamma_j |{\boldsymbol \gamma}_{-j}, {\boldsymbol \beta}, \sigma^2, {\bf y}\sim Ber\left(\frac{ \nu_1^{-\frac{1}{2}} \exp\left(-\frac{1}{2\sigma^2\nu_1}\beta_j^2\right)\omega} { \nu_0^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_0}\beta_j^2\right)\left(1-\omega\right)+ \nu_1^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_1}\beta_j^2\right)\omega}\right). \]
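When \(\nu_0\) is very small, the spike density can underflow to zero in floating point for moderately large \(|\beta_j|\); if both component densities underflow, the ratio above becomes NaN. A numerically safer way to compute this Bernoulli probability (a sketch with hypothetical values, not how the R code above does it) works on the log scale; the common normalizing factor \((2\pi\sigma^2)^{-1/2}\) cancels in the difference:

beta.j <- 1.5; sig2 <- 2; w <- 0.5; v0 <- 0.001; v1 <- 1000 # hypothetical values
log.p1 <- dnorm(beta.j, 0, sqrt(v1 * sig2), log = TRUE) + log(w)     # gamma_j = 1 term
log.p0 <- dnorm(beta.j, 0, sqrt(v0 * sig2), log = TRUE) + log(1 - w) # gamma_j = 0 term
prob1 <- 1/(1 + exp(log.p0 - log.p1))       # P(gamma_j = 1 | others)
gamma.j <- rbinom(1, 1, prob1)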

\[\begin{eqnarray*} \pi(\sigma^2|{\boldsymbol \beta}, {\boldsymbol \gamma}, {\bf y})&\propto&f({\bf y}|{\boldsymbol \beta},\sigma^2,{\boldsymbol \gamma})\pi({\boldsymbol \beta},\sigma^2|{\boldsymbol \gamma})\\ &=&f({\bf y}|{\boldsymbol \beta}, \sigma^2, {\boldsymbol \gamma})\pi({\boldsymbol \beta}|\sigma^2,{\boldsymbol \gamma})\pi(\sigma^2|{\boldsymbol \gamma})\\ &=&f({\bf y}|{\boldsymbol \beta}, \sigma^2)\pi({\boldsymbol \beta}|\sigma^2,{\boldsymbol \gamma})\pi(\sigma^2)\\ &\propto&(\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2\right)\times \prod_{j=1}^{p}\left[(\sigma^2\nu_{\gamma_j})^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2\nu_{\gamma_j}} \beta_j^2\right)\right]\\ &&\times(\sigma^2)^{-\frac{a}{2}-1}\exp\left(-\frac{b}{2\sigma^2}\right)\\ &\propto&(\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2\right)\times (\sigma^2)^{-\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p}\frac{\beta_j^2}{\nu_{\gamma_j}}\right)\\ &&\times(\sigma^2)^{-\frac{a}{2}-1}\exp\left(-\frac{b}{2\sigma^2}\right)\\ &=& (\sigma^2)^{-\frac{1}{2}(n+p+a)-1}\exp\left(-\frac{\frac{1}{2}\left(\|{\bf y}-{\bf X}{\boldsymbol \beta} \|^2 + \sum_{j=1}^{p}\frac{\beta_j^2}{\nu_{\gamma_j}} + b\right)}{\sigma^2}\right)\\ &=&(\sigma^2)^{-a^* -1}\exp(-\frac{b^*}{\sigma^2}), \end{eqnarray*}\]

where \(a^*=\frac{1}{2}(n+p+a)\) and \(b^* = \frac{1}{2}\left(\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2 + \sum_{j=1}^{p}\frac{\beta_j^2}{\nu_{\gamma_j}} + b\right).\)

Therefore, \[\sigma^2|{\boldsymbol \beta}, {\boldsymbol \gamma}, {\bf y}\sim \mathcal{IG}(a^*,b^*).\]
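One detail worth checking is the parameterization: the `rate` argument of `rinvgamma` in the invgamma package corresponds to \(b^*\) here, since that density is proportional to \(x^{-shape-1}e^{-rate/x}\); equivalently, one can draw a Gamma variate and invert it. A small sanity check with illustrative values:

library(invgamma)
a.star <- 3; b.star <- 4                            # illustrative values
x1 <- rinvgamma(1e5, shape = a.star, rate = b.star)
x2 <- 1/rgamma(1e5, shape = a.star, rate = b.star)  # equivalent draw
c(mean(x1), mean(x2), b.star/(a.star - 1))          # both near the IG mean b*/(a* - 1)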

\[\begin{eqnarray*} \pi(\omega|{\boldsymbol \beta},{\boldsymbol \gamma},\sigma^2,{\bf y})&\propto& \pi({\boldsymbol \gamma}|\omega) \pi(\omega)\\ &\propto&\left[ \prod_{j=1}^p \omega^{\gamma_j}(1-\omega)^{1-\gamma_j}\right] \omega^{c_1-1}(1-\omega)^{c_2-1}\\ &\propto& \omega^{\sum_{j=1}^p \gamma_j+c_1-1}(1-\omega)^{p-\sum_{j=1}^p \gamma_j+c_2-1}. \end{eqnarray*}\]

We therefore have \[\omega|{\boldsymbol \beta},{\boldsymbol \gamma},\sigma^2,{\bf y}\sim \mathcal{B}\left(\sum_{j=1}^p \gamma_j+c_1,\ p-\sum_{j=1}^p \gamma_j+c_2\right).\]
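In the Gibbs sampler this conditional is a one-line update; a self-contained sketch in the notation of the R code above (the `hat.gamma` value here is only a placeholder for illustration):

c1 <- 1; c2 <- 1; p <- 100     # prior hyperparameters and dimension as in the code
hat.gamma <- rbinom(p, 1, 0.3) # placeholder value of gamma
w <- rbeta(1, sum(hat.gamma) + c1, p - sum(hat.gamma) + c2)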

Reference

George, Edward I., and Robert E. McCulloch. 1993. “Variable Selection via Gibbs Sampling.” Journal of the American Statistical Association 88 (423): 881–89.

———. 1997. “Approaches for Bayesian Variable Selection.” Statistica Sinica 7 (2): 339–73. http://www.jstor.org/stable/24306083.