EM Algorithm

Caleb / 2021-03-14


The EM (Expectation-Maximization) algorithm provides an alternative approach to finding MLEs by formulating the model in terms of unobserved/missing data and observed/incomplete data. A good example of the EM algorithm applied to variable selection is EMVS (Rockova and George 2014).

Missing data example:

Missing data represent quantities that allow us to estimate the parameters more easily. Suppose that we have a mixture density \(pf(x) + (1-p)g(x)\), where \(p\) is unknown. If we observe a sample \({\bf X}= (X_1,\dots,X_n)\), the sample density is

\[ \prod_{i=1}^n[pf(x_i)+(1-p)g(x_i)], \] which could be difficult to deal with.

The EM solution is to augment the observed/incomplete data with missing data \({\bf Z}= (Z_1,\dots,Z_n)\), where \(Z_i\) indicates which component \(X_i\) comes from; that is,

\[ X_i|z_i=1\sim f(x_i) ~\text{and}~ X_i|z_i=0\sim g(x_i); P(Z_i=1)=p. \]

The joint density of \((X_i,Z_i)\) is \(P(X_i,Z_i) = P(X_i|Z_i)P(Z_i)\):

\[ P(X_i,Z_i=1) = pf(x_i), \qquad P(X_i,Z_i=0) = (1-p)g(x_i). \]

Therefore, the joint density of \(({\bf X},{\bf Z})\) can be expressed as \(\prod_{i=1}^n[pf(x_i)]^{z_i}[(1-p)g(x_i)]^{1-z_i}\).

Then what is the missing data distribution, \(P(Z_i|X_i)\)?

\[ P(Z_i=1|X_i) = \frac{P(Z_i=1,X_i)}{P(X_i)} = \frac{P(X_i|Z_i=1)P(Z_i=1)}{P(X_i)} = \frac{pf(x_i)}{pf(x_i)+(1-p)g(x_i)}, \]
\[ P(Z_i=0|X_i) = \frac{P(Z_i=0,X_i)}{P(X_i)} = \frac{P(X_i|Z_i=0)P(Z_i=0)}{P(X_i)} = \frac{(1-p)g(x_i)}{pf(x_i)+(1-p)g(x_i)}. \]

Therefore, \(Z_i|X_i\sim Ber(\frac{pf(x_i)}{pf(x_i)+(1-p)g(x_i)})\).
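To make this concrete, here is a minimal numerical sketch (not from the original notes) of the posterior membership probabilities \(P(Z_i=1|x_i)\) for a two-component mixture. The component densities \(f\) and \(g\) are taken to be normal densities purely as an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm

# Illustrative choice (an assumption, not from the notes): f = N(0, 1), g = N(3, 1).
f = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
g = lambda x: norm.pdf(x, loc=3.0, scale=1.0)

def responsibilities(x, p):
    """P(Z_i = 1 | x_i) = p f(x_i) / (p f(x_i) + (1 - p) g(x_i))."""
    num = p * f(x)
    return num / (num + (1.0 - p) * g(x))

# Small demo: data drawn from the mixture with true p = 0.4.
rng = np.random.default_rng(1)
z = rng.binomial(1, 0.4, size=500)
x = np.where(z == 1, rng.normal(0.0, 1.0, 500), rng.normal(3.0, 1.0, 500))
print(responsibilities(x, p=0.5)[:5])   # posterior membership probabilities
```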

Mixture model

To describe the EM algorithm, we derive it for mixture models of the kind used in the missing data example. These models are specified as follows:

\[ f_{{\bf Y}|{\boldsymbol \gamma},{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y}|{\boldsymbol \gamma},{\boldsymbol \beta},{\boldsymbol \theta}) = \prod_{i=1}^n f_{Y_i|\gamma_i,{\boldsymbol \beta},{\boldsymbol \theta}}(y_i|\gamma_i,{\boldsymbol \beta},{\boldsymbol \theta})~~ \text{and} ~~ f_{{\boldsymbol \gamma}|{\boldsymbol \theta}}({\boldsymbol \gamma}|{\boldsymbol \theta})=\prod_{i=1}^n f_{\gamma_i|{\boldsymbol \theta}}(\gamma_i|{\boldsymbol \theta}), \]

where \({\boldsymbol \beta}\) and \({\boldsymbol \theta}\) are unknown. The marginal distribution of \({\bf y}\) is then

\(f_{{\bf Y}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y}|{\boldsymbol \beta},{\boldsymbol \theta}) = \int_{\boldsymbol \gamma}f_{{\bf Y}|{\boldsymbol \gamma},{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y}|{\boldsymbol \gamma},{\boldsymbol \beta},{\boldsymbol \theta}) \times f_{{\boldsymbol \gamma}|{\boldsymbol \theta}}({\boldsymbol \gamma}|{\boldsymbol \theta}) d{\boldsymbol \gamma}\)
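For instance, in the missing data example the "integral" over \({\boldsymbol \gamma}={\bf Z}\) is a sum over \(z_i\in\{0,1\}\), and marginalizing the complete-data density recovers the observed-data mixture density:

\[ \sum_{z_i\in\{0,1\}} [pf(x_i)]^{z_i}[(1-p)g(x_i)]^{1-z_i} = pf(x_i)+(1-p)g(x_i). \]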


EM algorithm derivation

Goal: estimate the parameter vectors \({\boldsymbol \beta}\) and \({\boldsymbol \theta}\) by maximizing the marginal log-likelihood function of \({\boldsymbol \beta}\) and \({\boldsymbol \theta}\), i.e. \(L({\boldsymbol \beta},{\boldsymbol \theta}) = \ln f_{{\bf Y}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y}|{\boldsymbol \beta},{\boldsymbol \theta})\).

We keep two blocks of parameters, \({\boldsymbol \beta}\) and \({\boldsymbol \theta}\), because we wish to separate them. By Bayes' theorem,

\[f_{{\bf Y}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y}|{\boldsymbol \beta},{\boldsymbol \theta}) = \frac{f_{{\bf Y},{\boldsymbol \gamma}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y},{\boldsymbol \gamma}|{\boldsymbol \beta},{\boldsymbol \theta})}{f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta})}.\]

Alternatively, \[\underbrace{\ln f_{{\bf Y}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y}|{\boldsymbol \beta},{\boldsymbol \theta})}_\color{red}{L({\boldsymbol \beta},{\boldsymbol \theta})} = \ln \underbrace{f_{{\bf Y},{\boldsymbol \gamma}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y},{\boldsymbol \gamma}|{\boldsymbol \beta},{\boldsymbol \theta})}_\color{red}{\text{complete data density}} - \ln \underbrace{f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta})}_\color{red}{\text{missing data density}}.\]

Given values \(\color{red}{{\boldsymbol \beta}_t}\) and \(\color{red}{{\boldsymbol \theta}_t}\) of \({\boldsymbol \beta}\) and \({\boldsymbol \theta}\) (at the \(t^{th}\) iteration), the expected value of \(\ln f_{{\bf Y}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y}|{\boldsymbol \beta},{\boldsymbol \theta})\) with respect to \(f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t})\) is

\[\begin{eqnarray} &&E_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}[\ln f_{{\bf Y}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y}|{\boldsymbol \beta},{\boldsymbol \theta})|{\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t}] \\ &=& E_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}[\ln f_{{\bf Y},{\boldsymbol \gamma}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y},{\boldsymbol \gamma}|{\boldsymbol \beta},{\boldsymbol \theta})|{\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t}] - E_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}[\ln f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta})|{\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t}]\tag{1} \end{eqnarray}\]

Since \(L({\boldsymbol \beta},{\boldsymbol \theta}) = \ln f_{{\bf Y}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y}|{\boldsymbol \beta},{\boldsymbol \theta})\) does not depend on \({\boldsymbol \gamma}\), it is constant under this expectation. Hence, (1) can be written as

\[\begin{eqnarray} L({\boldsymbol \beta},{\boldsymbol \theta}) = Q({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) - H({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t),\tag{2} \end{eqnarray}\] where \[ Q({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)= \int_{\boldsymbol \gamma}[\ln f_{{\bf Y},{\boldsymbol \gamma}|{\boldsymbol \beta},{\boldsymbol \theta}}({\bf y},{\boldsymbol \gamma}|{\boldsymbol \beta},{\boldsymbol \theta})]f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) d{\boldsymbol \gamma} \] is the expected value of the complete-data log density with respect to \(f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t})\) and \[ H({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)= \int_{\boldsymbol \gamma}[\ln f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta})]f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) d{\boldsymbol \gamma} \] is the expected value of the missing-data log density with respect to \(f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t})\).
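As an illustration (not part of the original derivation), specialize to the missing data example above with \({\boldsymbol \gamma}={\bf Z}\) and the single parameter \(p\). Writing \(w_i^{(t)} = P(Z_i=1|x_i,p_t) = \frac{p_t f(x_i)}{p_t f(x_i)+(1-p_t)g(x_i)}\), the \(Q\) function has the explicit form

\[ Q(p|p_t) = \sum_{i=1}^n \left\{ w_i^{(t)}\ln[p f(x_i)] + (1-w_i^{(t)})\ln[(1-p) g(x_i)] \right\}, \]

and setting \(\partial Q/\partial p = 0\) gives the closed-form M-step \(p_{t+1} = \frac{1}{n}\sum_{i=1}^n w_i^{(t)}\).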

The key inequality: the missing information principle

\[ H({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\leq H({\boldsymbol \beta}_t,{\boldsymbol \theta}_t|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) \] for any values of \({\boldsymbol \beta}\) and \({\boldsymbol \theta}\).

Proof. By Jensen's inequality (since \(\ln\) is concave, \(E[\ln X]\leq \ln E[X]\)), \[\begin{eqnarray} &&H({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) - H({\boldsymbol \beta}_t,{\boldsymbol \theta}_t|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\\ &=& E_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}[\ln f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta})|{\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t}] - E_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}[\ln f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)|{\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t}]\\ &=& E_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}\left\{\ln \left[ \frac{f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta})}{f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)}\right]\Bigg| {\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t}\right\} \\ &\leq& \ln E_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}\left\{ \left[ \frac{f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta})}{f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)}\right]\Bigg| {\bf y},\color{red}{{\boldsymbol \beta}_t},\color{red}{{\boldsymbol \theta}_t}\right\} \\ &=& \ln\int_{\boldsymbol \gamma}\frac{f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta})}{f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)} f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)d{\boldsymbol \gamma}=\ln\int_{\boldsymbol \gamma}f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta}) d{\boldsymbol \gamma}= 0. \end{eqnarray}\]
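A compact way to read this result (a standard restatement, not part of the original proof): the difference is a negative Kullback-Leibler divergence,

\[ H({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) - H({\boldsymbol \beta}_t,{\boldsymbol \theta}_t|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) = -\,\mathrm{KL}\big( f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}(\cdot|{\bf y},{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\,\big\|\, f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}(\cdot|{\bf y},{\boldsymbol \beta},{\boldsymbol \theta}) \big) \leq 0, \]

since a KL divergence is always nonnegative.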

Continued

(2) can be rearranged as

\[\begin{eqnarray} Q({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) &=& L({\boldsymbol \beta},{\boldsymbol \theta}) +H({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\\ &\leq& L({\boldsymbol \beta},{\boldsymbol \theta}) +H({\boldsymbol \beta}_t,{\boldsymbol \theta}_t|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) \end{eqnarray}\]

The EM algorithm therefore defines the update \[ ({\boldsymbol \beta}_{t+1},{\boldsymbol \theta}_{t+1}) = \mathcal{T}({\boldsymbol \beta}_t,{\boldsymbol \theta}_t) = \arg\max_{({\boldsymbol \beta},{\boldsymbol \theta})}Q({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t), \] so that \(Q({\boldsymbol \beta}_{t+1},{\boldsymbol \theta}_{t+1}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) \geq Q({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\) for any values of \(({\boldsymbol \beta},{\boldsymbol \theta})\). In this context, we define a fixed-point transformation as a transformation that maps \(({\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\) onto itself; that is, \(\mathcal{T}_*\) is a fixed-point transformation if \(({\boldsymbol \beta}_{t},{\boldsymbol \theta}_{t}) = \mathcal{T}_*({\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\).

Proposition (Missing Information Principle). For the transformation \(\mathcal{T}\) defined by maximizing \(Q({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\), and assuming \(L({\boldsymbol \beta},{\boldsymbol \theta})\) and \(H({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\) are differentiable: (a) the maximizer of \(L({\boldsymbol \beta},{\boldsymbol \theta})\) is a fixed point of \(\mathcal{T}\); (b) any fixed point of \(\mathcal{T}\) is a maximizer (or at least a stationary point) of \(L({\boldsymbol \beta},{\boldsymbol \theta})\).

Proof.

(a): To prove that the maximizer of \(L({\boldsymbol \beta},{\boldsymbol \theta})\) is a fixed point of \(\mathcal{T}\), we need to show that if \((\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\) is the maximizer of \(L({\boldsymbol \beta},{\boldsymbol \theta})\), then \((\hat{\boldsymbol \beta},\hat{\boldsymbol \theta}) = \mathcal{T}(\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\).

\[ Q({\boldsymbol \beta},{\boldsymbol \theta}|\hat{\boldsymbol \beta},\hat{\boldsymbol \theta}) = L({\boldsymbol \beta},{\boldsymbol \theta}) + H({\boldsymbol \beta},{\boldsymbol \theta}|\hat{\boldsymbol \beta},\hat{\boldsymbol \theta}) \] Evaluating the right-hand side at \(({\boldsymbol \beta},{\boldsymbol \theta}) = (\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\), both terms \(L({\boldsymbol \beta},{\boldsymbol \theta})\) and \(H({\boldsymbol \beta},{\boldsymbol \theta}|\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\) are maximized simultaneously (the first by assumption, the second by the missing information principle), so \(Q({\boldsymbol \beta},{\boldsymbol \theta}|\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\) is also maximized at \((\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\); hence \(\mathcal{T}(\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})=(\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\).

(b): Proof by contradiction. Given any fixed point \((\hat{\boldsymbol \beta},\hat{\boldsymbol \theta}) = \mathcal{T}(\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\), \(H({\boldsymbol \beta},{\boldsymbol \theta}|\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\) is maximized at \((\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\) by the missing information principle. If \((\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\) were not a maximizer of \(L({\boldsymbol \beta},{\boldsymbol \theta})\), then

\[ Q(\hat{\boldsymbol \beta},\hat{\boldsymbol \theta}|\hat{\boldsymbol \beta},\hat{\boldsymbol \theta}) = L(\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})+ H(\hat{\boldsymbol \beta},\hat{\boldsymbol \theta}|\hat{\boldsymbol \beta},\hat{\boldsymbol \theta}) \] would not be maximized at \((\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\) either, which contradicts the fact that \((\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\) is a maximizer (or stationary point) of \(Q({\boldsymbol \beta},{\boldsymbol \theta}|\hat{\boldsymbol \beta},\hat{\boldsymbol \theta})\).

An important property of the EM algorithm is that the observed-data log-likelihood function \(L({\boldsymbol \beta}, {\boldsymbol \theta})\) increases at each iteration or, more precisely, does not decrease. To see this, suppose we have the \(t^{th}\) and \((t+1)^{th}\) iterates \(({\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\) and \(({\boldsymbol \beta}_{t+1},{\boldsymbol \theta}_{t+1})\). Then, applying (2) at both iterates with the conditioning values \(({\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\) fixed,

\[\begin{eqnarray} &&L({\boldsymbol \beta}_{t+1},{\boldsymbol \theta}_{t+1}) - L({\boldsymbol \beta}_{t},{\boldsymbol \theta}_{t})\\ &=& \underbrace{Q({\boldsymbol \beta}_{t+1},{\boldsymbol \theta}_{t+1}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) - Q({\boldsymbol \beta}_t,{\boldsymbol \theta}_t|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)}_{\color{red}{\geq 0,~ since~ Q~ is~ maximized}} + \underbrace{H({\boldsymbol \beta}_{t},{\boldsymbol \theta}_{t}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t) -H({\boldsymbol \beta}_{t+1},{\boldsymbol \theta}_{t+1}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)}_{\color{red}{\geq 0,~ since~ H({\boldsymbol \beta}_{t},{\boldsymbol \theta}_{t}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)~ is~ maximum}} \end{eqnarray}\]

Summary of the EM algorithm

At each iteration \(t\), the E-step computes \(Q({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\), the expectation of the complete-data log density with respect to \(f_{{\boldsymbol \gamma}|{\bf Y},{\boldsymbol \beta},{\boldsymbol \theta}}({\boldsymbol \gamma}|{\bf y},{\boldsymbol \beta}_t,{\boldsymbol \theta}_t)\), and the M-step updates the parameters by \[ ({\boldsymbol \beta}_{t+1},{\boldsymbol \theta}_{t+1}) = \arg\max_{({\boldsymbol \beta},{\boldsymbol \theta})} Q({\boldsymbol \beta},{\boldsymbol \theta}|{\boldsymbol \beta}_t,{\boldsymbol \theta}_t). \] The two steps are repeated until the iterates converge.
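Putting the pieces together, here is a minimal, self-contained sketch of the full EM iteration for the two-component example above. The normal component densities are an illustrative assumption (not from the notes), and only the mixing weight \(p\) is treated as unknown, so the M-step is the closed form \(p_{t+1}=\frac{1}{n}\sum_i w_i^{(t)}\).

```python
import numpy as np
from scipy.stats import norm

# Illustrative components (an assumption, not from the notes): f = N(0, 1), g = N(3, 1).
f = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
g = lambda x: norm.pdf(x, loc=3.0, scale=1.0)

def em_mixture_weight(x, p0=0.5, max_iter=200, tol=1e-8):
    """EM for the unknown mixing weight p in the density p*f(x) + (1-p)*g(x)."""
    p = p0
    for _ in range(max_iter):
        # E-step: posterior probabilities w_i = P(Z_i = 1 | x_i, p_t).
        w = p * f(x) / (p * f(x) + (1.0 - p) * g(x))
        # M-step: maximize Q(p | p_t), which gives p_{t+1} = mean(w).
        p_new = w.mean()
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p

# Demo on data simulated with true p = 0.4.
rng = np.random.default_rng(1)
z = rng.binomial(1, 0.4, size=2000)
x = np.where(z == 1, rng.normal(0.0, 1.0, 2000), rng.normal(3.0, 1.0, 2000))
print(em_mixture_weight(x))   # should be close to 0.4
```

With the closed-form M-step, each iteration is a single pass over the data; the monotonicity property above guarantees that the observed-data log-likelihood never decreases along the way.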

Reference

Rockova, Veronika, and Edward George. 2014. "EMVS: The EM Approach to Bayesian Variable Selection." Journal of the American Statistical Association 109 (June). https://doi.org/10.1080/01621459.2013.869223.