Bayesian Reduced Rank Regression in Econometrics

Caleb Jin / 2018-05-09


Foreword

After reading the paper Bayesian Reduced Rank Regression in Econometrics (Geweke 1996), I write down notes on its key ideas and R code to implement them.

Introduction

In contemporary society, large amounts of data are generated and collected routinely in various academic and industrial areas, such as engineering, politics, B2C e-commerce, and genomics. Many of the resulting problems can be cast as statistical problems under the framework of the multivariate linear regression model, in which both the response variables and the predictors are high-dimensional.

We assume \(n\) independent observations of the response \({\bf y}_i\in \mathcal{R}^q\) with predictor vector \({\bf x}_i\in\mathcal{R}^p\), \(i=1,2,\ldots,n\). Consider the multivariate linear regression model \[\begin{eqnarray} {\bf Y}={\bf X}{\bf C}+ {\bf E}, \tag{1} \end{eqnarray}\] where \({\bf Y}=({\bf y}_1,{\bf y}_2,\ldots,{\bf y}_n)^{{\top}}\in\mathcal{R}^{n\times q}\) is the response matrix, \({\bf X}=({\bf x}_1,{\bf x}_2,\ldots,{\bf x}_n)^{{\top}}\in\mathcal{R}^{n\times p}\) is the design matrix, \({\bf C}\in\mathcal{R}^{p\times q}\) is the coefficient matrix, and \({\bf E}=({\bf e}_1,{\bf e}_2,\ldots,{\bf e}_n)^{{\top}}\in\mathcal{R}^{n\times q}\) is the disturbance matrix with \({\bf e}_i\)’s \(\overset{iid}{\sim}\mathcal{N}_q({\bf 0},{\boldsymbol \Sigma}_e)\). We assume \({\boldsymbol \Sigma}_e=\sigma^2{\bf I}_q\). Therefore, we have \({\bf Y}\sim\mathcal{MN}({\bf X}{\bf C},{\boldsymbol \Sigma}_e,{\bf I}_n)\).

Bayesian Reduced Rank Regression Model

We decompose the coefficient matrix into two parts as follows: \[\begin{eqnarray} {\bf C}={\bf A}{\bf B}^{{\top}}, \end{eqnarray}\] where \({\bf A}\in\mathcal{R}^{p\times r}\), \({\bf B}\in\mathcal{R}^{q\times r}\), and the rank \(r\leq \min(p,q)\) is assumed known. However, this decomposition is not unique, because for any \(r\times r\) nonsingular matrix \({\bf Q}\), \({\bf C}={\bf A}{\bf B}^{{\top}}={\bf A}{\bf Q}{\bf Q}^{-1}{\bf B}^{{\top}}=\tilde{{\bf A}}\tilde{{\bf B}}^{{\top}}.\) In order to identify the decomposition, we further restrict \({\bf A}\) so that \({\bf A}^{{\top}}=[{\bf I}_r, {{\bf A}^*}^{{\top}}]\), i.e. its first \(r\) rows are fixed at \({\bf I}_r\). The author assumes that \(p({\bf A},{\bf B})\propto\exp\left(-\frac{\tau^2}{2} trace\{ {\bf A}^{{\top}}{\bf A}+{\bf B}^{{\top}}{\bf B}\}\right)\) and \(\sigma^2\sim \mathcal{IG}(\frac{a}{2},\frac{b}{2})\).
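
As a quick numerical check of this identification (a minimal sketch of my own, with arbitrary small dimensions), note that once \({\bf A}=[{\bf I}_r;{\bf A}^*]\), both \({\bf B}\) and \({\bf A}^*\) can be recovered uniquely from \({\bf C}\): the first \(r\) rows of \({\bf C}\) are \({\bf B}^{\top}\), and the remaining rows give \({\bf A}^*\) by a least-squares projection.

set.seed(1)
p <- 5; q <- 4; r <- 2                       # hypothetical small dimensions
A.star <- matrix(rnorm((p - r) * r), p - r, r)
B <- matrix(rnorm(q * r), q, r)
A <- rbind(diag(1, r), A.star)               # A = [I_r ; A*]
C <- A %*% t(B)                              # a rank-r coefficient matrix
B.rec <- t(C[1:r, ])                         # the first r rows of C are exactly B^T
A.star.rec <- C[(r + 1):p, ] %*% B.rec %*% solve(crossprod(B.rec))
max(abs(B.rec - B)); max(abs(A.star.rec - A.star))   # both are numerically zero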

Posterior Distribution

Let \(\tilde{{\bf a}}_k\) and \(\tilde{{\bf b}}_k\) denote the \(k\)th columns of \({\bf A}\) and \({\bf B}\), respectively, and let \({{\bf a}}_j^{{\top}}\) and \({{\bf b}}_l^{{\top}}\) denote the \(j\)th row of \({\bf A}\) and the \(l\)th row of \({\bf B}\), respectively. Similarly, \(\tilde{{\bf x}}_j\) and \(\tilde{{\bf y}}_l\) denote the \(j\)th column of \({\bf X}\) and the \(l\)th column of \({\bf Y}\). A subscript \((j)\) means the \(j\)th row is removed and \((\tilde{j})\) means the \(j\)th column is removed; for example, \({\bf X}_{(\tilde{j})}\) is \({\bf X}\) without its \(j\)th column and \({\bf A}_{(j)}\) is \({\bf A}\) without its \(j\)th row.

\[\begin{eqnarray*} &&p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)\\ &\propto& |{\boldsymbol \Sigma}_e|^{-\frac{q}{2}}\exp\left(-\frac{1}{2}trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}{\boldsymbol \Sigma}_e^{-1}\}\right)\\ &=& (\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}\right)\\ &=&(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)}{\bf B}^{{\top}}) ({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)}{\bf B}^{{\top}})^{{\top}}\}\right)\\ &\times&\exp\left(-\frac{1}{2\sigma^2}trace\{(\tilde{{\bf x}}_j{\bf a}_j^{{\top}}{\bf B}^{{\top}}{\bf B}{\bf a}_j\tilde{{\bf x}}_j^{{\top}}) \}\right)\\ &\times&\exp\left(\frac{1}{\sigma^2}trace\{(\tilde{{\bf x}}_j{\bf a}_j^{{\top}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}})\}\right)\\ &=&(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)}{\bf B}^{{\top}}) ({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)}{\bf B}^{{\top}})^{{\top}}\}\right)\\ &\times&\exp\left(-\frac{1}{2\sigma^2}{\bf a}_j^{{\top}}{\bf B}^{{\top}}{\bf B}{\bf a}_j\tilde{{\bf x}}_j^{{\top}}\tilde{{\bf x}}_j\right) \times \exp\left((-2)\frac{-1}{2\sigma^2}{\bf a}_j^{{\top}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\right). \end{eqnarray*}\]

\[\begin{eqnarray*} &&p({\bf a}_j|{\bf A}_{(j)},{\bf B},{\boldsymbol \Sigma}_e,{\bf Y})\\ &\propto& p({\bf A}|{\bf B},{\boldsymbol \Sigma}_e,{\bf Y})\propto p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) p({\bf A},{\bf B})\\ &\propto& p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) \exp \left(-\frac{\tau^2}{2} trace\{ {\bf A}^{{\top}}{\bf A}+{\bf B}^{{\top}}{\bf B}\}\right) \\ &\propto& p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) \exp \left(-\frac{\tau^2}{2} \sum_{j=1}^{r}{\bf a}_j^{{\top}}{\bf a}_j\right)\\ &\propto& \exp\left(-\frac{1}{2\sigma^2}{\bf a}_j^{{\top}}{\bf B}^{{\top}}{\bf B}{\bf a}_j\tilde{{\bf x}}_j^{{\top}}\tilde{{\bf x}}_j\right) \times \exp\left((-2)\frac{-1}{2\sigma^2}{\bf a}_j^{{\top}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\right)\\ &\times& \exp \left(-\frac{\tau^2}{2}{\bf a}_j^{{\top}}{\bf a}_j\right)\\ &=& \exp\left\{-\frac{1}{2}\left({\bf a}_j^{{\top}}(\sigma^{-2}{\bf B}^{{\top}}{\bf B}\tilde{{\bf x}}_j^{{\top}}\tilde{{\bf x}}_j +{\bf I}_r\tau^2){\bf a}_j-2{\bf a}_j^{{\top}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\sigma^{-2}\right)\right\}\\ &=& \exp\left\{-\frac{1}{2}\left({\bf a}_j^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1} {\bf a}_j-2{\bf a}_j^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1}{{\boldsymbol \Sigma}_j^{A}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\sigma^{-2}\right)\right\}\\ &=& \exp\left\{-\frac{1}{2}\left({\bf a}_j^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1} {\bf a}_j-2{\bf a}_j^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1}{\boldsymbol \mu}_j^{A}\right)\right\}\\ &\propto& \exp\left\{-\frac{1}{2}\left(({\bf a}_j-{\boldsymbol \mu}_j^{A})^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1}({\bf a}_j-{\boldsymbol \mu}_j^{A}) \right)\right\}, \end{eqnarray*}\] where \({\boldsymbol \Sigma}_j^{A}={({\bf B}^{{\top}}{\bf B}\tilde{{\bf x}}_j^{{\top}}\tilde{{\bf x}}_j\sigma^{-2} +{\bf I}_r\tau^2)}^{-1}\) and \({\boldsymbol \mu}_j^{A}={{\boldsymbol \Sigma}_j^{A}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\sigma^{-2}\).

Hence, \({\bf a}_j|{\bf A}_{(j)},{\bf B},{\boldsymbol \Sigma}_e,{\bf Y}\sim \mathcal{N}_r({\boldsymbol \mu}_j^A,{\boldsymbol \Sigma}_j^A)\). \[\begin{eqnarray*} &&p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)\\ &\propto& |{\boldsymbol \Sigma}_e|^{-\frac{q}{2}}\exp\left(-\frac{1}{2}trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}{\boldsymbol \Sigma}_e^{-1}({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})\}\right)\\ &=& (\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})\}\right)\\ &=&(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{({\bf Y}_{(\tilde{l})} -{\bf X}{\bf A}({\bf B}^{{\top}})_{(\tilde{l})})^{{\top}}({\bf Y}_{(\tilde{l})}-{\bf X}{\bf A}({\bf B}^{{\top}})_{(\tilde{l})})\}\right)\\ &\times& \exp\left(-\frac{1}{2\sigma^2}\left[{\bf b}_{l}^{{\top}} ({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}{\bf b}_{l}- 2{\bf b}_{l}^{{\top}}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l+\tilde{{\bf y}}_l^{{\top}}\tilde{{\bf y}}_l\right]\right). \end{eqnarray*}\]

\[\begin{eqnarray*} &&p({\bf b}_l|{\bf A},({\bf B}^{{\top}})_{(\tilde{l})},{\bf Y},{\boldsymbol \Sigma}_e)\\ &\propto& p({\bf B}|{\bf A},{\bf Y},{\boldsymbol \Sigma}_e) \propto p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)p({\bf A},{\bf B})\\ &\propto & p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) \exp \left(-\frac{\tau^2}{2} trace\{ {\bf A}^{{\top}}{\bf A}+{\bf B}^{{\top}}{\bf B}\}\right) \\ &\propto& p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) \exp \left(-\frac{\tau^2}{2} \sum_{l=1}^{r}{\bf b}_l^{{\top}}{\bf b}_l\right)\\ &\propto& \exp\left(-\frac{1}{2\sigma^2}\left[{\bf b}_{l}^{{\top}} ({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}{\bf b}_{l}- 2{\bf b}_{l}^{{\top}}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l+\tilde{{\bf y}}_l^{{\top}}\tilde{{\bf y}}_l\right]\right) \exp \left(-\frac{\tau^2}{2}{\bf b}_l^{{\top}}{\bf b}_l\right)\\ &=& \exp\left(-\frac{1}{2}\left[{\bf b}_{l}^{{\top}} ({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}{\bf b}_{l}\sigma^{-2}- 2{\bf b}_{l}^{{\top}}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}+ \tilde{{\bf y}}_l^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}\right]\right) \exp \left(-\frac{\tau^2}{2}{\bf b}_l^{{\top}}{\bf b}_l\right)\\ &=& \exp\left\{-\frac{1}{2}\left({\bf b}_{l}^{{\top}}( ({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}\sigma^{-2}+{\bf I}_r\tau^2){\bf b}_{l}- 2{\bf b}_{l}^{{\top}}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}\right)\right\}\\ &=& \exp\left\{-\frac{1}{2}\left({\bf b}_{l}^{{\top}}{{\boldsymbol \Sigma}_j^B}^{-1}{\bf b}_{l}- 2{\bf b}_{l}^{{\top}}{{\boldsymbol \Sigma}_j^B}^{-1}{{\boldsymbol \Sigma}_j^B}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}\right)\right\}\\ &=& \exp\left\{-\frac {1}{2}\left({\bf b}_{l}^{{\top}}{{\boldsymbol \Sigma}_j^B}^{-1}{\bf b}_{l}- 2{\bf b}_{l}^{{\top}}{{\boldsymbol \Sigma}_j^B}^{-1}{\boldsymbol \mu}_j^B\right)\right\}\\ &\propto& \exp\left\{-\frac{1}{2}\left(({\bf b}_{l}-{\boldsymbol \mu}_j^B)^{{\top}}{{\boldsymbol \Sigma}_j^B}^{-1}({\bf b}_{l}-{\boldsymbol \mu}_j^B) \right)\right\}, \end{eqnarray*}\] where \({\boldsymbol \Sigma}_j^B= (({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}\sigma^{-2}+{\bf I}_r\tau^2)^{-1}\) and \({\boldsymbol \mu}_j^B={{\boldsymbol \Sigma}_j^B}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}\).

Hence, \({\bf b}_l|{\bf A},({\bf B}^{{\top}})_{(\tilde{l})},{\bf Y},{\boldsymbol \Sigma}_e\sim \mathcal{N}_r({\boldsymbol \mu}_j^B,{\boldsymbol \Sigma}_j^B)\).

We know that the element in \(k\)th row and \(k\)th column of \({\boldsymbol \Sigma}_e\) is \(\sigma^2, k=1,2,\ldots,q\). \[\begin{eqnarray*} &&p(\sigma^2|{\bf A},{\bf B},{\bf Y},({\boldsymbol \Sigma}_e)_{(k)(\tilde{k})})\\ &\propto& p({\boldsymbol \Sigma}_e|{\bf A},{\bf B},{\bf Y})\propto p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)p({\boldsymbol \Sigma}_e)\propto p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)p(\sigma^2)\\ &=& (\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}\right)\\ &\times & (\sigma^2)^{-\frac{a}{2}-1}\exp\left(-\frac{b}{2\sigma^2}\right)\\ &=& (\sigma^2)^{-\frac{nq+a}{2}-1}\exp\left(-\frac{1}{2\sigma^2}(trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}+b)\right). \end{eqnarray*}\] Hence, \(\sigma^2|{\bf A},{\bf B},{\bf Y},({\boldsymbol \Sigma}_e)_{(k)(\tilde{k})}\sim \mathcal{IG}\left(\frac{nq+a}{2},\frac{1}{2}(trace\{({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}+b)\right)\).

Gibbs Sampling

The Gibbs sampler is straightforward to construct. Begin with initial values \({\bf A}^{(0)}, {\bf B}^{(0)}, {\sigma^2}^{(0)}\); then for \(t = 1,2,\ldots\)

  • Step (1) Given \({\bf A}_{(j)}^{(t-1)},{\bf B}^{{\top}(t-1)}, {\sigma^2}^{(t-1)}\), draw \({\bf a}_j^{(t)}\) from \(\mathcal{N}_r({\boldsymbol \mu}_j^A,{\boldsymbol \Sigma}_j^A),j=r+1,r+2,\ldots,p\);
  • Step (2) Given \({\bf A}^{(t)},{\bf B}^{{\top}(t-1)}_{(\tilde{l})}, {\sigma^2}^{(t-1)}\), draw \({\bf b}_l^{(t)}\) from \(\mathcal{N}_r({\boldsymbol \mu}_j^B,{\boldsymbol \Sigma}_j^B),l=1,2,\ldots,q\);
  • Step (3) Given \({\bf B}^{(t)}, {\bf A}^{(t)}\), draw \({\sigma^2}^{(t)}\) from \(\mathcal{IG}\left(a^*,b^*\right)\), where \(a^*=\frac{nq+a}{2}\) and \(b^*=\frac{1}{2}\left(trace\{({\bf Y}-{\bf X}{\bf A}^{(t)}{\bf B}^{(t){\top}})({\bf Y}-{\bf X}{\bf A}^{(t)}{\bf B}^{(t){\top}})^{{\top}}\}+b\right)\).
  • Step (4) Repeat step (1) to step (3) until convergence.

Note that step (1) samples \({\bf a}_j\) only for \(j=r+1,\ldots,p\), because the first \(r\) rows of \({\bf A}\) are fixed at \({\bf I}_r\) and hence known. In other words, step (1) samples \({\bf A}^*\) and then stacks \({\bf I}_r\) on top of it to reconstruct \({\bf A}\).
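
For concreteness, below is a minimal sketch of one Gibbs sweep using the full conditionals derived above. It assumes Y, X, the current A (with its first \(r\) rows fixed at \({\bf I}_r\)), B (of dimension \(q\times r\)), sig2, and the hyperparameters tau2, a, b are already defined; the complete simulation code at the end of the post wraps the same updates in a replication loop.

library(mvtnorm)    # rmvnorm for the multivariate normal draws
library(invgamma)   # rinvgamma for the inverse-gamma draw

gibbs_sweep <- function(Y, X, A, B, sig2, r, tau2, a, b) {
  n <- nrow(Y); q <- ncol(Y); p <- ncol(X)
  # Step (1): update the free rows a_j of A (the rows of A*), j = r+1, ..., p
  if (p > r) {
    for (j in (r + 1):p) {
      Sig.A_j <- solve(crossprod(B) * as.numeric(crossprod(X[, j])) / sig2 + diag(tau2, r))
      mu.A_j  <- Sig.A_j %*% t(B) %*% t(Y - X[, -j, drop = FALSE] %*% A[-j, , drop = FALSE] %*% t(B)) %*% X[, j] / sig2
      A[j, ]  <- rmvnorm(1, mean = as.vector(mu.A_j), sigma = Sig.A_j)
    }
  }
  # Step (2): update the rows b_l of B, l = 1, ..., q
  Sig.B <- solve(crossprod(X %*% A) / sig2 + diag(tau2, r))
  for (l in 1:q) {
    mu.B_l <- Sig.B %*% t(X %*% A) %*% Y[, l] / sig2
    B[l, ] <- rmvnorm(1, mean = as.vector(mu.B_l), sigma = Sig.B)
  }
  # Step (3): update sigma^2 from its inverse-gamma full conditional
  rss  <- sum((Y - X %*% A %*% t(B))^2)
  sig2 <- rinvgamma(1, shape = (n * q + a) / 2, rate = (rss + b) / 2)
  list(A = A, B = B, sig2 = sig2)
}

# e.g. state <- gibbs_sweep(Y, X, hat.A, hat.B, hat.sig2, r, tau2 = 1e-3, a = 1, b = 1)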

Model Selection

To this point we have proceeded as if \(r\) were known. In most applications this will not be true, so the analysis up to this point is conditional on \(r\). When \(r\) is unknown, the analysis may be carried out for several alternative values of \(r\). Under the Bayesian framework, model selection can be implemented using the posterior model probability given the data, \[ p(M_r|y)=\frac{p(y|M_r)p(M_r)}{\sum_{r\in \mathcal{M}}p(y|M_r)p(M_r)}= \frac{p(y|M_r)}{\sum_{r\in \mathcal{M}}p(y|M_r)}, \] where \(\mathcal{M}=\{1,2,\ldots,\min(p,q)\}\) and \(p(M_r)=\frac{1}{card(\mathcal{M})}\).

Hence, since the prior over ranks (models) is flat, \(p(M_r|y)\) is determined solely by the marginal likelihood \(p(y|M_r)\). The model selection problem is then converted into finding the rank that maximizes the marginal likelihood \(p(y|M_r)\). To calculate \(p(y|M_r)\), I use the Laplace approximation and the Gelfand and Dey (GD) method, and then compare them with DIC.
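
With the flat prior over ranks, the posterior model probabilities follow from the log marginal likelihoods by exponentiating and normalizing. A minimal sketch, assuming logml is the vector of \(\log p(y|M_r)\) for \(r=1,\ldots,\min(p,q)\):

# posterior probability of each rank under the flat prior p(M_r) = 1/card(M)
post_prob <- function(logml) {
  w <- exp(logml - max(logml))   # subtract the max for numerical stability
  w / sum(w)
}
# the selected rank is then which.max(post_prob(logml))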

Laplace

Let \({\boldsymbol \theta}= ({\bf A},{\bf B},\sigma^2)\). Then \[\begin{eqnarray*} p({\bf Y}|M_r) &=& \int \int \int p({\bf Y}|{\boldsymbol \theta},M)p({\boldsymbol \theta}|M) d{{\bf A}}\,d{{\bf B}}\,d{\sigma^2}\\ &\approx& p({\bf Y}|\hat{{\boldsymbol \theta}},M)p(\hat{\boldsymbol \theta}|M)|(nq)^{-1}\hat{\boldsymbol \Sigma}_M|^{1/2} (2\pi)^{k_M/2}, \end{eqnarray*}\] where \((\hat{\bf A},\hat{\bf B},\hat\sigma^2)=\arg \max p({\bf Y}|{{\bf A}},{{\bf B}},{\sigma^2},M)p({\bf A},{\bf B},\sigma^2|M)\), \(k_M\) is the number of free parameters under \(M_r\), and \(\hat{\boldsymbol \Sigma}_M\) comes from the Hessian term of the Laplace expansion at the posterior mode.

Hence, \[\begin{eqnarray} &&\log p({\bf Y}|M_r) \nonumber\\ &\approx& \log p({\bf Y}|\hat{{\boldsymbol \theta}},M) + \log p(\hat{\boldsymbol \theta}|M) -\frac{1}{2}k_M\log (nq) +\frac{1}{2}\log|\hat{\boldsymbol \Sigma}_M| + \frac{k_M}{2}\log 2\pi\nonumber\\ &=&-\frac{1}{2} \left(-2\log p({\bf Y}|\hat{{\bf A}},\hat{{\bf B}},\hat{\sigma}^2,M) + k_M \log (nq)\right) +C\nonumber\\ &=&-\frac{1}{2}\left(nq\log(2\pi\hat{\sigma}^2)+ \frac{1}{\hat{\sigma}^2} trace\left\{({\bf Y}-{\bf X}\hat{\bf C})^{{\top}}({\bf Y}-{\bf X}\hat{\bf C}) \right\}\right. \nonumber\\ &&\left.+\, [r(p+q-r)+1] \log(nq)\right)+C\nonumber\\ &=& -\frac{1}{2}BIC + C\nonumber\\ &=& -\frac{1}{2}BIC, \text{ as } n\rightarrow \infty, \tag{2} \end{eqnarray}\] where \(k_M=r(p+q-r)+1\) counts the free parameters (the \((p-r)r\) entries of \({\bf A}^*\), the \(qr\) entries of \({\bf B}\), and \(\sigma^2\)), and \(C\) collects the terms that become negligible as \(n\rightarrow \infty\).

The difficulty in applying the Laplace approximation here is that we need to find the mode of \(p({\bf A},{\bf B},\sigma^2|{\bf Y})\), where \(({\bf A},{\bf B},\sigma^2)\) is high-dimensional. To address this problem, Besag proposed the iterated conditional modes (ICM) algorithm, which finds a local maximum of the joint posterior by iteratively maximizing the full conditionals as follows:

  • Begin with initial values \({\bf A}^{(0)},{\bf B}^{(0)},{\sigma^2}^{(0)}\), then for \(t = 1,2,\ldots\)
  • Given \({\bf A}_{(j)}^{(t-1)},{\bf B}^{{\top}(t-1)}, {\sigma^2}^{(t-1)}\), \({\bf a}_j^{(t)}\leftarrow{\boldsymbol \mu}_j^A,j=r+1,r+2,\ldots,p\);
  • Given \({\bf A}^{(t)},{\bf B}^{{\top}(t-1)}_{(\tilde{l})}, {\sigma^2}^{(t-1)}\), \({\bf b}_l^{(t)}\leftarrow{\boldsymbol \mu}_j^B,l=1,2,\ldots,q\);
  • Given \({\bf B}^{(t)}, {\bf A}^{(t)}\), \({\sigma^2}^{(t)}\leftarrow\frac{b^*}{a^*+1}\), the mode of the \(\mathcal{IG}(a^*,b^*)\) full conditional.

We obtain \({\bf C}^{(t)}={\bf A}^{(t)}{\bf B}^{(t){\top}}\) and estimate \(\hat{{\bf C}}_{ICM}=T^{-1}\sum_{t=1}^{T}{\bf C}^{(t)}\). Plugging \(\hat{{\bf C}}_{ICM}\) into Eq. (2) gives \(-\frac{1}{2}BIC\). Hence, the Laplace approximation here is essentially equivalent to BIC.
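
The following is my own compact sketch of this ICM/BIC recipe, not code from the paper; it keeps the final ICM iterate of \({\bf C}\) rather than averaging over iterations, assumes the hyperparameters tau2, a, b used in the simulation below, and returns \(-\frac{1}{2}BIC\), i.e. the Laplace approximation to \(\log p({\bf Y}|M_r)\) up to a constant.

icm_bic <- function(Y, X, r, tau2 = 1e-3, a = 1, b = 1, n.iter = 100) {
  n <- nrow(Y); q <- ncol(Y); p <- ncol(X)
  # least-squares-based starting values (same idea as the Gibbs code below)
  C0 <- coef(lm(Y ~ X - 1))
  B  <- t(C0[1:r, , drop = FALSE])                               # q x r
  A  <- if (r == p) diag(1, r) else
    rbind(diag(1, r), C0[(r + 1):p, , drop = FALSE] %*% B %*% solve(crossprod(B)))
  sig2 <- mean((Y - X %*% A %*% t(B))^2)
  for (it in 1:n.iter) {
    if (p > r) {
      for (j in (r + 1):p) {                                     # conditional mode of a_j
        Sig.A_j <- solve(crossprod(B) * as.numeric(crossprod(X[, j])) / sig2 + diag(tau2, r))
        A[j, ] <- Sig.A_j %*% t(B) %*% t(Y - X[, -j, drop = FALSE] %*% A[-j, , drop = FALSE] %*% t(B)) %*% X[, j] / sig2
      }
    }
    Sig.B <- solve(crossprod(X %*% A) / sig2 + diag(tau2, r))    # conditional mode of b_l
    for (l in 1:q) B[l, ] <- Sig.B %*% t(X %*% A) %*% Y[, l] / sig2
    rss  <- sum((Y - X %*% A %*% t(B))^2)
    sig2 <- ((rss + b) / 2) / ((n * q + a) / 2 + 1)              # mode of IG(a*, b*)
  }
  C.hat <- A %*% t(B)                                            # final ICM iterate
  rss   <- sum((Y - X %*% C.hat)^2)
  k     <- r * (p + q - r) + 1                                   # free parameters in the rank-r model
  bic   <- n * q * log(2 * pi * sig2) + rss / sig2 + k * log(n * q)
  -bic / 2                                                       # ~ log p(Y | M_r) up to a constant
}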

DIC

Let \({\boldsymbol \theta}=({\bf A},{\bf B},\sigma^2)\), \[\begin{eqnarray*} &&D({\boldsymbol \theta})\\ &=&-2\log p({\bf Y}|{\boldsymbol \theta},M) \\ &=&nq\log(2\pi\sigma^2)+ \frac{1}{\sigma^2} trace\left\{({\bf Y}-{\bf X}{\bf C})^{{\top}}({\bf Y}-{\bf X}{\bf C}) \right\} \end{eqnarray*}\]

Then DIC can be computed by \[ DIC \approx 2T^{-1}\sum_{t=1}^{T}D({\boldsymbol \theta}^{(t)})-D(T^{-1}\sum_{t=1}^{T}{\boldsymbol \theta}^{(t)}), \] where \({\boldsymbol \theta}^{(t)}\) is an MCMC sample generated from \(p({\boldsymbol \theta}|{\bf Y})\).

Note that \(DIC=D(\bar{{\boldsymbol \theta}})+2P_D\), where \(P_D=\overline{D({\boldsymbol \theta})}-D(\bar{{\boldsymbol \theta}})\) is the effective number of parameters, so DIC is analogous to AIC.
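
For reference, DIC can be computed directly from stored post-burn-in draws of \({\bf C}\) and \(\sigma^2\). A minimal sketch, assuming Hat.C is a \(p\times q\times T\) array and Hat.sig2 a length-\(T\) vector, matching the objects used in the simulation code at the end of the post:

# deviance D(theta) and DIC from stored post-burn-in draws
deviance_rrr <- function(Y, X, C, sig2) {
  n <- nrow(Y); q <- ncol(Y)
  n * q * log(2 * pi * sig2) + sum((Y - X %*% C)^2) / sig2
}
dic_rrr <- function(Y, X, Hat.C, Hat.sig2) {
  T.mc  <- length(Hat.sig2)
  D.bar <- mean(sapply(1:T.mc, function(t) deviance_rrr(Y, X, Hat.C[, , t], Hat.sig2[t])))
  D.hat <- deviance_rrr(Y, X, apply(Hat.C, c(1, 2), mean), mean(Hat.sig2))
  2 * D.bar - D.hat       # DIC = 2 * D.bar - D(theta.bar)
}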

Gelfand & Dey (GD)

Let \({\boldsymbol \theta}=({\bf A},{\bf B},\sigma^2)\); then the GD estimator is \[p({\bf Y}|M)\approx\left[T^{-1}\sum_{t=1}^{T}\frac{g({\boldsymbol \theta}^{(t)})}{p({\bf Y}|{\boldsymbol \theta}^{(t)})p({\boldsymbol \theta}^{(t)})}\right]^{-1},\] where \({\boldsymbol \theta}^{(t)}\) is an MCMC sample generated from \(p({\boldsymbol \theta}|{\bf Y})\). Define \(g({\boldsymbol \theta})=N({\boldsymbol \theta}|\tilde{{\boldsymbol \theta}},\tilde{{\boldsymbol \Sigma}})\), where \(\tilde{{\boldsymbol \theta}}\) and \(\tilde{{\boldsymbol \Sigma}}\) are the MCMC sample mean and variance, respectively. I use the following formula: \[\begin{eqnarray} \log p({\bf Y}|M)&\approx& \log \left[\frac{1}{T}\sum_{t=1}^{T}\frac{g^{(t)}}{f^{(t)}}\right]^{-1}\nonumber\\ &=&\log \left[\frac{T}{\sum_{t=1}^{T}\frac{g^{(t)}}{f^{(t)}}}\right]\nonumber\\ &=&\log T - \log \left(\sum_{t=1}^{T}\frac{g^{(t)}}{f^{(t)}}\right)\nonumber\\ &=&\log T - \log \left(\sum_{t=1}^{T}\exp (\log g^{(t)}-\log f^{(t)})\right)\tag{3}\\ &=&\log T - \log \left(\sum_{t=1}^{T}\exp c_t\right)\nonumber\\ &=&\log T - \log \left(e^{ c^{*} }\sum_{t=1}^{T} e^{c_t-c^{*}}\right)\nonumber\\ &=&\log T - \left[c^{*} + \log \left(\sum_{t=1}^{T} \exp(c_t-c^{*})\right)\right]\nonumber, \end{eqnarray}\] where \(g^{(t)}=g({\boldsymbol \theta}^{(t)})\), \(f^{(t)}=p({\bf Y}|{\boldsymbol \theta}^{(t)})p({\boldsymbol \theta}^{(t)})\), \(c_t = \log g^{(t)}-\log f^{(t)}\), and \(c^{*}=\max_t c_t\).

Note that when calculating \(\log p({\bf Y}|M)\), directly evaluating \(\exp (\log g^{(t)}-\log f^{(t)})\) in formula (3) can overflow, because \(\log g^{(t)}-\log f^{(t)}\) can have a very large magnitude. To avoid this, I use the last expression above, \(\log T - \left[c^{*} + \log \left(\sum_{t=1}^{T} \exp(c_t-c^{*})\right)\right]\) (the log-sum-exp trick), to compute \(\log p({\bf Y}|M)\).
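
A minimal sketch of this computation, assuming log.g and log.f are length-\(T\) vectors holding \(\log g({\boldsymbol \theta}^{(t)})\) and \(\log\left[p({\bf Y}|{\boldsymbol \theta}^{(t)})p({\boldsymbol \theta}^{(t)})\right]\) for the post-burn-in draws:

log_gd <- function(log.g, log.f) {
  T.mc  <- length(log.g)
  c.t   <- log.g - log.f                     # c_t = log g^(t) - log f^(t)
  c.max <- max(c.t)
  # log p(Y | M) ~ log T - [c* + log sum_t exp(c_t - c*)], the log-sum-exp trick
  log(T.mc) - (c.max + log(sum(exp(c.t - c.max))))
}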

Simulation Study

Data Generation

In the simulation study, my goal is to recover the true rank of the coefficient matrix \({\bf C}\) using the Laplace, DIC and GD methods. I set \(n=100\), \(q=12\), \(p=7\) and \(\sigma^2=2\). The coefficient matrix is as follows:

\[{\bf C}=\left[ \begin{array}{@{}*{12}{c}@{}} 1 & 0 & 0 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 1 & -1\\ 0 & 1 & 0 & 0 & 0 & -3 & 2 & 0 & 0 & 0 & -1 & 3\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 3 & -3 & 4 & 2 & 2\\ 1 & 0 & 0 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 1 & -1\\ 0 & 1 & 0 & 0 & 0 & -3 & 2 & 0 & 0 & 0 & -1 & 3\\ 0 & 1 & 0 & 0 & 0 & -3 & 2 & 0 & 0 & 0 & -1 & 3\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 3 &-3 &4 & 2 & 2 \end{array} \right]\] Hence, the true rank of \({\bf C}\) is 3.

The data are generated from the model in Eq. (1). For the prior, \(p({\bf A},{\bf B})\propto\exp\left(-\frac{\tau^2}{2} trace\{ {\bf A}^{{\top}}{\bf A}+{\bf B}^{{\top}}{\bf B}\}\right)\) and \(\sigma^2\sim \mathcal{IG}(\frac{a}{2},\frac{b}{2})\), where \(\tau^2=10^{-3}\) and \(a=b=1\).

After running the MCMC simulation, the results based on one replication are shown in Table 1.

Table 1: Model selection by Laplace, DIC and GD based on one replication; columns are candidate ranks \(r=1,\ldots,7\) (Laplace and GD are on the log scale).
Rank \(r\)      1      2      3      4      5      6      7
Laplace     -3007  -2535  -2223  -2257  -2285  -2310  -2329
DIC         84300  17779   7302   7317   7325   7363   7405
GD          -3062  -2631  -2347  -2397  -2436  -2470  -2495
MSE         1.032  0.186  0.0120 0.014  0.0170 0.0192 0.0217

The MSE in the table is defined as \[ MSE = \frac{trace\{(\hat {\bf C}-{\bf C})^{{\top}}(\hat {\bf C}-{\bf C})\}}{pq}. \] The MSE measures the accuracy of the estimated coefficient matrix in the reduced rank regression model. We observe that the MSE is minimized when the rank is 3, indicating that parameter estimation is accurate at the true rank. In this case the Laplace, DIC and GD methods all select the true rank. Given rank\(({\bf C})=3\), the estimated \({\bf C}\) is as follows:

\[\hat{\bf C}=\left[ \begin{array}{@{}*{12}{c}@{}} 1.07 & 0.17 & 0.01 & 2.34 & -0.89 & -0.12 & -0.11 & -0.08 & -0.07 & 0.03 & 1.26 & -1.10\\ -0.02 & 1.00 & 0.04 & -0.14 & -0.17 & -2.83 & 2.10 & 0.10 & 0.03 & 0.03 & -1.04 & 3.17\\ 0.13 & -0.03 & 0.99 & 0.18 & -0.02 & 0.08 & 0.00 & 3.06 & -2.85 & 4.03 & 2.10 & 2.06\\ 0.94 & 0.14 & 0.03 & 2.05 & -0.78 & -0.08 & -0.11 & -0.03 & -0.10 & 0.07 & 1.13 & -0.95\\ 0.04 & 0.99 & 0.04 & -0.00 & -0.21 & -2.79 & 2.05 & 0.10 & 0.01 & 0.04 & -0.95 & 3.05\\ 0.12 & 1.05 & 0.00 & 0.18 & -0.29 & -2.93 & 2.14 & -0.02 & 0.12 & -0.12 & -0.98 & 3.03\\ 0.08 & -0.01 & 0.98 & 0.08 & 0.01 & 0.03 & 0.04 & 3.03 & -2.81 & 3.98 & 2.01 & 2.14 \end{array} \right]\]

In addition, when the selected rank is larger than 3, the MSE does not deteriorate much; in other words, overfitted models (with rank larger than the true rank) still estimate \({\bf C}\) well, and the corresponding values of Laplace, DIC and GD stay close to their values at the true rank. This is reasonable behavior for overfitted models. In contrast, when the selected rank is smaller than the true rank, the MSE increases sharply and the values of Laplace, DIC and GD move far away from their values at the true rank, meaning that the model misses important components of \({\bf C}\). This analysis is based on one replication.

Table 2 reports results based on 1000 replications, from which I examine the model selection performance of the three methods. Laplace and GD always select the true rank, whereas DIC tends to select a larger model: the average selected rank for DIC is about 3.5, while that of Laplace and GD is 3. Accordingly, the probability of selecting the true rank is 100% for Laplace and GD but only 61.7% for DIC. This makes sense because Laplace and GD are, respectively, an approximation to and a simulation-based estimate of the marginal likelihood, so both target model fit directly, whereas DIC is analogous to AIC, which tends to select larger models and is geared toward short-term prediction.

Table 2: Comparison among Laplace, DIC and GD based on 1000 replications.
Method      Mean of Selected Rank    Selection Probability
Laplace 3 1
DIC 3.525 0.617
GD 3 1

R Code for Bayesian Reduced Rank Regression with DIC Used in the Simulation Study

library(mvtnorm)
library(invgamma)
rm(list=ls())
## Data & Model##
row1 <- c(1,0,0,2,-1,0,0,0,0,0,1,-1)
row2 <- c(0,1,0,0,0,-3,2,0,0,0,-1,3)
row3 <- c(0,0,1,0,0,0,0,3,-3,4,2,2)
C <- rbind(row1,row2,row3,row1,row2,row2,row3)
true.rank <- 3
p <- dim(C)[1]
q <- dim(C)[2]
true.sig2 <- 2
true.SIGe <- diag(true.sig2,q)
n=100
a=b=1
tau2=1e-3
MC.size=5000
burn_in=3000
reptn <- 125
error <- array(NA,dim = c(reptn,p))
DIC <- array(NA,dim = c(reptn,p))
TPR.DIC <- rep(NA,reptn)
TPR.error <- rep(NA,reptn)
hat.rank <- rep(NA,reptn)
v=0   # batch index: replications are run in batches of reptn, with seeds offset by 125*v
for (i in 1:reptn) {
  t1=Sys.time()
  set.seed(2144+i+125*v)
  X <- matrix(rnorm(n*p,0,1),n,p)
  E <- rmvnorm(n, rep(0,q), true.SIGe, method="chol")
  Y=X%*%C+E
  ## Reduced Rank Regression model ##
  for (r in 1:p) {
    Hat.A <- array(NA,dim = c(p,r,MC.size))
    Hat.B <- array(NA,dim = c(q,r,MC.size))
    Hat.C <- array(NA,dim = c(p,q,MC.size))
    Hat.sig2 <- rep(NA,MC.size)
    # initial values
    hat.C <- coef(lm(Y~X-1))
    if (r==1){
      hat.B <- as.matrix(hat.C[1:r,])
    } else {
      hat.B <- t(hat.C[1:r,])
    }
    if (r==p) {
      hat.A <- diag(1,r)
    } else {
      hat.A <- rbind(diag(1,r),hat.C[(r+1):p,]%*%hat.B%*%solve(crossprod(hat.B)))
    }

    hat.sig2 <- mean((Y-X%*%hat.A%*%t(hat.B))^2)  # initial value for sigma^2
    ## MCMC ##
    for (jin in 1:MC.size) {
      # Sampling A
      if (r==p){
        Hat.A[,,jin] <- hat.A
      } else {
        for(j in (r+1):p) {
          Sig.A_j <- solve(crossprod(hat.B)*as.numeric(crossprod(X[,j]))/hat.sig2+diag(tau2,r))
          mu.A_j <- Sig.A_j%*%t(hat.B)%*%t(Y-X[,-j]%*%hat.A[-j,]%*%t(hat.B))%*%X[,j]/hat.sig2
          hat.a_j <- rmvnorm(n=1,mean = mu.A_j,sigma = Sig.A_j)
          hat.A[j,] <- hat.a_j
        }
        Hat.A[,,jin] <- hat.A
      }
      # Sampling B
      Sig.B_l <- solve(crossprod(X%*%hat.A)/hat.sig2+diag(tau2,r))
      for(l in 1:q) {
        mu.B_l <- Sig.B_l%*%t(X%*%hat.A)%*%Y[,l]/hat.sig2
        hat.b_l <- t(rmvnorm(n=1,mean = mu.B_l,sigma = Sig.B_l))
        hat.B[l,] <- t(hat.b_l)
      }
      Hat.B[,,jin] <- hat.B
      Hat.C[,,jin] <- hat.A%*%t(hat.B)
      #Sampling Sig2
      hat.sig2 <- rinvgamma(n = 1,shape = (q*n+a)/2,rate = 0.5*(b+sum(diag(tcrossprod(Y-X%*%hat.A%*%t(hat.B))))))
      Hat.sig2[jin] <- hat.sig2
      #print(jin)
    }
    hat.C <- apply(Hat.C[,,-(1:burn_in)],c(1,2),mean)
    hat.sig2 <- mean(Hat.sig2[-(1:burn_in)])
    # print(C)
    D.theta <- rep(NA,MC.size-burn_in)
    j=0
    for (jin in (burn_in+1):MC.size) {
      j=j+1
      D.theta[j] <- n*q*log(2*pi*Hat.sig2[jin])+sum(diag(crossprod(Y-X%*%Hat.C[,,jin])))/Hat.sig2[jin]
    }
    D.hat.theta <- n*q*log(2*pi*hat.sig2)+sum(diag(crossprod(Y-X%*%hat.C)))/hat.sig2
    DIC[i,r] <- 2*mean(D.theta)- D.hat.theta
    error[i,r] <- sum((hat.C-C)^2)/(p*q)
  }
  hat.rank[i] <- which.min(DIC[i,])
  TPR.DIC[i] <- hat.rank[i]==true.rank
  TPR.error[i] <- which.min(error[i,])==true.rank
  print(c(i,hat.rank[i]))
  print(Sys.time()-t1)
}

data.frame(hat.rank=mean(hat.rank),TPR.DIC=mean(TPR.DIC),TPR.ERR=mean(TPR.error))
save(list = ls(), file = paste0('DIC-',v,'.RData'))

Reference

Geweke, John F. 1996. “Bayesian Reduced Rank Regression in Econometrics.” Journal of Econometrics 75 (November). https://doi.org/10.1016/0304-4076(95)01773-9.