Model Selection Criteria

Caleb Jin / 2019-04-30

Prerequisites
- Bias-variance tradeoff
- Bias–variance decomposition of mean squared error (MSE):
1. The coefficient of determination (\(R^2\))
- \(R^2\)
- Ajusted \(R^2\)
2. Bayesian information criterion (BIC)
- Laplace’s approximation
- Flat Prior and the Weak Law of Large Numbers
Appendix
- Laplace’s method

Prerequisites

Consider a multiple linear regression model as follows: \[\begin{eqnarray*} {\bf y}={\bf X}{\boldsymbol \beta}+ {\boldsymbol \epsilon}, \end{eqnarray*}\] where \({\bf y}=(y_1,y_2,\dots,y_n)^{{\bf T}}\) is the \(n\)-dimensional response vector, \({\bf X}=({\bf x}_1,{\bf x}_2, \dots, {\bf x}_n)^{{\bf T}}\) is the \(n\times p\) design matrix, and \({\boldsymbol \epsilon}\sim \mathcal{N}_n({\boldsymbol 0},\sigma^2{\bf I}_n)\). We assume that \(p<n\) and \({\bf X}\) is full rank.

By the method of MLE, we have \[\begin{eqnarray*} &&\hat{\boldsymbol \beta}=({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}{\bf y}\\ &&{\hat\sigma}^2 = \frac{RSS}{n}=\frac{||{\bf y}-{\bf X}\hat{\boldsymbol \beta}||^2}{n} = \frac{{\bf y}^{{\bf T}}({\bf I}-{\bf H}){\bf y}}{n} = \frac{{\bf y}^{{\bf T}}{\bf P}{\bf y}}{n}, \end{eqnarray*}\] where \({\bf P}= {\bf I}-{\bf H}; {\bf H}= {\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}\) and \(RSS\) is residuals sum of squares.

Bias-variance tradeoff

According to wiki:

In statistics and machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.

Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.

Bias–variance decomposition of mean squared error (MSE):

We assume \({\bf y}= f(x) + \varepsilon\), where \(\mathbb{E}(\varepsilon)=0\) and \(\text{Var}(\varepsilon)=\sigma^2\). Our goal is to find a function \(\hat f(x)\) that makes MSE of \(\hat f\), \(\mathbb{E}\{({\bf y}-\hat f)^{{\bf T}}({\bf y}-\hat f)\}\), minimum.

The Bias-Variance decomposition of MSE proceeds as follows: \[\begin{eqnarray*} &&\mathbb{E}\{({\bf y}-\hat f)^{{\bf T}}({\bf y}-\hat f)\} = \{\mathbb{E}({\bf y}-\hat f)\}^{{\bf T}}\mathbb{E}({\bf y}-\hat f) + \text{Var}({\bf y}-\hat f)\\ &=&||\text{Bias}(\hat f)||^2 + \text{Var}({\bf y})+\text{Var}(\hat f) - 2\text{cov}({\bf y},\hat f)\\ &=& ||\text{Bias}(\hat f)||^2 +\text{Var}(\hat f) + \sigma^2, \end{eqnarray*}\] where \[\begin{eqnarray*}\text{cov}({\bf y},\hat f) &=& \mathbb{E}({\bf y}\hat f) - \mathbb{E}({\bf y})\mathbb{E}(\hat f)\\ &=& \mathbb{E}[(f+\varepsilon)\hat f] - \mathbb{E}(f+\varepsilon)\mathbb{E}(\hat f)\\ &=& f\mathbb{E}(\hat f) + \mathbb{E}(\varepsilon\hat f) - f\mathbb{E}(\hat f)\\ &=&\mathbb{E}(\varepsilon\hat f)\\ &=&0, \end{eqnarray*}\] since \(\varepsilon \bot \hat f\) or they are independent. (Question : why independent implies \(\varepsilon \bot \hat f\), which implies \(\mathbb{E}(\varepsilon\hat f)=0\)).

Bias-variance tradeoff

models including many covariates leads to have low bias but high variance.
models including few covariates leads to high bias but low variance.

Hence, we need criteria that both take in account model complexity (number of predictors) and quality of fit.

1. The coefficient of determination (\(R^2\))

Summary: it is not a good criterion because \(R^2\) increases with the size of model; in other words, it always choose biggest model.

Interpretation by wiki: It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

\(R^2\)

Denifition: \[\begin{eqnarray} \text{R}^2 = 1-\frac{RSS}{TSS} = 1- \frac{\sum_i(y_i-\hat{f_i})^2}{\sum_i(y_i-\bar y)^2}, \end{eqnarray}\] where TSS is total sum of squares, RSS is residual sum of squares. And define \(\text{ESS} = \sum_i(\hat f - \bar y)^2\) is explained sum of squares, also called the regression sum of square. \(R^2\) is based on the assumption that \(TSS = RSS + ESS\). Under the linear regression model setting satisfies this assumption usually.

Proof:

\[\begin{eqnarray*} &&\sum_i(y_i-\bar y)^2 = \sum(y_i-\hat{f_i}+ \hat{f_i} - \bar y)^2 \\ &=&\sum_i(y_i-\hat{f_i})^2 + \sum_i(\hat{f_i} - \bar y)^2 + 2\sum_i(y_i-\hat{f_i})(\hat{f_i} - \bar y)\\ &=& RSS + ESS + 2\sum_i\hat{e}_i(\hat{f_i} - \bar y) \,(\hat{f_i}=\hat{y_i}={\bf X}\hat{{\boldsymbol \beta}}\enspace\text{in linear model}) \\ &=& RSS + ESS + 2\sum_i\hat{e}_i(\hat{y_i} - \bar y)\\ &=& RSS + ESS + 2\sum_i\hat{e}_i\hat{y_i}-2\bar y\sum_i\hat{e}_i \end{eqnarray*}\] Then, the reamining part is to prove \(\sum_i\hat{e}_i(\hat{y_i} - \bar y)=0\).

Firstly, \(\sum_i\hat{e}_i\hat{y_i} = {\bf e}^{{\bf T}}{\bf H}{\bf y}= {\bf y}^{{\bf T}}({\bf I}-{\bf H}){\bf H}{\bf y}= 0\) due to \({\bf H}\) idempotent. Then if we can show \(\sum_i \hat{e}_i=0\), our proof is done. However, this can not be shown for a model without an intercept.

\(R^2\) in the model with an intercept

To see this, the partial derivative of our normal equation w.r.t \(\beta_0\) is: \[ \frac{\partial ESS}{\partial\hat\beta_0} = \frac{\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)^2}{\partial\hat\beta_0} = -2\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)=0, \] which can be rearranged to \(\sum_iy_i = \sum_i\hat\beta_0+\hat\beta_1\sum_ix_i=\sum_i\hat y_i\). Thus, \(\sum\hat e_i = \sum_iy_i - \sum_i\hat y_i = 0\).

Hence in a model with intecept, we have that \(TSS = RSS + ESS\) of that \(1 = \frac{RSS}{TSS} + \frac{ESS}{TSS}\).

From this \(R^2\) is defined as \(R^2\overset{def}{=}1-\frac{RSS}{TSS}\).

By the above, \(R^2\geq0\)

\(R^2\) in the model without an intercept

\(R^2\overset{def}{=}1-\frac{RSS}{TSS} = \frac{ESS+2\sum_i(y_i-\hat{y_i})(\hat{y_i} - \bar y)}{\sum_i(y_i-\bar y)^2}\). If the second term of numerator is large positive value, then \(R^2\) can be larger than 1 or it is a small negative value, then \(R^2\) can be negative.

Inflation \(R^2\)

\(\max_{{\boldsymbol \beta}}R^2 = \min_{{\boldsymbol \beta}}RSS\). As \(RSS\) is decreasing with increases in the number of regressors, the \(R^2\) will weakly increase with addtional explanatory variables.

Ajusted \(R^2\)¹

Define adusted \(R^2\) as

\(R_a^2 = 1-\frac{RSS/df_e}{TSS/df_t} = 1-\frac{RSS/(n-p-1)}{TSS/(n-1)}\) = \(1-\frac{Var(\hat\sigma^2)}{Var_{tot}}\), where \(Var(\hat\sigma^2)\) is the sample variance of the estimated residuals and \(Var_{tot}\) is sample variance of dependent variable. They can be considered as unbiased estimates of the population variances of the errors and of the dependent variable. Hence, adjusted \(R^2\) can be interpreted as an unbiased estimator of the population \(R^2\).

2. Bayesian information criterion (BIC)

This is the derivation of the Bayesian information criterion (BIC) for the model selection. The main content following refers to the note from Dr. Bhat ².

Laplace’s approximation

Define an index set of the active predictors \({\boldsymbol \gamma}= \{j: \beta_j \neq 0\}\) for \(j = 1, 2, \ldots, p\). So \({\boldsymbol \gamma}\) can be treated as a model we consider. The Bayesian approach to the model selection is to maximize the posterior distribution of a model given the data \(\pi({\boldsymbol \gamma}|{\bf y})\). By Bayes theorem, \(\pi({\boldsymbol \gamma}|{\bf y})\propto f({\bf y}|{\boldsymbol \gamma})\pi({\boldsymbol \gamma})\), where \[ f({\bf y}|{\boldsymbol \gamma}) = \int f({\bf y}|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi({\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})d{\boldsymbol \theta}_{\boldsymbol \gamma}= \int \exp\left[\log\left( f({\bf y}|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi({\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})\right)\right]d{\boldsymbol \theta}_{\boldsymbol \gamma}. \] We expand \(\log\left( f({\bf y}|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi({\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})\right)\) by Taylor expansion about its posterior mode \(\hat {\boldsymbol \theta}_{\boldsymbol \gamma}\) where \(f({\bf y}|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi({\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})\) attains the maximimum. Thus, \[ \log\left( f({\bf y}|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi({\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})\right) \approx \log\left( f({\bf y}|\hat{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi(\hat{\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})\right) + ({\boldsymbol \theta}_{\boldsymbol \gamma}- \hat{\boldsymbol \theta}_{\boldsymbol \gamma})\nabla_{\boldsymbol \theta}Q|_{\hat{\boldsymbol \theta}} + \frac{1}{2}({\boldsymbol \theta}_{\boldsymbol \gamma}- \hat{\boldsymbol \theta}_{\boldsymbol \gamma})^{{\bf T}}{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}({\boldsymbol \theta}_{\boldsymbol \gamma}- \hat{\boldsymbol \theta}_{\boldsymbol \gamma}), \] where \(Q_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}} = \log\left( f({\bf y}|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi({\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})\right)\) and \({\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}\) is a Hessian matrix such that \({\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}} = \frac{\partial^2Q}{\partial{\boldsymbol \theta}^{{\bf T}}\partial{\boldsymbol \theta}}|_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}\). Note that \(\nabla_{\boldsymbol \theta}Q_{\hat{\boldsymbol \theta}}=0\) and \({\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}\) is negative definite at \(\hat{\boldsymbol \theta}_{\boldsymbol \gamma}\).

Therefore, \[\begin{eqnarray*} f({\bf y}|{\boldsymbol \gamma}) &=& \int \exp\left[\log\left( f({\bf y}|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi({\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})\right)\right]d{\boldsymbol \theta}_{\boldsymbol \gamma}\\ &\approx& \int \exp\left\{Q_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}} - \frac{1}{2}({\boldsymbol \theta}_{\boldsymbol \gamma}- \hat{\boldsymbol \theta}_{\boldsymbol \gamma})^{{\bf T}}\tilde{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}({\boldsymbol \theta}_{\boldsymbol \gamma}- \hat{\boldsymbol \theta}_{\boldsymbol \gamma})\right\}d{\boldsymbol \theta}_{\boldsymbol \gamma}\\ &=& \exp(Q_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}) \int \exp\left\{-\frac{1}{2}({\boldsymbol \theta}_{\boldsymbol \gamma}- \hat{\boldsymbol \theta}_{\boldsymbol \gamma})^{{\bf T}}\tilde{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}({\boldsymbol \theta}_{\boldsymbol \gamma}- \hat{\boldsymbol \theta}_{\boldsymbol \gamma})\right\}d{\boldsymbol \theta}_{\boldsymbol \gamma}\\ &=& f({\bf y}|\hat{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi(\hat{\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})\left|2\pi\tilde{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}^{-1}\right|^{\frac{1}{2}}\\ &=& f({\bf y}|\hat{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\pi(\hat{\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma})\frac{(2\pi)^{|{\boldsymbol \gamma}|/2}}{|\tilde{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}|^{1/2}}, \end{eqnarray*}\] where \(\tilde{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}} = -{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}\) is the negative and Hessian matrix and the observed Fisher information matrix. Taking log of it, we obtain

\[\begin{eqnarray*} 2\log f({\bf y}|{\boldsymbol \gamma}) = 2\log f({\bf y}|\hat{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma}) + 2\log \pi(\hat{\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma}) + |{\boldsymbol \gamma}|\log (2\pi) + \log {|\tilde{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}|^{-1}}\tag{1} \end{eqnarray*}\]

Flat Prior and the Weak Law of Large Numbers

If we set a flat prior on the \({\boldsymbol \theta}_{\boldsymbol \gamma}\) such that \(\pi({\boldsymbol \theta}_{\boldsymbol \gamma}) = 1\), then each element in the matrix \(\tilde {\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}\) is \[\begin{eqnarray*} \tilde H_{mn} &=& - \frac{\partial^2\log f({\bf y}|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})}{\partial\theta_m\partial\theta_n}|_{{\boldsymbol \theta}= \hat{\boldsymbol \theta}_{\boldsymbol \gamma}} \\ &=& - \frac{\partial^2\log\left( \prod^{n}_{i=1}f(y_i|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\right)}{\partial\theta_m\partial\theta_n}|_{{\boldsymbol \theta}= \hat{\boldsymbol \theta}_{\boldsymbol \gamma}}\\ &=& - \frac{\partial^2\left( \sum^{n}_{i=1}\log f(y_i|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\right)}{\partial\theta_m\partial\theta_n}|_{{\boldsymbol \theta}= \hat{\boldsymbol \theta}_{\boldsymbol \gamma}} \\ &=& - \frac{\partial^2\left(\frac{1}{n} \sum^{n}_{i=1}n\log f(y_i|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\right)}{\partial\theta_m\partial\theta_n}|_{{\boldsymbol \theta}= \hat{\boldsymbol \theta}_{\boldsymbol \gamma}} \end{eqnarray*}\]

Since \(y_1, y_2, \ldots,y_n\) are i.i.d, according to the weak law of large numbers on the random variable \(z_i = n\log f(y_i|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\), we get

\[ \frac{1}{n} \sum^{n}_{i=1}n\log f(y_i|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})\overset{p}{\rightarrow}E[n\log f(y_i|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})] \] Therefore, each entry in the observed Fisher information matrix is

\[\begin{eqnarray*} \tilde H_{mn} &=& - \frac{\partial^2\left(E[n\log f(y_i|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})]\right)}{\partial\theta_m\partial\theta_n}|_{{\boldsymbol \theta}= \hat{\boldsymbol \theta}_{\boldsymbol \gamma}}\\ &=&-n\frac{\partial^2\left(E[\log f(y_i|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})]\right)}{\partial\theta_m\partial\theta_n}|_{{\boldsymbol \theta}= \hat{\boldsymbol \theta}_{\boldsymbol \gamma}}\\ &=&nI_{mn} \end{eqnarray*}\] for \(i=1,2,\ldots,n\). So \(\tilde{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}} = nI_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}\), where \(I_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}} = \frac{\partial^2E[\log f(y_i|{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})]}{\partial{\boldsymbol \theta}^{{\bf T}}\partial{\boldsymbol \theta}}|_{{\boldsymbol \theta}=\hat{\boldsymbol \theta}}\) is the Fisher information matrix for a single data point \(y_i\). Thus, \(|\tilde{\bf H}_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}| = n^{|{\boldsymbol \gamma}|}|I_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}|\). We plug it to the (1) and get \[\begin{eqnarray*} 2\log f({\bf y}|{\boldsymbol \gamma}) = 2\log f({\bf y}|\hat{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma}) + 2\log \pi(\hat{\boldsymbol \theta}_{\boldsymbol \gamma}|{\boldsymbol \gamma}) + |{\boldsymbol \gamma}|\log (2\pi) -|{\boldsymbol \gamma}|\log n - \log|I_{\hat{\boldsymbol \theta}_{\boldsymbol \gamma}}|. \end{eqnarray*}\] For large \(n\), the item without \(n\) can be negleted, and we obtain a simplified form after taking a negative sign: \[\begin{eqnarray*} -2\log f({\bf y}|{\boldsymbol \gamma}) \approx -2\log f({\bf y}|\hat{\boldsymbol \theta}_{\boldsymbol \gamma},{\boldsymbol \gamma})+|{\boldsymbol \gamma}|\log n. \tag{2} \end{eqnarray*}\] The right-hand side of (2) is the BIC estimate for the model \({\boldsymbol \gamma}\).

Appendix

Laplace’s method

\[ \int_a^b e^{Mf(x)}\approx \sqrt{\frac{2\pi}{M|f^{''}(x_0)|}}e^{Mf(x_0)} \text{ as } M\rightarrow \infty, \] where \(f(x)\) is twice-differentiable function with a global maximum at \(x_0\), \(M\) is a large number, and the endpoints \(a\) and \(b\) could possibly be infinite.

General theory of Laplace’s method

Suppose \(f(x)\) function is twice-differentiable and has a unique global maximum at \(x_0\), so that we have \(f^{'}(x_0)=0\) and \(f^{''}(x_0)<0\). By Talor’s theorem,

\[\begin{eqnarray*} f(x) &=& f(x_0) + f^{'}(x_0)(x-x_0) + \frac{1}{2}f^{''}(x_0)(x-x_0)^2 + O((x-x_0)^3) \\ &\approx& f(x_0) - \frac{1}{2}|f^{''}(x_0)|(x-x_0)^2 \end{eqnarray*}\] The assumptions ensure the accuracy of the approximation

\[\begin{eqnarray*} \int_a^b e^{Mf(x)} \approx \int_a^b e^{Mf(x_0)}e^{-\frac{1}{2}M|f^{''}(x_0)|(x-x_0)^2 }dx =\sqrt{\frac{2\pi}{M|f^{''}(x_0)|}}e^{Mf(x_0)} \text{ as } M\rightarrow \infty, \end{eqnarray*}\] The second integral is a Gaussian integral if the limits of integration go from \(-\infty\) to \(\infty\) (which can be assumed because the exponential decays very fast away from \(x_0\)), and thus it can be calculated.

Prerequisites

Bias-variance tradeoff

Bias–variance decomposition of mean squared error (MSE):

1. The coefficient of determination (\(R^2\))

\(R^2\)

\(R^2\) in the model with an intercept

\(R^2\) in the model without an intercept

Inflation \(R^2\)

Ajusted \(R^2\)1

2. Bayesian information criterion (BIC)

Laplace’s approximation

Flat Prior and the Weak Law of Large Numbers

Appendix

Laplace’s method

General theory of Laplace’s method

Ajusted \(R^2\)¹