Shiqiang Jin

Model Selection Criteria

Caleb Jin / 2019-04-30


Prerequisites

Consider a multiple linear regression model as follows: \[\begin{eqnarray*} {\bf y}={\bf X}{\boldsymbol \beta}+ {\boldsymbol \epsilon}, \end{eqnarray*}\] where \({\bf y}=(y_1,y_2,\dots,y_n)^{{\bf T}}\) is the \(n\)-dimensional response vector, \({\bf X}=({\bf x}_1,{\bf x}_2, \dots, {\bf x}_n)^{{\bf T}}\) is the \(n\times p\) design matrix, and \({\boldsymbol \epsilon}\sim \mathcal{N}_n({\boldsymbol 0},\sigma^2{\bf I}_n)\). We assume that \(p<n\) and \({\bf X}\) is full rank.

By the method of MLE, we have \[\begin{eqnarray*} &&\hat{\boldsymbol \beta}=({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}{\bf y}\\ &&{\hat\sigma}^2 = \frac{RSS}{n}=\frac{||{\bf y}-{\bf X}\hat{\boldsymbol \beta}||^2}{n} = \frac{{\bf y}^{{\bf T}}({\bf I}-{\bf H}){\bf y}}{n} = \frac{{\bf y}^{{\bf T}}{\bf P}{\bf y}}{n}, \end{eqnarray*}\] where \({\bf P}= {\bf I}-{\bf H}\), \({\bf H}= {\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}\) is the hat matrix, and \(RSS\) is the residual sum of squares.
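As a quick numerical check, here is a minimal numpy sketch (the data are simulated and the true coefficients are arbitrary choices, illustrative only) that computes \(\hat{\boldsymbol \beta}\) and \({\hat\sigma}^2\) from the closed forms above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                            # n observations, p predictors (p < n)
X = rng.normal(size=(n, p))              # full-rank design matrix (simulated)
beta_true = np.array([1.5, -2.0, 0.5])   # illustrative true coefficients
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# MLE: beta_hat = (X'X)^{-1} X'y and sigma2_hat = RSS / n
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / n   # the MLE divides by n, not the unbiased n - p
print(beta_hat, sigma2_hat)
```

Note that the MLE of \(\sigma^2\) divides by \(n\); the unbiased estimate would divide by \(n-p\).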

Bias-variance tradeoff

According to Wikipedia:

In statistics and machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.

Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.

Bias–variance decomposition of mean squared error (MSE):

We assume \({\bf y}= f(x) + \varepsilon\), where \(\mathbb{E}(\varepsilon)=0\) and \(\text{Var}(\varepsilon)=\sigma^2\). Our goal is to find a function \(\hat f(x)\) that minimizes the MSE of \(\hat f\), \(\mathbb{E}\{({\bf y}-\hat f)^{{\bf T}}({\bf y}-\hat f)\}\).

The Bias-Variance decomposition of MSE proceeds as follows: \[\begin{eqnarray*} &&\mathbb{E}\{({\bf y}-\hat f)^{{\bf T}}({\bf y}-\hat f)\} = \{\mathbb{E}({\bf y}-\hat f)\}^{{\bf T}}\mathbb{E}({\bf y}-\hat f) + \text{Var}({\bf y}-\hat f)\\ &=&||\text{Bias}(\hat f)||^2 + \text{Var}({\bf y})+\text{Var}(\hat f) - 2\text{cov}({\bf y},\hat f)\\ &=& ||\text{Bias}(\hat f)||^2 +\text{Var}(\hat f) + \sigma^2, \end{eqnarray*}\] where \[\begin{eqnarray*}\text{cov}({\bf y},\hat f) &=& \mathbb{E}({\bf y}\hat f) - \mathbb{E}({\bf y})\mathbb{E}(\hat f)\\ &=& \mathbb{E}[(f+\varepsilon)\hat f] - \mathbb{E}(f+\varepsilon)\mathbb{E}(\hat f)\\ &=& f\mathbb{E}(\hat f) + \mathbb{E}(\varepsilon\hat f) - f\mathbb{E}(\hat f)\\ &=&\mathbb{E}(\varepsilon\hat f)\\ &=&0. \end{eqnarray*}\] The last step holds because \(\hat f\) is built from the training data, while \(\varepsilon\) is the noise of the new observation \({\bf y}\) and is independent of that training data; independence gives \(\mathbb{E}(\varepsilon\hat f)=\mathbb{E}(\varepsilon)\mathbb{E}(\hat f)=0\), which is exactly the orthogonality \(\varepsilon \bot \hat f\).
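To make the decomposition concrete, here is a small Monte Carlo sketch (all data simulated; the true function, noise level, and test point are arbitrary choices) that estimates the squared bias, variance, and \(\sigma^2\) of a deliberately underfit straight-line model at a single test point whose noise is independent of the training sample, and checks that their sum matches the test MSE:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)     # true regression function (arbitrary choice)
sigma = 0.3                             # noise standard deviation
x0, n, reps = 0.8, 30, 5000             # test point, training size, Monte Carlo reps

fhat_x0 = np.empty(reps)
test_sq_err = np.empty(reps)
for r in range(reps):
    x = rng.uniform(size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    slope, intercept = np.polyfit(x, y, deg=1)    # underfit: a straight line
    fhat_x0[r] = intercept + slope * x0
    y0 = f(x0) + rng.normal(scale=sigma)          # a fresh test observation at x0
    test_sq_err[r] = (y0 - fhat_x0[r]) ** 2

bias_sq = (fhat_x0.mean() - f(x0)) ** 2
variance = fhat_x0.var()
print(bias_sq + variance + sigma**2)    # bias^2 + variance + sigma^2
print(test_sq_err.mean())               # Monte Carlo estimate of the test MSE
```

The two printed numbers should agree up to Monte Carlo error.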

Bias-variance tradeoff

  • Models including many covariates tend to have low bias but high variance.
  • Models including few covariates tend to have high bias but low variance.

Hence, we need criteria that take into account both model complexity (the number of predictors) and the quality of fit.

The coefficient of determination (\(R^2\))

Summary: \(R^2\) is not a good model selection criterion because it increases with the size of the model; in other words, it always chooses the biggest model.

Interpretation from Wikipedia: \(R^2\) is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

\(R^2\)

Definition: \[\begin{eqnarray} \text{R}^2 = 1-\frac{RSS}{TSS} = 1- \frac{\sum_i(y_i-\hat{f_i})^2}{\sum_i(y_i-\bar y)^2}, \end{eqnarray}\] where TSS is the total sum of squares and RSS is the residual sum of squares. Define \(\text{ESS} = \sum_i(\hat{f_i} - \bar y)^2\) as the explained sum of squares, also called the regression sum of squares. The interpretation of \(R^2\) as the proportion of variation explained rests on the identity \(TSS = RSS + ESS\), which holds in the usual linear regression setting with an intercept, as shown below.
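Before proving the identity, here is a quick numpy sketch (simulated data, illustrative only) that computes \(R^2\) directly from the definition:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)        # simulated data with an intercept

X = np.column_stack([np.ones(n), x])          # design matrix: intercept + one predictor
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

rss = np.sum((y - y_hat) ** 2)                # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
print(1 - rss / tss)                          # R^2
```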

Proof:

\[\begin{eqnarray*} &&\sum_i(y_i-\bar y)^2 = \sum_i(y_i-\hat{f_i}+ \hat{f_i} - \bar y)^2 \\ &=&\sum_i(y_i-\hat{f_i})^2 + \sum_i(\hat{f_i} - \bar y)^2 + 2\sum_i(y_i-\hat{f_i})(\hat{f_i} - \bar y)\\ &=& RSS + ESS + 2\sum_i\hat{e}_i(\hat{f_i} - \bar y) \quad(\hat{f_i}=\hat{y_i}={\bf x}_i^{{\bf T}}\hat{{\boldsymbol \beta}}\enspace\text{in the linear model}) \\ &=& RSS + ESS + 2\sum_i\hat{e}_i(\hat{y_i} - \bar y)\\ &=& RSS + ESS + 2\sum_i\hat{e}_i\hat{y_i}-2\bar y\sum_i\hat{e}_i. \end{eqnarray*}\] Then, the remaining part is to prove \(\sum_i\hat{e}_i(\hat{y_i} - \bar y)=0\).

Firstly, \(\sum_i\hat{e}_i\hat{y_i} = \hat{{\bf e}}^{{\bf T}}\hat{{\bf y}} = {\bf y}^{{\bf T}}({\bf I}-{\bf H}){\bf H}{\bf y}= 0\) because \({\bf H}\) is symmetric and idempotent. Then, if we can also show that \(\sum_i \hat{e}_i=0\), the proof is done. However, this does not hold in general for a model without an intercept.

\(R^2\) in the model with an intercept

To see this, consider the simple linear regression \(y_i=\beta_0+x_i\beta_1+\varepsilon_i\). Setting the partial derivative of RSS with respect to \(\hat\beta_0\) to zero (the normal equation for the intercept) gives: \[ \frac{\partial RSS}{\partial\hat\beta_0} = \frac{\partial\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)^2}{\partial\hat\beta_0} = -2\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)=0, \] which can be rearranged to \(\sum_iy_i = n\hat\beta_0+\hat\beta_1\sum_ix_i=\sum_i\hat y_i\). Thus, \(\sum_i\hat e_i = \sum_iy_i - \sum_i\hat y_i = 0\). (The same argument applies to the general model, since the normal equations \({\bf X}^{{\bf T}}({\bf y}-{\bf X}\hat{\boldsymbol \beta})={\boldsymbol 0}\) include \({\bf 1}^{{\bf T}}\hat{{\bf e}}=0\) whenever a column of \({\bf X}\) is \({\bf 1}\).)

Hence, in a model with an intercept, we have \(TSS = RSS + ESS\), so that \(1 = \frac{RSS}{TSS} + \frac{ESS}{TSS}\).

From this, \(R^2\) is defined as \(R^2\overset{def}{=}1-\frac{RSS}{TSS} = \frac{ESS}{TSS}\).

By the above, since ESS and RSS are both nonnegative and sum to TSS, we have \(0\leq R^2\leq 1\).
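A quick numerical confirmation (simulated data, illustrative only) that, with an intercept column in \({\bf X}\), the residuals sum to zero and \(TSS = RSS + ESS\):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])              # first column of ones = intercept
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
e_hat = y - y_hat                                 # residuals

print(e_hat.sum())                                # ~0 up to floating-point error
rss = np.sum(e_hat ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(rss + ess, tss)                             # equal up to rounding: TSS = RSS + ESS
```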

\(R^2\) in the model without an intercept

\(R^2\overset{def}{=}1-\frac{RSS}{TSS} = \frac{ESS+2\sum_i(y_i-\hat{y_i})(\hat{y_i} - \bar y)}{\sum_i(y_i-\bar y)^2}\). Without an intercept the cross term \(2\sum_i\hat{e}_i(\hat{y_i} - \bar y)\) need not vanish, so \(1-\frac{RSS}{TSS}\) and \(\frac{ESS}{TSS}\) no longer agree. If the cross term is a large negative value, then \(RSS\) can exceed \(TSS\), making \(1-\frac{RSS}{TSS}\) negative (the fit is worse than simply predicting \(\bar y\)), while \(\frac{ESS}{TSS}\) can exceed 1.
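Two tiny hand-picked datasets (chosen only to make the point obvious; the helper no_intercept_fit is ad hoc, not a standard routine) illustrate what can go wrong when the line is forced through the origin:

```python
import numpy as np

def no_intercept_fit(x, y):
    """OLS through the origin (y ~ beta * x); returns (1 - RSS/TSS, ESS/TSS)."""
    beta = (x @ y) / (x @ x)
    y_hat = beta * x
    rss = np.sum((y - y_hat) ** 2)
    ess = np.sum((y_hat - y.mean()) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss, ess / tss

# RSS > TSS here, so 1 - RSS/TSS is negative (-0.6) while ESS/TSS is 0.2
print(no_intercept_fit(np.array([1.0, 2.0]), np.array([2.0, 0.0])))
# Here 1 - RSS/TSS is 0.0 while ESS/TSS is 2.0, i.e. greater than 1
print(no_intercept_fit(np.array([1.0, -1.0]), np.array([2.0, 0.0])))
```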

Inflation of \(R^2\)

Maximizing \(R^2\) over \({\boldsymbol \beta}\) is equivalent to minimizing \(RSS\), i.e. \(\max_{{\boldsymbol \beta}}R^2 \Leftrightarrow \min_{{\boldsymbol \beta}}RSS\). Since \(RSS\) can only decrease (or stay the same) when a regressor is added, \(R^2\) weakly increases with additional explanatory variables.
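A short simulation sketch (pure-noise regressors appended to a simulated dataset, illustrative only) showing that \(R^2\) never decreases as columns are added:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
tss = np.sum((y - y.mean()) ** 2)
for _ in range(10):                           # keep appending pure-noise regressors
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - y_hat) ** 2)
    print(X.shape[1] - 1, 1 - rss / tss)      # (number of regressors, R^2): R^2 never drops
    X = np.column_stack([X, rng.normal(size=n)])
```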

Adjusted \(R^2\) [1]

Define the adjusted \(R^2\) as

\(R_a^2 = 1-\frac{RSS/df_e}{TSS/df_t} = 1-\frac{RSS/(n-p-1)}{TSS/(n-1)} = 1-\frac{\hat\sigma_e^2}{\hat\sigma_y^2}\), where \(\hat\sigma_e^2 = RSS/(n-p-1)\) is the variance estimate based on the estimated residuals and \(\hat\sigma_y^2 = TSS/(n-1)\) is the sample variance of the dependent variable. They can be considered unbiased estimates of the population variances of the errors and of the dependent variable, respectively. Hence, adjusted \(R^2\) can be interpreted as a less biased estimator of the population \(R^2\) than the ordinary \(R^2\). Unlike \(R^2\), it does not automatically increase when a regressor is added: the decrease in \(RSS\) is penalized by the loss of a residual degree of freedom in \(df_e = n-p-1\).
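Extending the earlier sketch (same simulated setup, illustrative only) to also report adjusted \(R^2\): as pure-noise regressors are appended, \(R^2\) keeps creeping up while \(R_a^2\) tends to fall.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
tss = np.sum((y - y.mean()) ** 2)
for _ in range(10):
    p = X.shape[1] - 1                                    # predictors, excluding the intercept
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - y_hat) ** 2)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))    # R_a^2 from the formula above
    print(p, round(r2, 4), round(adj_r2, 4))
    X = np.column_stack([X, rng.normal(size=n)])          # append a pure-noise regressor
```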


  [1] https://en.wikipedia.org/wiki/Coefficient_of_determination#Extensions