# Model Selection Criteria

### Caleb Jin / 2019-04-30

## Prerequisites

Consider a multiple linear regression model as follows: \[\begin{eqnarray*} {\bf y}={\bf X}{\boldsymbol \beta}+ {\boldsymbol \epsilon}, \end{eqnarray*}\] where \({\bf y}=(y_1,y_2,\dots,y_n)^{{\bf T}}\) is the \(n\)-dimensional response vector, \({\bf X}=({\bf x}_1,{\bf x}_2, \dots, {\bf x}_n)^{{\bf T}}\) is the \(n\times p\) design matrix, and \({\boldsymbol \epsilon}\sim \mathcal{N}_n({\boldsymbol 0},\sigma^2{\bf I}_n)\). We assume that \(p<n\) and \({\bf X}\) is full rank.

By the method of MLE, we have \[\begin{eqnarray*} &&\hat{\boldsymbol \beta}=({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}{\bf y}\\ &&{\hat\sigma}^2 = \frac{RSS}{n}=\frac{||{\bf y}-{\bf X}\hat{\boldsymbol \beta}||^2}{n} = \frac{{\bf y}^{{\bf T}}({\bf I}-{\bf H}){\bf y}}{n} = \frac{{\bf y}^{{\bf T}}{\bf P}{\bf y}}{n}, \end{eqnarray*}\] where \({\bf P}= {\bf I}-{\bf H}\), \({\bf H}= {\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}\) is the hat matrix, and \(RSS\) is the residual sum of squares.
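These closed-form estimators are easy to check numerically. The sketch below uses plain NumPy with made-up simulated data (the dimensions and coefficients are illustrative, not from the text) to compute \(\hat{\boldsymbol\beta}\), the hat matrix \({\bf H}\), and \(\hat\sigma^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Simulated data from y = X beta + eps (beta is a made-up example)
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

# MLE: beta_hat = (X'X)^{-1} X'y, sigma2_hat = RSS / n
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'
P = np.eye(n) - H                       # residual maker P = I - H
rss = y @ P @ y                         # y'Py = ||y - X beta_hat||^2
sigma2_hat = rss / n
```

Because \({\bf P}\) is idempotent, `y @ P @ y` agrees with the squared residual norm computed directly from `beta_hat`.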

### Bias-variance tradeoff

According to Wikipedia:

> In statistics and machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.
>
> Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.

### Bias–variance decomposition of mean squared error (MSE)

We assume \({\bf y}= f(x) + \varepsilon\), where \(\mathbb{E}(\varepsilon)=0\) and \(\text{Var}(\varepsilon)=\sigma^2\). Our goal is to find a function \(\hat f(x)\) that minimizes the MSE of \(\hat f\), namely \(\mathbb{E}\{({\bf y}-\hat f)^{{\bf T}}({\bf y}-\hat f)\}\).

The Bias-Variance decomposition of MSE proceeds as follows:
\[\begin{eqnarray*}
&&\mathbb{E}\{({\bf y}-\hat f)^{{\bf T}}({\bf y}-\hat f)\} = \{\mathbb{E}({\bf y}-\hat f)\}^{{\bf T}}\mathbb{E}({\bf y}-\hat f) + \text{Var}({\bf y}-\hat f)\\
&=&||\text{Bias}(\hat f)||^2 + \text{Var}({\bf y})+\text{Var}(\hat f) - 2\text{cov}({\bf y},\hat f)\\
&=& ||\text{Bias}(\hat f)||^2 +\text{Var}(\hat f) + \sigma^2,
\end{eqnarray*}\]
where
\[\begin{eqnarray*}\text{cov}({\bf y},\hat f)
&=& \mathbb{E}({\bf y}\hat f) - \mathbb{E}({\bf y})\mathbb{E}(\hat f)\\
&=& \mathbb{E}[(f+\varepsilon)\hat f] - \mathbb{E}(f+\varepsilon)\mathbb{E}(\hat f)\\
&=& f\mathbb{E}(\hat f) + \mathbb{E}(\varepsilon\hat f) - f\mathbb{E}(\hat f)\\
&=&\mathbb{E}(\varepsilon\hat f)\\
&=&0,
\end{eqnarray*}\]
since \(\hat f\) is constructed from the training data while \(\varepsilon\) is the fresh noise on a new observation, the two are independent. Independence answers the question of why \(\mathbb{E}(\varepsilon\hat f)=0\): it implies \(\mathbb{E}(\varepsilon\hat f)=\mathbb{E}(\varepsilon)\mathbb{E}(\hat f)=0\cdot\mathbb{E}(\hat f)=0\), i.e. \(\varepsilon \bot \hat f\) in the \(L^2\) sense (they are uncorrelated).

**Bias-variance tradeoff**:

- Models with many covariates tend to have low bias but high variance.
- Models with few covariates tend to have high bias but low variance.

Hence, we need criteria that take into account both model complexity (the number of predictors) and quality of fit.
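A small simulation can illustrate the tradeoff. The sketch below (an assumed toy setup, not from the text) repeatedly fits degree-1 and degree-10 polynomials to noisy samples of a smooth curve and estimates the squared bias and variance of the fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
f = np.sin(np.pi * x)      # illustrative choice of true regression function
sigma = 0.3                # noise standard deviation

def fit_predict(degree, reps=500):
    """Fit a degree-`degree` polynomial to fresh noisy samples; return all predictions."""
    preds = np.empty((reps, x.size))
    for r in range(reps):
        y = f + rng.normal(scale=sigma, size=x.size)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x)
    return preds

results = {}
for degree in (1, 10):
    preds = fit_predict(degree)
    bias2 = np.mean((preds.mean(axis=0) - f) ** 2)  # squared bias, averaged over x
    var = np.mean(preds.var(axis=0))                # variance, averaged over x
    results[degree] = (bias2, var)
```

The simple linear fit shows large squared bias but small variance; the degree-10 fit shows the reverse, matching the two bullet points above.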

## The coefficient of determination (\(R^2\))

**Summary**: \(R^2\) is not a good criterion because it increases with the size of the model; in other words, it always chooses the biggest model.

**Interpretation** from Wikipedia:

> It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. **It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.**

### \(R^2\)

**Definition**:
\[\begin{eqnarray}
\text{R}^2 = 1-\frac{RSS}{TSS} = 1- \frac{\sum_i(y_i-\hat{f_i})^2}{\sum_i(y_i-\bar y)^2},
\end{eqnarray}\]
where \(TSS\) is the total sum of squares and \(RSS\) is the residual sum of squares. Define \(\text{ESS} = \sum_i(\hat f_i - \bar y)^2\), the explained sum of squares, also called the regression sum of squares. \(R^2\) relies on the identity \(TSS = RSS + ESS\), which holds in the usual linear regression setting with an intercept.

**Proof**:

\[\begin{eqnarray*} &&\sum_i(y_i-\bar y)^2 = \sum_i(y_i-\hat{f_i}+ \hat{f_i} - \bar y)^2 \\ &=&\sum_i(y_i-\hat{f_i})^2 + \sum_i(\hat{f_i} - \bar y)^2 + 2\sum_i(y_i-\hat{f_i})(\hat{f_i} - \bar y)\\ &=& RSS + ESS + 2\sum_i\hat{e}_i(\hat{f_i} - \bar y) \quad (\hat{f_i}=\hat{y_i}={\bf x}_i^{{\bf T}}\hat{{\boldsymbol \beta}}\enspace\text{in the linear model}) \\ &=& RSS + ESS + 2\sum_i\hat{e}_i(\hat{y_i} - \bar y)\\ &=& RSS + ESS + 2\sum_i\hat{e}_i\hat{y_i}-2\bar y\sum_i\hat{e}_i. \end{eqnarray*}\] Then, the remaining part is to prove \(\sum_i\hat{e}_i(\hat{y_i} - \bar y)=0\).

Firstly, \(\sum_i\hat{e}_i\hat{y_i} = \hat{\bf e}^{{\bf T}}{\bf H}{\bf y}= {\bf y}^{{\bf T}}({\bf I}-{\bf H}){\bf H}{\bf y}= 0\), since \({\bf H}\) is idempotent.
Then if we can show \(\sum_i \hat{e}_i=0\), our proof is done.
**However, this can not be shown for a model without an intercept.**

#### \(R^2\) in the model with an intercept

To see this, set the partial derivative of \(RSS\) with respect to \(\hat\beta_0\) (the normal equation for the intercept) to zero: \[ \frac{\partial RSS}{\partial\hat\beta_0} = \frac{\partial}{\partial\hat\beta_0}\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)^2 = -2\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)=0, \] which can be rearranged to \(\sum_iy_i = n\hat\beta_0+\hat\beta_1\sum_ix_i=\sum_i\hat y_i\). Thus, \(\sum_i\hat e_i = \sum_iy_i - \sum_i\hat y_i = 0\).

Hence, in a model with an intercept, we have \(TSS = RSS + ESS\), so that \(1 = \frac{RSS}{TSS} + \frac{ESS}{TSS}\).

From this \(R^2\) is defined as \(R^2\overset{def}{=}1-\frac{RSS}{TSS}\).

By the above, since \(RSS\) and \(ESS\) are both nonnegative and \(TSS = RSS + ESS\), we have \(0\leq R^2\leq 1\).
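A quick NumPy check on simulated toy data (the data-generating values are made up for illustration) confirms that with an intercept the residuals sum to zero, \(TSS = RSS + ESS\) holds, and \(R^2\) lands in \([0,1]\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)   # toy data with a true intercept

# Fit by least squares with an intercept column
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
r2 = 1 - rss / tss
```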

#### \(R^2\) in the model without an intercept

\(R^2\overset{def}{=}1-\frac{RSS}{TSS} = \frac{ESS+2\sum_i(y_i-\hat{y_i})(\hat{y_i} - \bar y)}{\sum_i(y_i-\bar y)^2}\). If the cross term in the numerator is a large positive value, \(R^2\) can exceed 1; if it is negative enough, \(R^2\) can be negative.
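The negative case is easy to reproduce with a minimal sketch: regress data with a large mean through the origin (all values below are assumed toy choices), and \(1 - RSS/TSS\) falls well below zero because \(RSS\) exceeds \(TSS\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)
y = 5.0 + 0.1 * x + rng.normal(scale=0.5, size=n)  # large intercept, weak slope

# Regression through the origin: y ~ x * b, no intercept column
b = (x @ y) / (x @ x)
y_hat = b * x

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)   # huge: the fit cannot absorb the mean of y
r2 = 1 - rss / tss               # negative here
```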

#### Inflation \(R^2\)

Maximizing \(R^2\) is equivalent to minimizing \(RSS\): \(\max_{{\boldsymbol \beta}}R^2 = \min_{{\boldsymbol \beta}}RSS\). As \(RSS\) is non-increasing in the number of regressors, \(R^2\) will weakly increase with additional explanatory variables.
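This monotonicity is easy to demonstrate: regressing a response on successively more pure-noise columns (a toy setup, with names chosen for illustration) yields a non-decreasing \(R^2\) path even though the predictors carry no information about \(y\).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
y = rng.normal(size=n)          # response unrelated to the regressors
Z = rng.normal(size=(n, 10))    # pure-noise candidate predictors

def r_squared(X, y):
    """Plain R^2 of a least-squares fit that always includes an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    y_hat = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# R^2 after including the first k noise columns, k = 1, ..., 10
r2_path = [r_squared(Z[:, :k], y) for k in range(1, 11)]
# Each added column can only leave RSS the same or smaller,
# so this sequence is non-decreasing.
```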

### Adjusted \(R^2\)

Define the adjusted \(R^2\) as

\(R_a^2 = 1-\frac{RSS/df_e}{TSS/df_t} = 1-\frac{RSS/(n-p-1)}{TSS/(n-1)} = 1-\frac{\hat\sigma_e^2}{\hat\sigma_y^2}\), where \(\hat\sigma_e^2\) is the sample variance of the estimated residuals (on \(n-p-1\) degrees of freedom) and \(\hat\sigma_y^2\) is the sample variance of the dependent variable. These can be considered unbiased estimates of the population variances of the errors and of the dependent variable, so the adjusted \(R^2\) can be interpreted as a less biased estimator of the population \(R^2\).
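A sketch comparing the two criteria on simulated data (one real predictor plus irrelevant noise columns; all names and values are assumed for illustration). Plain \(R^2\) always favors the bigger nested model, while the adjusted version applies a degrees-of-freedom penalty and so never exceeds plain \(R^2\):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 80
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=(n, 20))   # irrelevant extra predictors

def fit_stats(X, y):
    """Return (R^2, adjusted R^2) for a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    y_hat = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    p = X1.shape[1] - 1             # predictors, excluding the intercept
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (len(y) - p - 1)) / (tss / (len(y) - 1))
    return r2, r2_adj

r2_small, adj_small = fit_stats(x.reshape(-1, 1), y)
r2_big, adj_big = fit_stats(np.column_stack([x.reshape(-1, 1), noise]), y)
```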