Model Selection Criteria
Caleb Jin / 2019-04-30
Prerequisites
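Consider a multiple linear regression model as follows:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},$$
where $\mathbf{y}$ is the $n$-dimensional response vector, $\mathbf{X}$ is the $n \times p$ design matrix, and $\boldsymbol{\epsilon} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}_n)$. We assume that $p < n$ and $\mathbf{X}$ is full rank.
By the method of MLE, we have
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}, \qquad \hat{\sigma}^2 = \frac{RSS}{n},$$
where $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ and $RSS = \|\mathbf{y} - \hat{\mathbf{y}}\|^2$ is the residual sum of squares.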
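As a quick numerical check of these estimators, here is a minimal Python sketch, assuming NumPy and simulated data. The sample size, coefficients, noise level, and variable names are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3                                  # hypothetical sample size and number of columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column is the intercept
beta_true = np.array([1.0, 2.0, -0.5])         # assumed coefficients, for illustration only
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Under Gaussian errors, the MLE of beta is the least-squares solution.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
rss = float(np.sum((y - y_hat) ** 2))

sigma2_mle = rss / n                           # MLE of sigma^2 (biased downward)
sigma2_unbiased = rss / (n - p)                # the usual unbiased alternative

print("beta_hat:", beta_hat)
print("sigma2 (MLE):", sigma2_mle, " sigma2 (unbiased):", sigma2_unbiased)
```

The MLE $RSS/n$ is always slightly smaller than $RSS/(n-p)$; the gap shrinks as $n$ grows relative to $p$.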
Bias-variance tradeoff
According to Wikipedia:
In statistics and machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.
Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.
Bias–variance decomposition of mean squared error (MSE):
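We assume $y = f(x) + \epsilon$, where $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$. Our goal is to find a function $\hat{f}(x)$ that makes the MSE of $\hat{f}(x)$, $E[(y - \hat{f}(x))^2]$, minimum.
The Bias-Variance decomposition of MSE proceeds as follows (writing $f = f(x)$ and $\hat{f} = \hat{f}(x)$):
$$\begin{aligned}
E[(y - \hat{f})^2] &= E[(f + \epsilon - \hat{f})^2] \\
&= E[(f - \hat{f})^2] + 2E[\epsilon (f - \hat{f})] + E(\epsilon^2) \\
&= E[(f - E\hat{f} + E\hat{f} - \hat{f})^2] + \sigma^2 \\
&= (f - E\hat{f})^2 + E[(\hat{f} - E\hat{f})^2] + \sigma^2 \\
&= \mathrm{Bias}(\hat{f})^2 + \mathrm{Var}(\hat{f}) + \sigma^2,
\end{aligned}$$
where $E[\epsilon (f - \hat{f})] = 0$ since $E(\epsilon) = 0$ and $\epsilon$ and $\hat{f}$ are independent, and the cross term $2(f - E\hat{f})\,E(E\hat{f} - \hat{f})$ in the fourth line vanishes. (Question: why does independence imply $\mathrm{Cov}(\epsilon, \hat{f}) = 0$, which in turn implies $E(\epsilon \hat{f}) = E(\epsilon)E(\hat{f}) = 0$?)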
Bias-variance tradeoff
- Models that include many covariates tend to have low bias but high variance.
- Models that include few covariates tend to have high bias but low variance.
Hence, we need criteria that take into account both model complexity (number of predictors) and quality of fit.
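The decomposition and the tradeoff can be checked by simulation. The sketch below is only an illustration under assumed settings: the true function $\sin(2\pi x)$, the noise level, the test point `x0`, and the polynomial degrees are all hypothetical choices. It refits polynomials of different degree on many simulated training sets and estimates the squared bias and variance of the prediction at one point.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical true regression function, chosen for illustration.
    return np.sin(2 * np.pi * x)

sigma = 0.3                          # assumed noise standard deviation
x_train = np.linspace(0.0, 1.0, 30)  # fixed design points
x0 = 0.25                            # test point where predictions are compared
n_rep = 2000                         # number of simulated training sets

def decompose(degree):
    """Monte Carlo estimates of bias^2, variance, and E[(f - f_hat)^2] at x0."""
    preds = np.empty(n_rep)
    for r in range(n_rep):
        y_train = f(x_train) + rng.normal(scale=sigma, size=x_train.size)
        fit = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        preds[r] = fit(x0)
    bias2 = (f(x0) - preds.mean()) ** 2
    var = preds.var()                         # population-style variance (ddof=0)
    mse_f = np.mean((f(x0) - preds) ** 2)     # equals bias2 + var up to rounding
    return bias2, var, mse_f

results = {d: decompose(d) for d in (1, 3, 9)}
for d, (bias2, var, mse_f) in results.items():
    # The MSE for a new response y0 = f(x0) + eps adds the irreducible sigma^2.
    print(f"degree {d}: bias^2={bias2:.4f}, variance={var:.4f}, "
          f"expected MSE for new y={mse_f + sigma**2:.4f}")
```

In this setup the low-degree fit should show the larger squared bias and the high-degree fit the larger variance, which is the pattern the two bullet points describe.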
1. The coefficient of determination ($R^2$)
Summary: $R^2$ is not a good model selection criterion because it never decreases as the size of the model increases; in other words, it always chooses the biggest model.
Interpretation by wiki: It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.
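Definition:
$$R^2 = 1 - \frac{RSS}{TSS},$$
where $TSS = \sum_{i=1}^n (y_i - \bar{y})^2$ is the total sum of squares and $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the residual sum of squares. Also define $ESS = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$, the explained sum of squares, also called the regression sum of squares. The equivalent form $R^2 = ESS/TSS$ is based on the assumption that $TSS = ESS + RSS$. The linear regression model usually satisfies this assumption (specifically, whenever it includes an intercept).
Proof:
$$\begin{aligned}
TSS &= \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\
&= RSS + ESS + 2\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}).
\end{aligned}$$
Then, the remaining part is to prove $\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 0$.
Firstly, $\sum_{i=1}^n (y_i - \hat{y}_i)\hat{y}_i = (\mathbf{y} - \hat{\mathbf{y}})^\top \hat{\mathbf{y}} = \mathbf{y}^\top (\mathbf{I} - \mathbf{H})\mathbf{H}\mathbf{y} = 0$, because the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$ is idempotent. Then if we can show $\sum_{i=1}^n (y_i - \hat{y}_i) = 0$, our proof is done. However, this cannot be shown for a model without an intercept.
$R^2$ in the model with an intercept
To see this, set the partial derivative of the residual sum of squares with respect to the intercept $\beta_0$ to zero, which gives the first normal equation:
$$\frac{\partial}{\partial \beta_0}\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j \geq 1} \beta_j x_{ij}\Big)^2 \Big|_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}} = -2\sum_{i=1}^n \Big(y_i - \hat{\beta}_0 - \sum_{j \geq 1} \hat{\beta}_j x_{ij}\Big) = 0,$$
which can be rearranged to $\sum_{i=1}^n y_i = \sum_{i=1}^n \hat{y}_i$. Thus, $\sum_{i=1}^n (y_i - \hat{y}_i) = 0$.
Hence, in a model with an intercept, we have $\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 0$, so that $TSS = ESS + RSS$.
From this, $R^2$ is defined as $R^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{RSS}{TSS}$.
By the above, $0 \le R^2 \le 1$.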
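$R^2$ in the model without an intercept
Without an intercept, the cross term $C = \sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$ need not vanish, so $TSS = ESS + RSS + 2C$ and the two forms of $R^2$ no longer agree:
$$1 - \frac{RSS}{TSS} = \frac{ESS + 2C}{TSS}, \qquad \frac{ESS}{TSS} = 1 - \frac{RSS + 2C}{TSS}.$$
If the cross term $2C$ is a sufficiently large negative value, $R^2$ leaves the interval $[0, 1]$: the form $ESS/TSS$ is larger than 1 when $2C < -RSS$, and the form $1 - RSS/TSS$ is negative when $2C < -ESS$.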
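A small numerical illustration may help here. The five data points below are made up for this sketch (a nearly constant response), and `r2_forms` is a hypothetical helper name. The same data are fitted with and without an intercept, and both forms of $R^2$ are reported along with the cross term $\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$.

```python
import numpy as np

def r2_forms(X, y):
    """Return (1 - RSS/TSS, ESS/TSS, cross term) for a least-squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    y_bar = y.mean()
    rss = np.sum((y - y_hat) ** 2)
    ess = np.sum((y_hat - y_bar) ** 2)
    tss = np.sum((y - y_bar) ** 2)
    cross = np.sum((y - y_hat) * (y_hat - y_bar))
    return 1 - rss / tss, ess / tss, cross

# Made-up data: the response is almost constant around 10.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.1, 9.9, 10.2, 9.8, 10.0])

X_no_int = x.reshape(-1, 1)                        # slope only, line forced through the origin
X_int = np.column_stack([np.ones_like(x), x])      # intercept plus slope

no_int = r2_forms(X_no_int, y)
with_int = r2_forms(X_int, y)
print("without intercept: 1-RSS/TSS=%.2f, ESS/TSS=%.2f, cross=%.2f" % no_int)
print("with intercept:    1-RSS/TSS=%.4f, ESS/TSS=%.4f, cross=%.2e" % with_int)
```

Forcing the line through the origin fits these data far worse than the sample mean does, so $1 - RSS/TSS$ is negative while $ESS/TSS$ exceeds 1; with the intercept the cross term is zero up to rounding and both forms coincide.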
Inflation
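$R^2 = 1 - \dfrac{RSS}{TSS}$. Because the residual sum of squares $RSS$ never increases when regressors are added, while the total sum of squares $TSS$ stays fixed, $R^2$ will weakly increase with additional explanatory variables.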
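To see the inflation concretely, the following sketch (simulated data; all names and settings are assumptions made for illustration) adds pure-noise regressors one at a time to a model with a single relevant predictor and records $R^2$ after each addition.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # only x1 actually matters
noise_cols = rng.normal(size=(n, 5))      # regressors unrelated to y

def r_squared(X, y):
    """R^2 = 1 - RSS/TSS for a least-squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

base = np.column_stack([np.ones(n), x1])
# Model j contains the intercept, x1, and the first j noise columns.
r2_path = [r_squared(np.column_stack([base, noise_cols[:, :j]]), y) for j in range(6)]
for j, r2 in enumerate(r2_path):
    print(f"{j} noise regressors: R^2 = {r2:.4f}")
```

Because the models are nested, each added column can only keep $RSS$ the same or lower it, so the printed sequence never decreases even though the extra columns carry no information about `y`.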
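Adjusted $R^2$ ¹
Define adjusted $R^2$ as
$$\bar{R}^2 = 1 - \frac{RSS/(n-p)}{TSS/(n-1)} = 1 - \frac{\hat{\sigma}^2_{\epsilon}}{\hat{\sigma}^2_{y}},$$
where $p$ is the number of columns of the design matrix (including the intercept column), $\hat{\sigma}^2_{\epsilon} = RSS/(n-p)$ is the sample variance of the estimated residuals, and $\hat{\sigma}^2_{y} = TSS/(n-1)$ is the sample variance of the dependent variable. They can be considered as unbiased estimates of the population variances of the errors and of the dependent variable. Hence, adjusted $R^2$ can be interpreted as a less biased estimator of the population $R^2$; it is not exactly unbiased, because the ratio of two unbiased estimators is not itself unbiased.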
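As a worked example, here is a minimal sketch of the adjustment. The function name `adjusted_r2` and the numbers are hypothetical, and `p` is assumed to count every column of the design matrix, intercept included.

```python
def adjusted_r2(rss, tss, n, p):
    """Adjusted R^2, with p = number of design-matrix columns (intercept included)."""
    if n - p <= 0:
        raise ValueError("need n > p for the residual degrees of freedom")
    return 1.0 - (rss / (n - p)) / (tss / (n - 1))

# Hypothetical sums of squares: RSS = 20, TSS = 100, n = 50 observations.
r2 = 1.0 - 20.0 / 100.0
adj_small = adjusted_r2(rss=20.0, tss=100.0, n=50, p=5)
adj_large = adjusted_r2(rss=20.0, tss=100.0, n=50, p=10)

print(r2, adj_small, adj_large)  # adj_small is about 0.782, below r2 = 0.8
```

With the same $RSS$, the model with more columns gets the lower adjusted value, which is the penalty for complexity that plain $R^2$ lacks.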
2. Bayesian information criterion (BIC)
This section derives the Bayesian information criterion (BIC) for model selection. The derivation mainly follows the note from Dr. Bhat ².
Laplace’s approximation
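Define an index set $\gamma \subseteq \{1, \dots, p\}$ of the active predictors, and let $\theta_\gamma$ denote the parameter vector of the corresponding model, of dimension $k$. So $\gamma$ can be treated as a model we consider. The Bayesian approach to model selection is to maximize the posterior probability of a model $\gamma$ given the data $\mathbf{y} = (y_1, \dots, y_n)$. By Bayes' theorem,
$$P(\gamma \mid \mathbf{y}) = \frac{P(\mathbf{y} \mid \gamma)P(\gamma)}{P(\mathbf{y})} \propto P(\mathbf{y} \mid \gamma)P(\gamma),$$
where
$$P(\mathbf{y} \mid \gamma) = \int f(\mathbf{y} \mid \theta_\gamma)\pi(\theta_\gamma)\,d\theta_\gamma = \int \exp\{Q(\theta_\gamma)\}\,d\theta_\gamma, \qquad Q(\theta_\gamma) = \log\{f(\mathbf{y} \mid \theta_\gamma)\pi(\theta_\gamma)\}.$$
We expand $Q(\theta_\gamma)$ by Taylor expansion about its posterior mode $\tilde{\theta}_\gamma$, where $Q$ attains the maximum. Thus,
$$Q(\theta_\gamma) \approx Q(\tilde{\theta}_\gamma) + (\theta_\gamma - \tilde{\theta}_\gamma)^\top \nabla Q(\tilde{\theta}_\gamma) + \frac{1}{2}(\theta_\gamma - \tilde{\theta}_\gamma)^\top H_{\tilde{\theta}_\gamma}(\theta_\gamma - \tilde{\theta}_\gamma),$$
where $\nabla Q$ is the gradient and $H_{\tilde{\theta}_\gamma}$ is a Hessian matrix such that $[H_{\tilde{\theta}_\gamma}]_{ml} = \dfrac{\partial^2 Q(\theta_\gamma)}{\partial \theta_m \partial \theta_l}\Big|_{\theta_\gamma = \tilde{\theta}_\gamma}$. Note that $\nabla Q(\tilde{\theta}_\gamma) = 0$ and $H_{\tilde{\theta}_\gamma}$ is negative definite at $\tilde{\theta}_\gamma$.
Therefore,
$$P(\mathbf{y} \mid \gamma) \approx \exp\{Q(\tilde{\theta}_\gamma)\}\int \exp\Big\{-\frac{1}{2}(\theta_\gamma - \tilde{\theta}_\gamma)^\top \tilde{H}(\theta_\gamma - \tilde{\theta}_\gamma)\Big\}\,d\theta_\gamma = f(\mathbf{y} \mid \tilde{\theta}_\gamma)\,\pi(\tilde{\theta}_\gamma)\,(2\pi)^{k/2}\,|\tilde{H}|^{-1/2},$$
where $\tilde{H} = -H_{\tilde{\theta}_\gamma}$ is the negative Hessian matrix and the observed Fisher information matrix. Taking the log of it and multiplying by 2, we obtain
$$2\log P(\mathbf{y} \mid \gamma) \approx 2\log f(\mathbf{y} \mid \tilde{\theta}_\gamma) + 2\log \pi(\tilde{\theta}_\gamma) + k\log(2\pi) - \log|\tilde{H}|. \qquad (1)$$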
Flat Prior and the Weak Law of Large Numbers
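If we set a flat prior on $\theta_\gamma$ such that $\pi(\theta_\gamma) = 1$, then the posterior mode $\tilde{\theta}_\gamma$ coincides with the MLE $\hat{\theta}_\gamma$, and each element in the matrix $\tilde{H}$ (the negative Hessian of $\log\{f(\mathbf{y} \mid \theta_\gamma)\pi(\theta_\gamma)\}$ at the mode) is
$$\tilde{H}_{ml} = -\frac{\partial^2 \log f(\mathbf{y} \mid \theta_\gamma)}{\partial \theta_m \partial \theta_l}\Big|_{\theta_\gamma = \hat{\theta}_\gamma}.$$
Since $y_1, \dots, y_n$ are i.i.d., $\log f(\mathbf{y} \mid \theta_\gamma) = \sum_{i=1}^n \log f(y_i \mid \theta_\gamma)$, and according to the weak law of large numbers on the random variable $Z_i = -\dfrac{\partial^2 \log f(y_i \mid \theta_\gamma)}{\partial \theta_m \partial \theta_l}$, we get
$$\frac{1}{n}\sum_{i=1}^n Z_i \xrightarrow{\;p\;} E(Z_1) = [I_1]_{ml}.$$
Therefore, each entry in the observed Fisher information matrix is
$$\tilde{H}_{ml} = n \cdot \frac{1}{n}\sum_{i=1}^n Z_i \approx n\,[I_1]_{ml}$$
for $m, l = 1, \dots, k$. So $\tilde{H} \approx n I_1$, where $I_1$ is the Fisher information matrix for a single data point $y_1$ and $k$ is the dimension of $\theta_\gamma$. Thus, $|\tilde{H}| \approx n^k |I_1|$. We plug it into (1) and get
$$2\log P(\mathbf{y} \mid \gamma) \approx 2\log f(\mathbf{y} \mid \hat{\theta}_\gamma) + k\log(2\pi) - k\log n - \log|I_1|.$$
For large $n$, the terms that do not grow with $n$ can be neglected, and we obtain a simplified form after taking a negative sign:
$$-2\log P(\mathbf{y} \mid \gamma) \approx -2\log f(\mathbf{y} \mid \hat{\theta}_\gamma) + k\log n. \qquad (2)$$
The right-hand side of (2) is the BIC estimate for the model $\gamma$.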
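The resulting criterion is easy to compute for the Gaussian linear model. The sketch below is an illustration under assumptions, not a reference implementation: the data are simulated, `gaussian_bic` is a hypothetical helper, and $k$ is taken to be the number of regression coefficients plus one for the error variance. For this model the maximized log-likelihood gives $-2\log f(\mathbf{y} \mid \hat{\theta}) = n\log(2\pi\hat{\sigma}^2) + n$ with $\hat{\sigma}^2 = RSS/n$.

```python
import numpy as np

def gaussian_bic(X, y):
    """BIC = -2 log f(y | theta_hat) + k log n for a Gaussian linear model."""
    n = y.size
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    sigma2_mle = rss / n
    neg2_loglik = n * np.log(2 * np.pi * sigma2_mle) + n   # -2 * maximized log-likelihood
    k = X.shape[1] + 1                                     # coefficients plus the variance
    return neg2_loglik + k * np.log(n)

rng = np.random.default_rng(3)
n = 200
Z = rng.normal(size=(n, 5))                                # five candidate predictors
y = 1.0 + 2.0 * Z[:, 0] - 1.5 * Z[:, 1] + rng.normal(size=n)  # only the first two are active

# Nested candidate models: intercept plus the first j predictors, j = 0..5.
bics = []
for j in range(6):
    X = np.column_stack([np.ones(n), Z[:, :j]])
    bics.append(gaussian_bic(X, y))

best = int(np.argmin(bics))
print("BIC by number of predictors:", np.round(bics, 2))
print("model with the smallest BIC uses", best, "predictors")
```

Each extra predictor lowers the $-2\log f$ term at most slightly once the active predictors are included, but always costs $\log n$, which is what lets BIC stop short of the biggest model.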
Appendix
Laplace’s method
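Laplace's method approximates integrals of the form
$$\int_a^b e^{M f(x)}\,dx,$$
where $f(x)$ is a twice-differentiable function with a global maximum at $x_0$, $M$ is a large number, and the endpoints $a$ and $b$ could possibly be infinite.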
General theory of Laplace’s method
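Suppose the function $f(x)$ is twice-differentiable and has a unique global maximum at an interior point $x_0$ with $f''(x_0) < 0$, so that we have $f'(x_0) = 0$. By Taylor's theorem,
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2}f''(x_0)(x - x_0)^2 = f(x_0) - \frac{1}{2}|f''(x_0)|(x - x_0)^2.$$
The assumptions ensure the accuracy of the approximation
$$\int_a^b e^{M f(x)}\,dx \approx e^{M f(x_0)}\int_a^b e^{-\frac{1}{2}M|f''(x_0)|(x - x_0)^2}\,dx$$
for a large number $M$.
The second integral is a Gaussian integral if the limits of integration go from $-\infty$ to $+\infty$ (which can be assumed because the exponential decays very fast away from $x_0$), and thus it can be calculated:
$$\int_a^b e^{M f(x)}\,dx \approx \sqrt{\frac{2\pi}{M|f''(x_0)|}}\;e^{M f(x_0)} \quad \text{as } M \to \infty.$$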
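The approximation can be checked on a case with a closed form. As an illustrative choice (not taken from the original note), take $f(x) = \log x - x$ on $(0, \infty)$, which has its maximum at $x_0 = 1$ with $f(x_0) = -1$ and $f''(x_0) = -1$. Then $\int_0^\infty e^{M f(x)}\,dx = \Gamma(M+1)/M^{M+1}$ exactly, and Laplace's method gives $\sqrt{2\pi/M}\,e^{-M}$, which is Stirling's formula. The sketch compares the two on the log scale to avoid overflow.

```python
import math

def laplace_log_approx(M):
    """log of sqrt(2*pi / (M*|f''(x0)|)) * exp(M*f(x0)) for f(x) = log(x) - x, x0 = 1."""
    return 0.5 * math.log(2 * math.pi / M) - M

def exact_log_integral(M):
    """log of the exact integral of x^M * exp(-M*x) over (0, inf): Gamma(M+1) / M^(M+1)."""
    return math.lgamma(M + 1) - (M + 1) * math.log(M)

# Ratio of the Laplace approximation to the exact value for increasing M.
ratios = {M: math.exp(laplace_log_approx(M) - exact_log_integral(M))
          for M in (10, 100, 1000)}
for M, ratio in ratios.items():
    print(f"M = {M:5d}: approximation / exact = {ratio:.6f}")
```

The ratio approaches 1 as $M$ grows, with a relative error of roughly $1/(12M)$, consistent with the claim that the approximation is accurate for large $M$.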