Model Selection Criteria

Caleb Jin / 2019-04-30


Prerequisites

Consider a multiple linear regression model as follows: $$y=X\beta+\epsilon,$$ where $y=(y_1,y_2,\dots,y_n)^T$ is the $n$-dimensional response vector, $X=(x_1,x_2,\dots,x_n)^T$ is the $n\times p$ design matrix, and $\epsilon\sim\mathcal{N}_n(0,\sigma^2 I_n)$. We assume that $p<n$ and $X$ has full rank.

By the method of maximum likelihood, we have $$\hat\beta=(X^TX)^{-1}X^Ty,\qquad \hat\sigma^2=\frac{\mathrm{RSS}}{n}=\frac{\lVert y-X\hat\beta\rVert^2}{n}=\frac{y^T(I-H)y}{n}=\frac{y^TPy}{n},$$ where $P=I-H$, $H=X(X^TX)^{-1}X^T$, and RSS is the residual sum of squares.
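These formulas can be checked numerically. Below is a minimal sketch with simulated data; the sample size, design, and true $\beta$ are made-up illustration choices, not anything from the text above.

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix H = X (X^T X)^{-1} X^T
rss = y @ (np.eye(n) - H) @ y                  # RSS = y^T (I - H) y
sigma2_hat = rss / n                           # MLE of sigma^2 (divides by n, so biased)

# The quadratic-form RSS agrees with the sum of squared residuals.
assert np.allclose(rss, np.sum((y - X @ beta_hat) ** 2))
```

Solving the normal equations with `np.linalg.solve` avoids forming $(X^TX)^{-1}$ explicitly, which is the numerically preferred route.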

Bias-variance tradeoff

According to wiki:

In statistics and machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.

Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.

Bias–variance decomposition of mean squared error (MSE):

We assume $y=f(x)+\varepsilon$, where $E(\varepsilon)=0$ and $\mathrm{Var}(\varepsilon)=\sigma^2$. Our goal is to find a function $\hat f(x)$ that minimizes the MSE of $\hat f$, $E\{(y-\hat f)^T(y-\hat f)\}$.

The bias–variance decomposition of the MSE proceeds as follows: $$\begin{aligned} E\{(y-\hat f)^T(y-\hat f)\} &= \{E(y-\hat f)\}^T E(y-\hat f)+\mathrm{Var}(y-\hat f)\\ &= \lVert\mathrm{Bias}(\hat f)\rVert^2+\mathrm{Var}(y)+\mathrm{Var}(\hat f)-2\,\mathrm{cov}(y,\hat f)\\ &= \lVert\mathrm{Bias}(\hat f)\rVert^2+\mathrm{Var}(\hat f)+\sigma^2, \end{aligned}$$ where $$\mathrm{cov}(y,\hat f)=E(y\hat f)-E(y)E(\hat f)=E[(f+\varepsilon)\hat f]-E(f+\varepsilon)E(\hat f)=fE(\hat f)+E(\varepsilon\hat f)-fE(\hat f)=E(\varepsilon\hat f)=0,$$ since $\varepsilon\perp\hat f$: the noise $\varepsilon$ on a new observation is independent of $\hat f$, which is built from the training data, so $E(\varepsilon\hat f)=E(\varepsilon)E(\hat f)=0$ because $E(\varepsilon)=0$.
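The decomposition can be seen in a small simulation. The sketch below estimates $\mathrm{Bias}^2$ and $\mathrm{Var}$ of $\hat f$ at one test point over repeated training sets, for a simple (degree-1) and a more flexible (degree-5) polynomial fit; the true $f$, the noise level, and all other settings are illustrative choices, not from the text.

```python
import numpy as np

# Illustrative setup: true function, test point, noise level are all made up.
rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0, 1, 30)
x0, sigma, reps = 0.25, 0.3, 500

results = {}
for deg in (1, 5):
    preds = np.empty(reps)
    for r in range(reps):
        y = f(x) + rng.normal(scale=sigma, size=x.size)  # fresh training set
        coef = np.polyfit(x, y, deg)                     # fit polynomial of given degree
        preds[r] = np.polyval(coef, x0)                  # prediction at the test point
    # Bias^2 and variance of the predictions across training sets.
    results[deg] = ((preds.mean() - f(x0)) ** 2, preds.var())

for deg, (bias2, var) in results.items():
    print(f"degree {deg}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

The simple fit shows large squared bias and small variance; the flexible fit shows the reverse, exactly the tradeoff described above.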

Bias-variance tradeoff

  • Models with many covariates tend to have low bias but high variance.
  • Models with few covariates tend to have high bias but low variance.

Hence, we need criteria that take into account both model complexity (the number of predictors) and the quality of fit.

1. The coefficient of determination ($R^2$)

Summary: it is not a good criterion because $R^2$ increases with the size of the model; in other words, it always chooses the biggest model.

Interpretation from Wikipedia: It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

$R^2$

Definition: $$R^2=1-\frac{\mathrm{RSS}}{\mathrm{TSS}}=1-\frac{\sum_i(y_i-\hat f_i)^2}{\sum_i(y_i-\bar y)^2},$$ where TSS is the total sum of squares and RSS is the residual sum of squares. Also define $\mathrm{ESS}=\sum_i(\hat f_i-\bar y)^2$, the explained sum of squares, also called the regression sum of squares. $R^2$ is based on the assumption that $\mathrm{TSS}=\mathrm{RSS}+\mathrm{ESS}$, which is usually satisfied in the linear regression setting, as the following proof shows.

Proof:

$$\begin{aligned} \sum_i(y_i-\bar y)^2 &= \sum_i(y_i-\hat f_i+\hat f_i-\bar y)^2\\ &= \sum_i(y_i-\hat f_i)^2+\sum_i(\hat f_i-\bar y)^2+2\sum_i(y_i-\hat f_i)(\hat f_i-\bar y)\\ &= \mathrm{RSS}+\mathrm{ESS}+2\sum_i\hat e_i(\hat f_i-\bar y)\qquad(\hat e_i=y_i-\hat f_i;\ \hat f_i=\hat y_i=x_i^T\hat\beta\ \text{in the linear model})\\ &= \mathrm{RSS}+\mathrm{ESS}+2\sum_i\hat e_i\hat y_i-2\bar y\sum_i\hat e_i. \end{aligned}$$ Then the remaining part is to prove $\sum_i\hat e_i(\hat y_i-\bar y)=0$.

Firstly, $\sum_i\hat e_i\hat y_i=\hat e^T\hat y=y^T(I-H)Hy=0$, since $H$ is idempotent and hence $(I-H)H=H-H^2=0$. Then, if we can show $\sum_i\hat e_i=0$, our proof is done. However, this cannot be shown for a model without an intercept.

$R^2$ in a model with an intercept

To see this, take the partial derivative of the residual sum of squares with respect to $\hat\beta_0$ and set it to zero (the normal equation for the intercept): $$\frac{\partial\,\mathrm{RSS}}{\partial\hat\beta_0}=\frac{\partial\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)^2}{\partial\hat\beta_0}=-2\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)=0,$$ which can be rearranged to $\sum_i y_i=n\hat\beta_0+\hat\beta_1\sum_i x_i=\sum_i\hat y_i$. Thus, $\sum_i\hat e_i=\sum_i y_i-\sum_i\hat y_i=0$.

Hence, in a model with an intercept, we have that $\mathrm{TSS}=\mathrm{RSS}+\mathrm{ESS}$, or equivalently $1=\frac{\mathrm{RSS}}{\mathrm{TSS}}+\frac{\mathrm{ESS}}{\mathrm{TSS}}$.
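The identity is easy to verify numerically. A minimal check with simulated data (all settings below are illustrative):

```python
import numpy as np

# Simulated simple linear regression with an intercept.
rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])             # design with intercept column
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
resid = y - y_hat

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum(resid ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)

assert abs(resid.sum()) < 1e-8                   # residuals sum to zero
assert np.isclose(tss, rss + ess)                # TSS = RSS + ESS
```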

From this, $R^2$ is defined as $$R^2\overset{\text{def}}{=}1-\frac{\mathrm{RSS}}{\mathrm{TSS}}=\frac{\mathrm{ESS}}{\mathrm{TSS}}.$$ Since RSS, ESS, and TSS are all nonnegative, it follows that $0\le R^2\le 1$.

$R^2$ in a model without an intercept

Without an intercept, the cross term need not vanish: $$R^2\overset{\text{def}}{=}1-\frac{\mathrm{RSS}}{\mathrm{TSS}}=\frac{\mathrm{ESS}+2\sum_i(y_i-\hat y_i)(\hat y_i-\bar y)}{\sum_i(y_i-\bar y)^2}.$$ If the cross term is a large negative value (equivalently, if $\mathrm{RSS}>\mathrm{TSS}$), then $R^2$ is negative, and it loses its interpretation as a proportion of explained variation.
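A negative $R^2$ is easy to produce. The data below are contrived for illustration: the response has a large mean that a no-intercept fit $y=\beta x$ cannot capture.

```python
import numpy as np

# Contrived example: centered regressor, response dominated by its mean.
rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)           # roughly centered regressor
y = 10.0 + rng.normal(size=n)    # large mean, no real dependence on x

beta = (x @ y) / (x @ x)         # no-intercept least squares
rss = np.sum((y - beta * x) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss
print(r2)                        # far below zero here: RSS >> TSS
```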

Inflation of $R^2$

Note that $\max_\beta R^2=\min_\beta\mathrm{RSS}$. As RSS is non-increasing in the number of regressors, $R^2$ will weakly increase with additional explanatory variables.
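This inflation can be demonstrated by appending pure-noise regressors one at a time and watching $R^2$ never decrease; the data and the number of junk columns below are arbitrary illustration choices.

```python
import numpy as np

# Simulated data; only x carries signal.
rng = np.random.default_rng(4)
n = 60
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def r_squared(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

X = np.column_stack([np.ones(n), x])
r2_path = [r_squared(X, y)]
for _ in range(5):                               # append junk regressors one by one
    X = np.column_stack([X, rng.normal(size=n)])
    r2_path.append(r_squared(X, y))

# R^2 is weakly increasing along the path of growing models.
assert all(b >= a - 1e-9 for a, b in zip(r2_path, r2_path[1:]))
```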

Adjusted $R^2$ [1]

Define the adjusted $R^2$ as

$$R_a^2=1-\frac{\mathrm{RSS}/df_e}{\mathrm{TSS}/df_t}=1-\frac{\mathrm{RSS}/(n-p-1)}{\mathrm{TSS}/(n-1)}=1-\frac{\hat\sigma^2_{\text{res}}}{\hat\sigma^2_{\text{tot}}},$$ where $\hat\sigma^2_{\text{res}}$ is the sample variance of the estimated residuals and $\hat\sigma^2_{\text{tot}}$ is the sample variance of the dependent variable. They can be considered unbiased estimates of the population variances of the errors and of the dependent variable, respectively. Hence, the adjusted $R^2$ can be interpreted as a less biased estimator of the population $R^2$.
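A sketch comparing $R^2$ and $R_a^2$ on the same fit: because $df_e=n-p-1<n-1=df_t$, the adjusted version always sits strictly below the plain $R^2$ (when $R^2<1$), and unlike $R^2$ it can fall when junk regressors are added. The data below are illustrative.

```python
import numpy as np

# Simulated data; only x carries signal, the rest is junk.
rng = np.random.default_rng(5)
n = 40
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def fit_r2(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    p = X.shape[1] - 1                           # predictors, excluding intercept
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
    return r2, r2_adj

X = np.column_stack([np.ones(n), x])
base_r2, base_adj = fit_r2(X, y)
for _ in range(10):                              # add ten pure-noise regressors
    X = np.column_stack([X, rng.normal(size=n)])
r2, r2_adj = fit_r2(X, y)

print(f"R^2: {base_r2:.3f} -> {r2:.3f}, adjusted: {base_adj:.3f} -> {r2_adj:.3f}")
```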

2. Bayesian information criterion (BIC)

This section derives the Bayesian information criterion (BIC) for model selection. The content below follows the note by Dr. Bhat [2].

Laplace’s approximation

Define an index set of the active predictors, $\gamma=\{j:\beta_j\neq 0\}$ for $j=1,2,\dots,p$, so $\gamma$ can be treated as a candidate model. The Bayesian approach to model selection is to maximize the posterior distribution of a model given the data, $\pi(\gamma|y)$. By Bayes' theorem, $\pi(\gamma|y)\propto f(y|\gamma)\pi(\gamma)$, where $$f(y|\gamma)=\int f(y|\theta_\gamma,\gamma)\pi(\theta_\gamma|\gamma)\,d\theta_\gamma=\int\exp\!\left[\log\big(f(y|\theta_\gamma,\gamma)\pi(\theta_\gamma|\gamma)\big)\right]d\theta_\gamma.$$ We expand $\log\big(f(y|\theta_\gamma,\gamma)\pi(\theta_\gamma|\gamma)\big)$ in a Taylor series about its posterior mode $\hat\theta_\gamma$, where $f(y|\theta_\gamma,\gamma)\pi(\theta_\gamma|\gamma)$ attains its maximum. Thus, $$\log\big(f(y|\theta_\gamma,\gamma)\pi(\theta_\gamma|\gamma)\big)\approx\log\big(f(y|\hat\theta_\gamma,\gamma)\pi(\hat\theta_\gamma|\gamma)\big)+(\theta_\gamma-\hat\theta_\gamma)^T\nabla_\theta Q\big|_{\hat\theta_\gamma}+\frac{1}{2}(\theta_\gamma-\hat\theta_\gamma)^T H_{\hat\theta_\gamma}(\theta_\gamma-\hat\theta_\gamma),$$ where $Q_{\theta_\gamma}=\log\big(f(y|\theta_\gamma,\gamma)\pi(\theta_\gamma|\gamma)\big)$ and $H_{\hat\theta_\gamma}$ is the Hessian matrix, $H_{\hat\theta_\gamma}=\frac{\partial^2 Q}{\partial\theta\,\partial\theta^T}\big|_{\hat\theta_\gamma}$. Note that $\nabla_\theta Q\big|_{\hat\theta_\gamma}=0$ and $H_{\hat\theta_\gamma}$ is negative definite at $\hat\theta_\gamma$.

Therefore, $$\begin{aligned} f(y|\gamma)&=\int\exp\!\left[\log\big(f(y|\theta_\gamma,\gamma)\pi(\theta_\gamma|\gamma)\big)\right]d\theta_\gamma\approx\int\exp\!\left\{Q_{\hat\theta_\gamma}-\frac{1}{2}(\theta_\gamma-\hat\theta_\gamma)^T\tilde H_{\hat\theta_\gamma}(\theta_\gamma-\hat\theta_\gamma)\right\}d\theta_\gamma\\ &=\exp(Q_{\hat\theta_\gamma})\int\exp\!\left\{-\frac{1}{2}(\theta_\gamma-\hat\theta_\gamma)^T\tilde H_{\hat\theta_\gamma}(\theta_\gamma-\hat\theta_\gamma)\right\}d\theta_\gamma=f(y|\hat\theta_\gamma,\gamma)\pi(\hat\theta_\gamma|\gamma)\big|2\pi\tilde H_{\hat\theta_\gamma}^{-1}\big|^{1/2}\\ &=f(y|\hat\theta_\gamma,\gamma)\pi(\hat\theta_\gamma|\gamma)(2\pi)^{|\gamma|/2}\big|\tilde H_{\hat\theta_\gamma}\big|^{-1/2}, \end{aligned}$$ where $\tilde H_{\hat\theta_\gamma}=-H_{\hat\theta_\gamma}$ is the negative Hessian matrix, i.e., the observed Fisher information matrix. Taking the log, we obtain

$$-2\log f(y|\gamma)=-2\log f(y|\hat\theta_\gamma,\gamma)-2\log\pi(\hat\theta_\gamma|\gamma)-|\gamma|\log(2\pi)+\log\big|\tilde H_{\hat\theta_\gamma}\big|.\tag{1}$$

Flat Prior and the Weak Law of Large Numbers

If we set a flat prior on $\theta_\gamma$, i.e., $\pi(\theta_\gamma)=1$, then each element of the matrix $\tilde H_{\hat\theta_\gamma}$ is $$\tilde H_{mn}=-\frac{\partial^2\log f(y|\theta_\gamma,\gamma)}{\partial\theta_m\,\partial\theta_n}\bigg|_{\theta=\hat\theta_\gamma}=-\frac{\partial^2\log\big(\prod_{i=1}^n f(y_i|\theta_\gamma,\gamma)\big)}{\partial\theta_m\,\partial\theta_n}\bigg|_{\theta=\hat\theta_\gamma}=-\frac{\partial^2\sum_{i=1}^n\log f(y_i|\theta_\gamma,\gamma)}{\partial\theta_m\,\partial\theta_n}\bigg|_{\theta=\hat\theta_\gamma}=-\frac{\partial^2\big(n\cdot\frac{1}{n}\sum_{i=1}^n\log f(y_i|\theta_\gamma,\gamma)\big)}{\partial\theta_m\,\partial\theta_n}\bigg|_{\theta=\hat\theta_\gamma}.$$

Since $y_1,y_2,\dots,y_n$ are i.i.d., applying the weak law of large numbers to the random variables $z_i=\log f(y_i|\theta_\gamma,\gamma)$, we get

$$\frac{1}{n}\sum_{i=1}^n\log f(y_i|\theta_\gamma,\gamma)\xrightarrow{p}E[\log f(y_i|\theta_\gamma,\gamma)].$$ Therefore, each entry of the observed Fisher information matrix is approximately

$$\tilde H_{mn}\approx-\frac{\partial^2\big(nE[\log f(y_i|\theta_\gamma,\gamma)]\big)}{\partial\theta_m\,\partial\theta_n}\bigg|_{\theta=\hat\theta_\gamma}=-n\frac{\partial^2 E[\log f(y_i|\theta_\gamma,\gamma)]}{\partial\theta_m\,\partial\theta_n}\bigg|_{\theta=\hat\theta_\gamma}=nI_{mn}$$ for $i=1,2,\dots,n$. So $\tilde H_{\hat\theta_\gamma}\approx nI_{\hat\theta_\gamma}$, where $I_{\hat\theta_\gamma}=-\frac{\partial^2 E[\log f(y_i|\theta_\gamma,\gamma)]}{\partial\theta\,\partial\theta^T}\big|_{\theta=\hat\theta_\gamma}$ is the Fisher information matrix for a single data point $y_i$. Thus, $\big|\tilde H_{\hat\theta_\gamma}\big|=n^{|\gamma|}\big|I_{\hat\theta_\gamma}\big|$. Plugging this into (1), we get $$-2\log f(y|\gamma)=-2\log f(y|\hat\theta_\gamma,\gamma)-2\log\pi(\hat\theta_\gamma|\gamma)-|\gamma|\log(2\pi)+|\gamma|\log n+\log\big|I_{\hat\theta_\gamma}\big|.$$ For large $n$, the terms that do not grow with $n$ can be neglected, and we obtain the simplified form $$-2\log f(y|\gamma)\approx-2\log f(y|\hat\theta_\gamma,\gamma)+|\gamma|\log n.\tag{2}$$ The right-hand side of (2) is the BIC for the model $\gamma$; minimizing it over $\gamma$ approximately maximizes the posterior model probability.
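For the Gaussian linear model, $-2\log f(y|\hat\theta_\gamma,\gamma)$ reduces, up to constants that are the same for every model, to $n\log(\mathrm{RSS}_\gamma/n)$, giving $\mathrm{BIC}(\gamma)=n\log(\mathrm{RSS}_\gamma/n)+|\gamma|\log n$. The sketch below applies this to an all-subsets search on simulated data; the setup (five candidate predictors, of which only the first two are active) is an illustration, not from the text.

```python
import numpy as np
from itertools import combinations

# Simulated data: only predictors 0 and 1 are truly active.
rng = np.random.default_rng(6)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

def bic(cols):
    # BIC(gamma) = n * log(RSS_gamma / n) + |gamma| * log(n)
    Xg = X[:, list(cols)]
    beta = np.linalg.lstsq(Xg, y, rcond=None)[0]
    rss = np.sum((y - Xg @ beta) ** 2)
    return n * np.log(rss / n) + len(cols) * np.log(n)

# All non-empty subsets of the p candidate predictors.
models = [c for k in range(1, p + 1) for c in combinations(range(p), k)]
best = min(models, key=bic)
print(best)  # indices of the selected predictors
```

The $|\gamma|\log n$ penalty is what keeps the junk predictors out: a junk column must reduce the deviance by more than $\log n$ to earn its place.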

Appendix

Laplace’s method

$$\int_a^b e^{Mf(x)}\,dx\approx\sqrt{\frac{2\pi}{M|f''(x_0)|}}\,e^{Mf(x_0)}\quad\text{as }M\to\infty,$$ where $f(x)$ is a twice-differentiable function with a global maximum at $x_0$, $M$ is a large number, and the endpoints $a$ and $b$ could possibly be infinite.

General theory of Laplace’s method

Suppose the function $f(x)$ is twice differentiable and has a unique global maximum at $x_0$, so that $f'(x_0)=0$ and $f''(x_0)<0$. By Taylor's theorem,

$$f(x)=f(x_0)+f'(x_0)(x-x_0)+\frac{1}{2}f''(x_0)(x-x_0)^2+O\big((x-x_0)^3\big)\approx f(x_0)-\frac{1}{2}|f''(x_0)|(x-x_0)^2.$$ The assumptions ensure the accuracy of the approximation:

$$\int_a^b e^{Mf(x)}\,dx\approx e^{Mf(x_0)}\int_a^b e^{-\frac{1}{2}M|f''(x_0)|(x-x_0)^2}\,dx=\sqrt{\frac{2\pi}{M|f''(x_0)|}}\,e^{Mf(x_0)}\quad\text{as }M\to\infty.$$ The second integral is a Gaussian integral if the limits of integration go from $-\infty$ to $\infty$ (which can be assumed because the exponential decays very fast away from $x_0$), and thus it can be calculated.
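A quick numerical check, using the illustrative choice $f(x)=\log x-x$, which has its global maximum at $x_0=1$ with $f(1)=-1$ and $f''(1)=-1$. Here $\int_0^\infty e^{Mf(x)}\,dx=\Gamma(M+1)/M^{M+1}$ exactly, so the Laplace approximation $\sqrt{2\pi/M}\,e^{-M}$ can be compared against it in log space (this check is precisely Stirling's approximation).

```python
import math

# Compare the exact integral Gamma(M+1)/M^(M+1) with the Laplace
# approximation sqrt(2*pi/M) * e^{-M}, working in logs to avoid overflow.
ratios = []
for M in (10, 100, 1000):
    log_exact = math.lgamma(M + 1) - (M + 1) * math.log(M)
    log_approx = 0.5 * math.log(2 * math.pi / M) - M
    ratios.append(math.exp(log_approx - log_exact))
print(ratios)  # each ratio is close to 1, and closer as M grows
```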


  1. https://en.wikipedia.org/wiki/Coefficient_of_determination#Extensions

  2. https://faculty.ucmerced.edu/hbhat/BICderivation.pdf