In this chapter, the linear models are of the form
$$f(X)=X\beta,$$
where $\beta$ is an unknown parameter vector and the inputs $X_j$ can come from:
∙ quantitative inputs or their transformations;
∙ basis expansions, such as $X_2=X_1^2$, $X_3=X_1^3$;
∙ numeric or dummy coding of the levels of qualitative inputs;
∙ interactions between variables.
The basic assumption is that $E(Y|X)$ is linear, or that the linear model is a reasonable approximation.
Minimizing $\mathrm{RSS}(\beta)=(y-X\beta)'(y-X\beta)$, we get $\hat\beta=(X'X)^{-1}X'y$, $\hat y=X(X'X)^{-1}X'y$, and $H=X(X'X)^{-1}X'$ is the hat matrix.
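As a minimal sketch of these formulas (the data, dimensions, and all variable names below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y        # beta_hat = (X'X)^{-1} X'y
y_hat = X @ beta_hat                # fitted values
H = X @ XtX_inv @ X.T               # hat matrix: it "puts the hat on" y

assert np.allclose(y_hat, H @ y)
```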
△ Geometric view of least squares: $\hat y$ is the orthogonal projection of $y$ onto the column space of $X$.
△ What if $X'X$ is singular ($X$ not of full rank)?
Then $\hat\beta$ is not uniquely defined. However, $\hat y=X\hat\beta$ is still the projection of $y$ onto the column space of $X$. Why does this happen?
∙ One or more qualitative variables are coded in a redundant fashion.
∙ The dimension $p$ exceeds the number of training cases $N$.
Basically, we can use filtering methods or regularization to solve this problem.
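A small numerical sketch of the rank-deficient case, assuming we resolve the ambiguity with the pseudoinverse (the redundant coding below is contrived for illustration): $\hat\beta$ is no longer unique, but the fitted values are still the projection of $y$ onto the column space of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1, 2 * x1])    # last column duplicates x1, so X'X is singular
y = 1.0 + 3.0 * x1 + rng.normal(scale=0.3, size=N)

# (X'X)^{-1} does not exist; the pseudoinverse picks one of the many least squares solutions.
beta_any = np.linalg.pinv(X) @ y
y_hat = X @ beta_any

# The projection onto the column space is unique: compare with a full-rank basis of the same space.
Q, _ = np.linalg.qr(X[:, :2])
assert np.allclose(y_hat, Q @ (Q.T @ y))
```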
For further work, we need more assumptions about $Y$: the $y_i$ are uncorrelated and have constant variance $\sigma^2$, and the $x_i$ are fixed (non-random).
Then we know $\mathrm{Var}(\hat\beta)=(X'X)^{-1}\sigma^2$, and an estimate of $\sigma^2$ is
$$\hat\sigma^2=\frac{1}{N-p-1}\sum_{i=1}^N(y_i-\hat y_i)^2=\frac{1}{N-p-1}\,y'(I-H)y.$$
To draw inferences about the parameters, we need the additional assumptions $E(Y|X)=X\beta$ and $Y=E(Y|X)+\varepsilon$, where $\varepsilon\sim N(0,\sigma^2)$.
Then $\hat\beta\sim N\!\left(\beta,(X'X)^{-1}\sigma^2\right)$, and $(N-p-1)\hat\sigma^2\sim\sigma^2\chi^2_{N-p-1}$.
∙ Simple t test
$z_j=\dfrac{\hat\beta_j}{\hat\sigma\sqrt{v_j}}$, where $v_j$ is the $j$th diagonal element of $(X'X)^{-1}$.
∙ F test
$$F=\frac{(\mathrm{RSS}_0-\mathrm{RSS}_1)/(p_1-p_0)}{\mathrm{RSS}_1/(N-p_1-1)}\sim F_{p_1-p_0,\,N-p_1-1}.$$
From these distributions we can also derive confidence intervals, e.g. $\hat\beta_j\pm z^{(1-\alpha)}\hat\sigma\sqrt{v_j}$ for a single coefficient.
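A sketch of these tests and intervals in code, on synthetic data (SciPy here only supplies the reference $t$ and $F$ distributions; the dropped column and all names are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, 0.0, -1.0])       # one coefficient is truly zero
y = X @ beta_true + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - p - 1)          # estimate of sigma^2

# t statistics z_j and two-sided p-values
v = np.diag(XtX_inv)
se = np.sqrt(sigma2_hat * v)
z = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)

# 95% confidence intervals beta_hat_j +/- t * se_j
t_crit = stats.t.ppf(0.975, df=N - p - 1)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

# F test for dropping the column whose true coefficient is zero (p1 - p0 = 1)
X0 = np.delete(X, 2, axis=1)
beta0 = np.linalg.inv(X0.T @ X0) @ X0.T @ y
rss1 = resid @ resid
rss0 = np.sum((y - X0 @ beta0) ** 2)
F = (rss0 - rss1) / (rss1 / (N - p - 1))
p_F = stats.f.sf(F, 1, N - p - 1)
```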
In this subsection, we focus only on estimation of a linear combination of the parameters, $\theta=a'\beta$.
The least squares estimate of $a'\beta$ is $\hat\theta=a'\hat\beta=a'(X'X)^{-1}X'y$. If we assume that the linear model is correct, then this estimate is unbiased.
Gauss-Markov Theorem: For any other linear estimator $\tilde\theta=c'y$ that is unbiased for $a'\beta$, we have $\mathrm{Var}(a'\hat\beta)\leqslant\mathrm{Var}(c'y)$.
Note: unbiased estimators are not necessarily better than biased estimators, since the unbiased ones may have larger variance.
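The note can be illustrated with a small simulation comparing the unbiased least squares estimate of $\theta=a'\beta$ with a deliberately biased, shrunken version; the shrinkage factor 0.8 and the weak-signal setup are arbitrary choices made so the trade-off is visible.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 30, 5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.zeros(p + 1)
beta[0] = 0.5                          # true coefficients are small, so shrinkage can win
a = rng.normal(size=p + 1)
theta = a @ beta

XtX_inv = np.linalg.inv(X.T @ X)
mse_ls = mse_shrunk = 0.0
reps = 5000
for _ in range(reps):
    y = X @ beta + rng.normal(size=N)
    theta_hat = a @ (XtX_inv @ X.T @ y)            # unbiased least squares estimate of a'beta
    mse_ls += (theta_hat - theta) ** 2
    mse_shrunk += (0.8 * theta_hat - theta) ** 2   # biased, but with smaller variance

print("MSE, unbiased LS :", mse_ls / reps)
print("MSE, shrunken    :", mse_shrunk / reps)
```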
∙ MSE vs. EPE
For $Y_0=f(x_0)+\varepsilon_0$, the EPE of $\tilde f(x_0)=x_0'\tilde\beta$ is
$$\mathrm{EPE}=E(Y_0-\tilde f(x_0))^2=\sigma^2+E(x_0'\tilde\beta-f(x_0))^2=\sigma^2+\mathrm{MSE}(\tilde f(x_0)),$$
where the cross term vanishes because $\varepsilon_0$ has mean zero and is independent of $\tilde f(x_0)$.
∙ Univariate model (without intercept)
$Y=X\beta+\varepsilon$, and the least squares estimate is $\hat\beta=\dfrac{\langle x,y\rangle}{\langle x,x\rangle}$, where $\langle\cdot,\cdot\rangle$ denotes the inner product.
Fact: when inputs are orthogonal, they have no effect on each other’s parameter estimates in the model.
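A quick numerical check of this fact, using a design whose columns are made orthogonal via QR purely for the example: the multiple regression coefficients coincide with the univariate estimates $\langle x_j,y\rangle/\langle x_j,x_j\rangle$.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 40
Q, _ = np.linalg.qr(rng.normal(size=(N, 3)))    # columns of Q are mutually orthogonal
y = Q @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=N)

# Multiple regression on all columns simultaneously
beta_multi = np.linalg.inv(Q.T @ Q) @ Q.T @ y

# Separate univariate regressions: <x_j, y> / <x_j, x_j>
beta_uni = np.array([(Q[:, j] @ y) / (Q[:, j] @ Q[:, j]) for j in range(3)])

assert np.allclose(beta_multi, beta_uni)
```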
The idea of the following algorithm is similar to the Gram-Schmidt process, but without normalization.