In this Chapter, the linear models are of the form
\[f(X)=X\beta,\]
where \(\beta\) is an unknown parameter vector and the inputs \(X_j\) can come from
\(\bullet\) quantitative inputs or their transformations;
\(\bullet\) basis expansions, such as \(X_2 = X_1^2,X_3=X_1^3\);
\(\bullet\) numeric or dummy coding of the levels of qualitative inputs;
\(\bullet\) interactions between variables.
The basic assumption is that \(E(Y|X)\) is linear, or that the linear model is a reasonable approximation.
Minimizing \(RSS(\beta)=(y-X\beta)'(y-X\beta)\), we get \(\hat\beta=(X'X)^{-1}X'y\) and \(\hat y=X(X'X)^{-1}X'y\), where \(H=X(X'X)^{-1}X'\) is the hat matrix.
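As a quick numeric check, here is a minimal numpy sketch (the simulated design and coefficients are my own, not from the text) that computes \(\hat\beta\), \(\hat y\), and the hat matrix \(H\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p inputs
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Least squares estimate beta_hat = (X'X)^{-1} X'y (solve, rather than invert explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' and fitted values y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(beta_hat)
print(np.allclose(y_hat, X @ beta_hat))  # True: H projects y onto the column space of X
```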
\(\triangle\) Geometrical view of least squares: an orthogonal-projection viewpoint.
\(\triangle\) What if \(X'X\) is singular (\(X\) not of full column rank)?
Then \(\hat\beta\) is not uniquely defined. However, \(\hat y=X\hat\beta\) is still the projection of \(y\) onto the column space of \(X\). Why does this happen?
\(\bullet\) One or more qualitative variables are coded in a redundant fashion.
\(\bullet\) The dimension \(p\) exceeds the number of training cases \(N\).
Basically, we can drop or recode the redundant columns (filtering) or use regularization to resolve the non-uniqueness.
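A hedged numeric check (toy data of my own): with a duplicated column, \(X'X\) is singular and \(\hat\beta\) is not unique, but the fitted values are:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1, x1])   # duplicated column: X is not of full rank
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.3, size=N)

beta_minnorm = np.linalg.pinv(X) @ y        # one particular least squares solution (minimum norm)
# Shifting weight between the two identical columns gives another valid solution
beta_other = beta_minnorm + np.array([0.0, 5.0, -5.0])

print(np.allclose(X @ beta_minnorm, X @ beta_other))  # True: same projection y_hat
print(np.linalg.matrix_rank(X), X.shape[1])           # 2 < 3, so X'X is singular
```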
For further work, we need more assumptions about \(Y\):
the \(y_i\) are uncorrelated and have constant variance \(\sigma^2\), and the \(x_i\) are fixed (non-random).
Then, we know \(Var(\hat\beta)=(X'X)^{-1}\sigma^2\), and an estimate of \(\sigma^2\) is \[ \hat{\sigma}^2=\frac{1}{N-p-1}\sum_{i=1}^N(y_i-\hat y_i)^2=\frac{1}{N-p-1}y'(I-H)y. \] To draw inference about the parameters, we need additional assumptions: \(E(Y|X)=X\beta\), \(Y=E(Y|X)+\varepsilon\), where \(\varepsilon\sim N(0,\sigma^2)\).
Then, \(\hat\beta\sim N(\beta,(X'X)^{-1}\sigma^2)\), and \((N-p-1)\hat\sigma^2\sim \sigma^2\chi^2_{N-p-1}\).
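A short justification for the divisor \(N-p-1\) (assuming \(X\) has full rank with \(p+1\) columns, including the intercept): since \((I-H)X=0\) and \(I-H\) is idempotent with \(\mathrm{tr}(I-H)=N-(p+1)\),
\[
E\big[y'(I-H)y\big]=E\big[\varepsilon'(I-H)\varepsilon\big]=\sigma^2\,\mathrm{tr}(I-H)=\sigma^2(N-p-1),
\]
so \(\hat\sigma^2\) is unbiased for \(\sigma^2\).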
\(\bullet\) Simple t test
\(z_j =\frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}}\), where \(v_j\) is the \(j\)th diagonal element of \((X'X)^{-1}\); under the null hypothesis \(\beta_j=0\), \(z_j\) is distributed as \(t_{N-p-1}\).
\(\bullet\) F test
\(F = \frac{(RSS_0-RSS_1)/(p_1-p_0)}{RSS_1/(N-p_1-1)}\sim F_{p_1-p_0,\,N-p_1-1},\) where \(RSS_1\) is the residual sum of squares of the bigger model with \(p_1+1\) parameters and \(RSS_0\) that of the nested smaller model with \(p_0+1\) parameters.
Then, we can derive confidence intervals as well; the sketch below computes \(z_j\), the \(F\) statistic, and a \(95\%\) confidence interval.
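A minimal numpy/scipy sketch on simulated data (the design, coefficients, and the dropped column are my own choices, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N, p = 60, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p inputs
beta_true = np.array([1.0, 2.0, 0.0, -1.5])                 # second input has no effect
y = X @ beta_true + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
df = N - p - 1
sigma2_hat = resid @ resid / df

# t statistics z_j = beta_hat_j / (sigma_hat * sqrt(v_j))
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
z = beta_hat / se

# 95% confidence intervals: beta_hat_j +/- t_{0.975, df} * se_j
t_crit = stats.t.ppf(0.975, df)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

# F test: does dropping the second input (one coefficient, p_1 - p_0 = 1) hurt?
X0 = np.delete(X, 2, axis=1)
rss1 = resid @ resid
rss0 = y @ y - y @ X0 @ np.linalg.solve(X0.T @ X0, X0.T @ y)
F = ((rss0 - rss1) / 1) / (rss1 / df)

print("z-scores:", np.round(z, 2))
print("95% CI:", np.round(ci, 2))
print("F:", round(F, 2), "p-value:", round(stats.f.sf(F, 1, df), 3))
```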
In this subsection, we focus on estimation of any linear combination of the parameters, \(\theta=a'\beta\).
The least squares estimate of \(a'\beta\) is \[\hat\theta=a'\hat\beta=a'(X'X)^{-1}X'y.\] If we assume that the linear model is correct, then this estimate is unbiased.
Gauss-Markov Theorem: For any other linear estimator \(\tilde\theta=c'y\) that is unbiased for \(a'\beta\), we have \[ Var(a'\hat\beta)\leqslant Var(c'y). \] Note: unbiased estimators are not necessarily better than biased estimators, since the unbiased ones may have larger variance.
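A standard proof sketch (my own phrasing): write \(c' = a'(X'X)^{-1}X' + d'\). Unbiasedness for every \(\beta\) forces \(d'X=0\), hence
\[
Var(c'y)=\sigma^2 c'c=\sigma^2\big(a'(X'X)^{-1}a + d'd\big)\geqslant \sigma^2 a'(X'X)^{-1}a = Var(a'\hat\beta).
\]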
\(\bullet\) MSE vs. EPE
For \(Y_0=f(x_0)+\varepsilon_0\), the EPE of \(\tilde f(x_0)=x_0'\tilde\beta\) is \[ EPE=E(Y_0-\tilde{f}(x_0))^2=\sigma^2+E(x_0'\tilde\beta-f(x_0))^2=\sigma^2+MSE(\tilde f(x_0)) \]
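The middle step uses \(Y_0=f(x_0)+\varepsilon_0\) with \(\varepsilon_0\) independent of the training data (and hence of \(\tilde\beta\)), so the cross term drops out:
\[
E\big(\varepsilon_0+f(x_0)-x_0'\tilde\beta\big)^2
=E\varepsilon_0^2+2\,E[\varepsilon_0]\,E\big[f(x_0)-x_0'\tilde\beta\big]+E\big(f(x_0)-x_0'\tilde\beta\big)^2
=\sigma^2+MSE(\tilde f(x_0)).
\]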
\(\bullet\) Univariate model (without intercept)
\(Y=X\beta+\varepsilon\), and the least squares estimate is \(\hat\beta=\frac{\langle x,y\rangle}{\langle x,x\rangle}\), where \(\langle\cdot,\cdot\rangle\) denotes the inner product.
Fact: when inputs are orthogonal, they have no effect on each other’s parameter estimates in the model.
The idea of the following algorithm is similar to the Gram-Schmidt process, but without normalization.
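Assuming the algorithm referred to is regression by successive orthogonalization, here is a minimal numpy sketch (function name and toy data are my own): each input is orthogonalized against the residuals of its predecessors, and a univariate fit on the last residual column recovers the multiple-regression coefficient of the last input.

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Orthogonalize the columns of X one by one (Gram-Schmidt without normalization)
    and use univariate fits <z, .>/<z, z> on the residual columns."""
    N, P = X.shape
    Z = np.empty_like(X, dtype=float)
    for j in range(P):
        Z[:, j] = X[:, j]
        for l in range(j):
            gamma = Z[:, l] @ X[:, j] / (Z[:, l] @ Z[:, l])  # univariate coefficient on z_l
            Z[:, j] -= gamma * Z[:, l]
    # Regressing y on the last residual column gives the multiple-regression
    # coefficient of the last input.
    beta_last = Z[:, -1] @ y / (Z[:, -1] @ Z[:, -1])
    return Z, beta_last

# Check against the full least squares fit
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=40)
_, beta_last = successive_orthogonalization(X, y)
print(np.allclose(beta_last, np.linalg.lstsq(X, y, rcond=None)[0][-1]))  # True
```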