In this chapter, linear models have the form

\[f(X)=X\beta,\]

where \(\beta\) is an unknown parameter vector and the inputs \(X_j\) can come from

**\(\bullet\)** quantitative inputs or their transformations;

**\(\bullet\)** basis expansions, such as \(X_2 = X_1^2,X_3=X_1^3\);

**\(\bullet\)** numeric or dummy coding of the levels of qualitative inputs;

**\(\bullet\)** interactions between variables.

The basic assumption is that \(E(Y|X)\) is linear in the inputs, or at least that the linear model is a reasonable approximation.

Minimizing \(RSS(\beta)=(y-X\beta)'(y-X\beta)\) gives \(\hat\beta=(X'X)^{-1}X'y\) and \(\hat y=X(X'X)^{-1}X'y\); the matrix \(H=X(X'X)^{-1}X'\) is called the hat matrix.
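As a minimal numerical sketch of these formulas (the simulated data and variable names are illustrative, not from the text), the fit can be computed directly with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # first column is the intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Least squares estimate: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' and fitted values y_hat = H y = X beta_hat
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
```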

**\(\triangle\)** Geometrical view of least squares: \(\hat y\) is the orthogonal projection of \(y\) onto the column space of \(X\), and the residual \(y-\hat y\) is orthogonal to that space.

**\(\triangle\)** What if \(X'X\) is singular (\(X\) is not of full column rank)?

Then \(\hat\beta\) is not uniquely defined. However, \(\hat y=X\hat\beta\) is still the projection of \(y\) onto the column space of \(X\); it is just not uniquely expressed in terms of the columns of \(X\). Why does this happen?

**\(\bullet\)** One or more qualitative variables are coded in a redundant fashion.

**\(\bullet\)** The dimension \(p\) exceeds the number of training cases \(N\).

In practice, we can resolve this by filtering (recoding or dropping redundant inputs) or by regularization.
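As a small illustration (simulated data, not from the text) that \(\hat y\) stays well defined even when \(X'X\) is singular, one can fit a rank-deficient design with the pseudo-inverse and compare against an explicit projection:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1, 2 * x1])  # third column duplicates the second: X'X is singular
y = rng.normal(size=N)

# The pseudo-inverse picks one of the many least squares solutions for beta
beta_pinv = np.linalg.pinv(X) @ y
y_hat = X @ beta_pinv

# y_hat is still the unique projection of y onto the column space of X
Q, _ = np.linalg.qr(np.column_stack([np.ones(N), x1]))  # orthonormal basis of that column space
print(np.allclose(y_hat, Q @ (Q.T @ y)))  # True
```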

For further work, we need more assumptions about \(Y\):

the \(y_i\) are uncorrelated and have constant variance \(\sigma^2\), and the \(x_i\) are fixed (non-random).

Then, \(Var(\hat\beta)=(X'X)^{-1}\sigma^2\), and an unbiased estimate of \(\sigma^2\) is \[ \hat{\sigma}^2=\frac{1}{N-p-1}\sum_{i=1}^N(y_i-\hat y_i)^2=\frac{1}{N-p-1}y'(I-H)y. \] To draw inferences about the parameters, we need additional assumptions: \(E(Y|X)=X\beta\), \(Y=E(Y|X)+\varepsilon\), where \(\varepsilon\sim N(0,\sigma^2)\).

Then, \(\hat\beta\sim N(\beta,(X'X)^{-1}\sigma^2)\), and \((N-p-1)\hat\sigma^2\sim \sigma^2\chi^2_{N-p-1}\).
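A quick numerical check of these sampling formulas (simulated data; the setup is illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # p inputs plus an intercept
y = X @ np.array([1.0, 0.5, -0.5, 0.0]) + rng.normal(scale=0.8, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Unbiased estimate of sigma^2 on N - p - 1 degrees of freedom
sigma2_hat = resid @ resid / (N - p - 1)

# Estimated covariance of beta_hat and the standard errors of its components
cov_beta_hat = XtX_inv * sigma2_hat
std_err = np.sqrt(np.diag(cov_beta_hat))
```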

**\(\bullet\)** Simple t test

\(z_j =\frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}}\), where \(v_j\) is the \(j\)th diagonal element of \((X'X)^{-1}\). Under the null hypothesis \(\beta_j=0\), \(z_j\sim t_{N-p-1}\).
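A short computational sketch of these statistics (simulated data; scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(scale=0.8, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))

# z_j = beta_hat_j / (sigma_hat * sqrt(v_j)), with v_j the j-th diagonal of (X'X)^{-1}
v = np.diag(XtX_inv)
z = beta_hat / (sigma_hat * np.sqrt(v))
p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)  # two-sided test of beta_j = 0
```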

**\(\bullet\)** F test

\(F = \frac{(RSS_0-RSS_1)/(p_1-p_0)}{RSS_1/(N-p_1-1)}\sim F_{p_1-p_0,N-p_1-1},\) where \(RSS_1\) is the residual sum of squares of the bigger model with \(p_1+1\) parameters and \(RSS_0\) that of the nested smaller model with \(p_0+1\) parameters.
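A small worked example of the F test for nested models (simulated data; the helper `rss` is just for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N, p1, p0 = 200, 3, 1
X_big = np.column_stack([np.ones(N), rng.normal(size=(N, p1))])  # bigger model
X_small = X_big[:, : p0 + 1]                                     # nested smaller model
y = X_big @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=N)

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta_hat
    return r @ r

F = ((rss(X_small, y) - rss(X_big, y)) / (p1 - p0)) / (rss(X_big, y) / (N - p1 - 1))
p_value = stats.f.sf(F, p1 - p0, N - p1 - 1)  # compare against F_{p1-p0, N-p1-1}
```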

Then, we can derive the confidence interval as well.

In this subsection, we focus only on estimation of an arbitrary linear combination of the parameters, \(\theta=a'\beta\).

The least squares estimate of \(a'\beta\) is \[\hat\theta=a'\hat\beta=a'(X'X)^{-1}X'y.\] If we assume that the linear model is correct, then this estimate is unbiased.

**Gauss-Markov Theorem** For any other linear estimator \(\tilde\theta=c'y\) that is unbiased for \(a'\beta\), we have \[
Var(a'\hat\beta)\leqslant Var(c'y).
\] Note: unbiased estimators are not necessarily better than biased ones, since trading a small amount of bias for a large reduction in variance can lower the mean squared error.
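A short proof sketch (the standard argument, filled in here for completeness): unbiasedness for every \(\beta\) forces \(c'X=a'\), so \[ Var(c'y)-Var(a'\hat\beta)=\sigma^2 c'c-\sigma^2 a'(X'X)^{-1}a=\sigma^2 c'c-\sigma^2 c'Hc=\sigma^2 c'(I-H)c\geqslant 0, \] since \(I-H\) is a projection matrix and hence positive semi-definite.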

**\(\bullet\)** MSE vs. EPE

For \(Y_0=f(x_0)+\varepsilon_0\), the EPE of \(\tilde f(x_0)=x_0'\tilde\beta\) is \[ EPE=E(Y_0-\tilde{f}(x_0))^2=\sigma^2+E(x_0'\tilde\beta-f(x_0))^2=\sigma^2+MSE(\tilde f(x_0)), \] where the cross term vanishes because \(\varepsilon_0\) is independent of the training data and hence of \(\tilde\beta\).

**\(\bullet\)** Univariate model (without intercept)

\(Y=X\beta+\varepsilon\), and the least squares estimate is \(\hat\beta=\frac{\langle x,y\rangle}{\langle x,x\rangle}\), where \(\langle\cdot,\cdot\rangle\) denotes the inner product.

**Fact:** when inputs are orthogonal, they have no effect on each other’s parameter estimates in the model.
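A quick numerical check of this fact (simulated data with orthogonal columns, not from the text): the multiple regression coefficients coincide with the univariate estimates \(\langle x_j,y\rangle/\langle x_j,x_j\rangle\).

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100
X, _ = np.linalg.qr(rng.normal(size=(N, 3)))  # columns of X are orthogonal (in fact orthonormal)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=N)

# Multiple regression on all columns jointly
beta_multi = np.linalg.lstsq(X, y, rcond=None)[0]

# Univariate fits <x_j, y> / <x_j, x_j>, one column at a time
beta_uni = np.array([X[:, j] @ y / (X[:, j] @ X[:, j]) for j in range(3)])

print(np.allclose(beta_multi, beta_uni))  # True: orthogonal inputs do not affect each other
```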

The idea of the following algorithm is similar to the Gram-Schmidt process, but without the normalization step.