In this chapter, we revisit the classification problem and focus on linear methods for classification.

**\(\bullet\)** What does linear here mean?

Depending on the prediction function, the boundaries of the input space can be rough or smooth. For an important class of the methods, these boundaries are linear.

While this entire chapter is devoted to linear decision boundaries, there is considerable scope for generalization. For example, we can expand our variable set \(X_1,\cdots, X_p\) by including their squares and cross-products \(X_1^2,X_2^2,\cdots,X_1X_2,\cdots\), thereby adding \(p(p+1)/2\) additional variables. Linear functions in the augmented space map down to quadratic functions in the original space — hence linear decision boundaries to quadratic decision boundaries.This approach can be used with any basis transformation \(h(X)\), where \(h:\mathbb{R}^p\to\mathbb{R}^q\), with \(q>p\),and will be explored in later chapters.

In this section, a basic method is introduced. This method may have some problems which are solved to some extent by some other methods in later sections.

Here, each of the respinse categories are coded via an indicator variable. If \(\mathcal G\) has \(K\) classes, there will be \(K\) such indicators \(Y_k,k=1,2,\cdots, K\), with \(Y_k=1\), if \(G=k\), else \(0\). These are collected together in a vector \(Y =(Y_1,\cdots, Y_K)\), and the \(N\) training instances of these form an \(N\times K\) indicator response matrix \(Y\). Now, we can fit the linear regression model to each of the columns of \(Y\) simultaneously, and the fit is given by \[ \hat Y =X(X'X)^{-1}X'Y, \] and the coefficient matrix is \[\hat B=(X'X)^{-1}X'Y.\] To classify a new observation with input \(x\), we first comput the fitted output \(\hat f(x)^T=x\hat B\), a \(K\) vector,and then identify the largest component and classify accordingly.

**\(\triangle\)** Why this method works? What is the rationale?

**\(\bullet\)** One justification is to view the regression as an estimate of conditional expectation.

For the random variable \(Y_k, E(Y_k|X=x)=P(G=k|X=x)\), so conditional expectation of each of the \(Y_k\) seems a sensible goal. The real issue is: how good an approximation to conditional expecttion is the rather rigid linear regression model? Alternatively, are the \(\hat f(x)\) reasonable estimates of the posterior probabilities \(P(G=k|X=x)\), and more importantly, does this matter?

Although we can verify that \(\sum_{k\in\mathcal{G}}\hat f_k(x)=1\) for any \(x\), the \(\hat{f_k}(x)\) can be negative or greater than \(1\).

**\(\bullet\)** Another more simplistic viewpoint is to construct target \(t_k\) for each class, where \(t_k\) is the \(k\)th column of the \(K\times K\) identity matrix.\(y_i=t_k\) if \(g_i=k\). Then we can fit the linear model by least squares:

\[min_B\sum_{i=1}^N\|y_i-B'x_i\|^2.\] A new observation is classified by computing its fitted vector \(\hat f(x)\) and classifying to the closest target: \[\hat G(x)=argmin_k\|\hat f(x)-t_k\|^2\]

This is exactly the same as the previous approach.

**\(\triangle\) A serious problem of this method:**

When the number of classes \(K\geqslant 3\), the model may work badly(especially prevalent when \(K\) is large). Because of the rigid nature of the regression model, classes can be masked by others.

Decision theory for classification (Section 2.4) tells us that we need to know the class posteriors \(P(G|X)\) for optimal classification. Suppose \(f_k(x)\) is the class-conditional density of \(X\) in the class \(G=k\), and let \(\pi_k\) be the prior probability of class \(k\), with \(\sum_{k=1}^K \pi_k=1\). A simple application of Bayes Theorem gives us \[ P(G=k|X=x)=\frac{f_k(x)\pi_k}{\sum_{l=1}^Kf_l(x)\pi_l}. \] We see that in terms of ability to classify, having the \(f_k(x)\) is almost equivalent to having the quantity \(P(G=k|X=x)\).

Suppose that we model each class density as multivariate Gaussian \[ f_k(x)=\frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)). \] Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix \(\Sigma_k=\Sigma,\forall k\). In comparing tow classes \(k\) and \(l\), it is sufficient to look at the log-ratio \[ \log\frac{P(G=k|X=x)}{P(G=l|X=x)}=\log\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+x^T\Sigma^{-1}(\mu_k-\mu_l). \] This is an equation linear in \(x\). It implies that the decision boundary between classes \(k\) and \(l\) — the set where \(P(G=k|X=x)=P(G=l|X=x)\) is linear in \(x\) in a \(p\) dimensions hyperplane.

We can also see that the linear discriminant functions \[ \delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\mu_k \] are an equivalent description of the decision rule, with \(G(x)=argmax_k\delta_k(x)\).

In practice, we do not know the parameters of the Gaussian distribution and will need to estimate them using our training data:

**\(\bullet\)** \(\hat\pi_k=N_k/N\), where \(N_k\) is the number of class-\(k\) observations;

**\(\bullet\)** \(\hat\mu_k=\sum_{g_i=k}x_i/N_k\);

**\(\bullet\)** \(\hat\Sigma=\sum_{k=1}^K\sum_{g_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T/(N-K)\).

**\(\triangle\) What if \(\Sigma_k\) are not assumed to be equal?**

The pieces quadratic in \(x\) remain. We then get quadratic discriminant functions(QDA), \[ \delta_k(x)=-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)+\log\pi_k. \] The decision boundary between each pair of classes \(k\) and \(l\) is described by a quadratic equation \(\{x:\delta_k(x)=\delta_l(x)\}\).

In this subsection, a method which is a compromise between LDA and QDA is introduced.

A regularized covarance matrices have the form:

\[ \hat\Sigma_k(\alpha)=\alpha\hat\Sigma_k+(1-\alpha)\hat{\Sigma}, \] where \(\hat\Sigma\) is the polled covariance matrix as used in LDA. Here \(\alpha\in[0,1]\) aloows a continuum of models between LDA and QDA, and needs to be specified (by cross-validation, etc.).

Similar modifications allow \(\hat\Sigma\) itself to be shrunk toward the scalar covariance, \[ \hat\Sigma(\gamma)=\gamma\hat\Sigma+(1-\gamma)\hat\sigma^2I \] for \(\gamma\in[0,1]\).Replacing \(\hat\Sigma\) by\(\hat\Sigma(\gamma)\) leads to a more general family of covariances \(\hat\Sigma(\alpha,\gamma)\) indexed by a pair of parameters.

The computations of LDA and especially QDA are simplified by diagonalizing \(\hat\Sigma\) or \(\hat\Sigma_k\).

For the latter, suppose we compute the eigen-decomposition for each \(\hat\Sigma_k=U_kD_kU_k^T\), where \(U_k\) is \(p\times p\) orthonormal, and \(D_k\) a diagonal matrix of positive eigenvalues \(d_{kl}\). Then the ingredients for \(\delta_k(x)\) are \[(x-\hat\mu_k)^T\hat\Sigma_k^{-1}(x-\hat\mu_k)=[U_k^T(x-\hat\mu_k)]^TD_k^{-1}[U_k^T(x-\hat\mu_k)];\] \[\log|\hat\Sigma_k|=\Sigma_l\log d_{kl}.\]

Motivations:

**\(\bullet\)** Gaussian classification with common covariances leads to linear dicision boundaries. Classification can be achieved by sphering the data with respect to \(W\), and classifying to the closest centroid (modulo \(\log \pi_k\)) in the sphered space.

**\(\bullet\)** Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.

**\(\bullet\)** This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.

For the detailed discussions, please refer to the textbook.

The logistic model arises from the desire to model the posterior probabilities of the \(K\) classes via linear functions in \(x\), while at the same time ensuring that they sum to one and remain in \([0,1]\).

The model has the form:

\[ \log\frac{P(G=i|X=x)}{P(G=K|X=x)}=\beta_{i0}+\beta_i^Tx,\ \ (i=1,2,\cdots,K-1). \] The model is specified in terms of \(K-1\) log-odds or logit transformations. Although the model uses the last class as the denominator in the odds-ratios, the choice of denominator is arbitrary in that the estimates are equivariant under this choice. A simple calculation shows taht

\[P(G=k|X=x)=\dfrac{\exp(\beta_{k0}+\beta_k^Tx)}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0}+\beta_l^Tx)},\ k=1,2,\cdots, K-1,\] \[P(G=K|X=x)=\dfrac{1}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0}+\beta_l^Tx)},\] and clearly sum to one. To emphasize the dependence on the entire parameter set \(\theta=\{\beta_{10},\beta_1^T,\cdots,\beta_{(K-1)0},\beta_{K-1}^T\}\), we denote the probabilities \(P(G=k|X=x)=P_k(x;\theta)\).

Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of \(G\) given \(X\). Since \(P(G|X)\) completely specifies the conditional distribution, the multinomial distribution is approproate. The log-likelihood for \(N\) obserbations is \[l(\theta)=\sum_{i=1}^N\log p_{g_i}(x_i;\theta),\] where \(p_k(x_i;\theta)=P(G=k|X=x_i;\theta)\).

To maximize it, we need to solve the equation \(\frac{\partial l(\beta)}{\partial\beta}=0\), with Newton-Raphson algorithm. The whole method is referred to as iteratively reweighted least squares (IRLS). For detailed discuss, please refer to the textbook.

Logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome.

The maximum-likelihood parameter estimates \(\hat\beta\) satisfy a self-consistency relationship: they are the coefficients of a weighted least squares fit, where the responses are

\[z_i=x_i^T\hat\beta+\frac{(y_i-\hat p_i)}{\hat p_i(1-\hat p_i)},\] and the weights are \(w_i=\hat p_i(1-\hat p_i)\), both depending on \(\hat\beta\) itself. Apart from providing a convenient algorithm, this connection with least quares has more to offer:

**\(\bullet\)** The weighted residual sum-of-squares is the familiar Pearson chi-square statistic \[
\sum_{i=1}^N\frac{(y_i-\hat p_i)^2}{\hat p_i(1-\hat p_i)},
\] a quadratic approximation to the deviance;

**\(\bullet\)** Asymptotic likelihood throey says that if the model is correct, then \(\hat\beta\) is consistent (i.e., converges to the true \(\beta\)).

**\(\bullet\)** A central limit theorem then shows that the distribution of \(\hat\beta\) converges to \(N(\beta,(X'WX)^{-1})\). This and other asymptotics can be derived directly from the weighted least squares fit by mimicking normal theory inference.

**\(\bullet\)** Model building can be costly for logistic regression models, because each model fitted requires iteration. Popular shortcuts are the Rao score test which tests for inclusion of a term, and the Wald test which can be used to test for exclusion of a term. Neither of these require iterative fitting, and are based on the maximum-likelihood fit of the current model. It turns out that both of these amount to adding or dropping a term from the weighted least squares fit, using the same weights. Such computations can be done efficiently, without recomputing the entire weighted least squares fit.

The \(L_1\) penalty used in the lasso can be used for variable selection and shinkage with any linear regression model. For logistic regression, we would maximize a penalized version:

\[ \max_{\beta_0,\beta}\{\sum_{i=1}^N\log p_{g_i}(x_i;\theta)-\lambda\sum_{j=1}^p|\beta_j|\}, \] where \[\log p_{g_i}(x_i;\theta)=y_i(\beta_0+\beta^Tx_i)-\log(1+\exp(\beta_0+\beta^Tx_i)).\] As with the lasso, we typically do not penalize the intercept term and standardize the predictors for the penalty to be meaningful. A solution can be found using nonlinear programming methods.

The main idea of this method is to find hyperplanes (classifier), which construct liner dicision boundaries that explicitly try to separate the data into different classes as well as possible.

For equation \(f(x)=\beta_0+\beta^Tx=0\), it defines a hyperplane (or affine set) \(L\). We have some properties:

**\(\bullet\)** \(\beta^*=\beta/\|\beta\|\) is the vector normal to the hyperplane \(L\), since for any \(x_1,x_2\) in \(L\), \(\beta^T(x_1-x_2)=0\);

**\(\bullet\)** For any point \(x_0\) in \(L\), \(\beta^Tx_0=-\beta_0\)

**\(\bullet\)** The signed distance of any point \(x\) to \(L\) is given by \(\beta^{*T}(x-x_0)=\frac{1}{\|\beta\|}(\beta^Tx+\beta_0)=\frac{1}{\|f'(x)\|}f(x)\). Hence \(f(x)\) is proportional to the signed distance from \(x\) to the hyperplane defined by \(f(x)=0\).

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

The goal is to minimize \[D(\beta,\beta_0)=-\sum_{i\in \mathcal{M}}y_i(x_i^T\beta+\beta_0),\] where \(\mathcal M\) indexes the set of misclassified points. The algorithm in fact uses stochastic gradient descent to minimize the piecewise linear criterion.

There are a number of problems with this algorithm, summarized in Ripley(1996):

**\(\bullet\)** When the data are separable, there are many solutions, and which one is found depends on the starting values.

**\(\bullet\)** The “finite” number of steps can be very large. The smaller the gap, the longer the time to find it.

**\(\bullet\)** When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.