Support Vector Machine
One of the most influential approaches to supervised learning is the support vector machine. This model is similar to logistic regression in that it is driven by a linear function $w^\top x + b$. Unlike logistic regression, the support vector machine does not provide probabilities, but only outputs a class identity. The SVM predicts that the positive class is present when $w^\top x + b$ is positive. Likewise, it predicts that the negative class is present when $w^\top x + b$ is negative.
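To make the decision rule concrete, here is a minimal sketch in NumPy; the weight vector $w$ and bias $b$ are assumed to have already been learned, and the values used below are purely hypothetical:

```python
import numpy as np

def svm_predict(w, b, x):
    """Predict a class label from the sign of the linear score w^T x + b."""
    score = np.dot(w, x) + b
    return 1 if score > 0 else -1  # positive class when the score is positive

# Hypothetical learned parameters, for illustration only
w = np.array([0.5, -1.2])
b = 0.3
print(svm_predict(w, b, np.array([2.0, 0.5])))   # score = 0.7  -> class +1
print(svm_predict(w, b, np.array([-1.0, 1.0])))  # score = -1.4 -> class -1
```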
Kernel Trick
One key innovation associated with support vector machines is the kernel trick. The kernel trick consists of observing that many machine learning algorithms can be written exclusively in terms of dot products between examples. For example, it can be shown that the linear function used by the support vector machine can be re-written as
$$w^\top x + b = b + \sum_{i=1}^{m} \alpha_i x^\top x^{(i)},$$
where $x^{(i)}$ is a training example and $\alpha$ is a vector of coefficients. Rewriting the learning algorithm this way enables us to replace $x$ with the output of a given feature function $\phi(x)$ and the dot product with a function $k(x, x^{(i)}) = \phi(x) \cdot \phi(x^{(i)})$ called a kernel. The $\cdot$ operator represents an inner product analogous to $\phi(x)^\top \phi(x^{(i)})$. For some feature spaces, we may not use literally the vector inner product. In some infinite-dimensional spaces, we need to use other kinds of inner products, for example, inner products based on integration rather than summation. A complete development of these kinds of inner products is beyond the scope of this book. After replacing dot products with kernel evaluations, we can make predictions using the function
$$f(x) = b + \sum_i \alpha_i k\left(x, x^{(i)}\right).$$
This function is nonlinear with respect to $x$, but the relationship between $\phi(x)$ and $f(x)$ is linear. Also, the relationship between $\alpha$ and $f(x)$ is linear. The kernel-based function is exactly equivalent to preprocessing the data by applying $\phi(x)$ to all inputs, then learning a linear model in the new transformed space.
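As a sketch of this idea, the prediction function $f(x) = b + \sum_i \alpha_i k(x, x^{(i)})$ can be evaluated as a weighted sum of kernel values against the training set. The coefficients $\alpha$ and bias $b$ below are hypothetical placeholders for the output of a training procedure:

```python
import numpy as np

def kernel_predict(kernel, X_train, alpha, b, x):
    """Evaluate f(x) = b + sum_i alpha_i * k(x, x_i) over the training examples."""
    return b + sum(a * kernel(x, x_i) for a, x_i in zip(alpha, X_train))

# Using the ordinary dot product as the kernel recovers the linear model.
def linear_kernel(u, v):
    return np.dot(u, v)

X_train = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy training examples
alpha = np.array([0.8, -0.5])                  # hypothetical learned coefficients
b = 0.1
print(kernel_predict(linear_kernel, X_train, alpha, b, np.array([1.0, 1.0])))
```

Swapping `linear_kernel` for any other valid kernel changes the feature space implicitly, without changing the prediction code.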
The kernel trick is powerful for two reasons. First, it enables us to learn models that are nonlinear as a function of $x$ using convex optimization techniques that are guaranteed to converge efficiently. This is possible because we consider $\phi$ fixed and optimize only $\alpha$; that is, the optimization algorithm can view the decision function as being linear in a different space. Second, the kernel function $k$ often admits an implementation that is significantly more computationally efficient than naively constructing two $\phi(x)$ vectors and explicitly taking their dot product. In some cases, $\phi(x)$ can even be infinite dimensional, which would result in an infinite computational cost for the naive, explicit approach. In many cases, $k(x, x')$ is a nonlinear, tractable function of $x$ even when $\phi(x)$ is intractable. As an example of an infinite-dimensional feature space with a tractable kernel, we construct a feature mapping $\phi(x)$ over the non-negative integers $x$. Suppose that this mapping returns a vector containing $x$ ones followed by infinitely many zeros. We can write a kernel function $k(x, x^{(i)}) = \min(x, x^{(i)})$ that is exactly equivalent to the corresponding infinite-dimensional dot product.
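A small sketch of this construction: the infinite feature vector of $x$ ones can only be materialized up to a finite truncation, yet the kernel $\min(x, x')$ reproduces its dot product exactly and at constant cost:

```python
def phi_truncated(x, dim):
    """Truncated version of the infinite feature map: x ones followed by zeros."""
    return [1] * x + [0] * (dim - x)

def min_kernel(x, x_prime):
    """Equivalent to the dot product of the (infinite) feature vectors."""
    return min(x, x_prime)

x, x_prime = 3, 5
explicit = sum(a * b for a, b in zip(phi_truncated(x, 10), phi_truncated(x_prime, 10)))
print(explicit, min_kernel(x, x_prime))  # both print 3
```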
Gaussian Kernel
The most commonly used kernel is the Gaussian kernel
$$k(u, v) = \mathcal{N}(u - v;\, 0,\, \sigma^2 I),$$
where $\mathcal{N}(x; \mu, \Sigma)$ is the standard normal density. This kernel is also known as the radial basis function (RBF) kernel, because its value decreases along lines in $v$ space radiating outward from $u$. The Gaussian kernel corresponds to a dot product in an infinite-dimensional space, but the derivation of this space is less straightforward than in our example of the min kernel over the integers.
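The following sketch evaluates the Gaussian kernel exactly as defined above, as an isotropic Gaussian density evaluated at $u - v$; note that many software implementations drop the normalizing constant, which only rescales the kernel:

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = N(u - v; 0, sigma^2 I): an isotropic Gaussian density at u - v."""
    d = u.shape[0]
    sq_dist = np.sum((u - v) ** 2)
    norm_const = (2 * np.pi * sigma ** 2) ** (-d / 2)  # often omitted in practice
    return norm_const * np.exp(-sq_dist / (2 * sigma ** 2))

u = np.array([0.0, 0.0])
print(gaussian_kernel(u, np.array([0.1, 0.0])))  # near u      -> large response
print(gaussian_kernel(u, np.array([3.0, 4.0])))  # far from u  -> near zero
```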
We can think of the Gaussian kernel as performing a kind of template matching. A training example $x$ associated with training label $y$ becomes a template for class $y$. When a test point $x'$ is near $x$ according to Euclidean distance, the Gaussian kernel has a large response, indicating that $x'$ is very similar to the $x$ template. The model then puts a large weight on the associated training label $y$. Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
Support vector machines are not the only algorithm that can be enhanced using the kernel trick. Many other linear models can be enhanced in this way. The category of algorithms that employ the kernel trick is known as kernel machines, or kernel methods. A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the $i$-th example contributes a term $\alpha_i k(x, x^{(i)})$ to the decision function. Support vector machines are able to mitigate this by learning an $\alpha$ vector that contains mostly zeros. Classifying a new example then requires evaluating the kernel function only for the training examples that have nonzero $\alpha_i$. These training examples are known as support vectors.
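As a usage sketch (assuming scikit-learn is available), a fitted SVC exposes the support vectors it retained, so prediction only requires kernel evaluations against that typically small subset of the training set:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a simple nonlinear labeling rule

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
# Only the training examples with nonzero alpha are stored as support vectors.
print(clf.support_vectors_.shape[0], "support vectors out of", X.shape[0], "training examples")
print(clf.predict([[0.5, 0.5], [-0.5, 0.5]]))
```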