Support Vector Machine
One of the most influential approaches to supervised learning is the support vector machine. This model is similar to logistic regression in that it is driven by a linear function $w^\top x + b$. Unlike logistic regression, the support vector machine does not provide probabilities, but only outputs a class identity. The SVM predicts that the positive class is present when $w^\top x + b$ is positive. Likewise, it predicts that the negative class is present when $w^\top x + b$ is negative.
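To make the decision rule concrete, here is a minimal sketch in NumPy; the weight vector $w$ and bias $b$ are assumed to have already been learned, and the values used below are purely hypothetical:

```python
import numpy as np

def svm_predict(w, b, x):
    """Predict a class label from the sign of the linear score w^T x + b."""
    score = np.dot(w, x) + b
    return 1 if score > 0 else -1  # positive class when the score is positive

# Hypothetical learned parameters, for illustration only
w = np.array([0.5, -1.2])
b = 0.3
print(svm_predict(w, b, np.array([2.0, 0.5])))   # score = 0.7  -> class +1
print(svm_predict(w, b, np.array([-1.0, 1.0])))  # score = -1.4 -> class -1
```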
Kernel Trick
One key innovation associated with support vector machines is the kernel trick. The kernel trick consists of observing that many machine learning algorithms can be written exclusively in terms of dot products between examples. For example, it can be shown that the linear function used by the support vector machine can be re-written as
$$w^\top x + b = b + \sum_{i=1}^{m} \alpha_i x^\top x^{(i)},$$
where $x^{(i)}$ is a training example and $\alpha$ is a vector of coefficients. Rewriting the learning algorithm this way enables us to replace $x$ with the output of a given feature function $\phi(x)$ and the dot product with a function $k(x, x^{(i)}) = \phi(x) \cdot \phi(x^{(i)})$ called a kernel. The $\cdot$ operator represents an inner product analogous to $\phi(x)^\top \phi(x^{(i)})$. For some feature spaces, we may not use literally the vector inner product. In some infinite-dimensional spaces, we need to use other kinds of inner products, for example, inner products based on integration rather than summation. A complete development of these kinds of inner products is beyond the scope of this book. After replacing dot products with kernel evaluations, we can make predictions using the function
$$f(x) = b + \sum_i \alpha_i k\left(x, x^{(i)}\right).$$
This function is nonlinear with respect to $x$, but the relationship between $\phi(x)$ and $f(x)$ is linear. Also, the relationship between $\alpha$ and $f(x)$ is linear. The kernel-based function is exactly equivalent to preprocessing the data by applying $\phi(x)$ to all inputs, then learning a linear model in the new transformed space.
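As a sketch of this idea, the prediction function $f(x) = b + \sum_i \alpha_i k(x, x^{(i)})$ can be evaluated as a weighted sum of kernel values against the training set. The coefficients $\alpha$ and bias $b$ below are hypothetical placeholders for the output of a training procedure:

```python
import numpy as np

def kernel_predict(kernel, X_train, alpha, b, x):
    """Evaluate f(x) = b + sum_i alpha_i * k(x, x_i) over the training examples."""
    return b + sum(a * kernel(x, x_i) for a, x_i in zip(alpha, X_train))

# Using the ordinary dot product as the kernel recovers the linear model.
def linear_kernel(u, v):
    return np.dot(u, v)

X_train = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy training examples
alpha = np.array([0.8, -0.5])                  # hypothetical learned coefficients
b = 0.1
print(kernel_predict(linear_kernel, X_train, alpha, b, np.array([1.0, 1.0])))
```

Swapping `linear_kernel` for any other valid kernel changes the feature space implicitly, without changing the prediction code.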
The kernel trick is powerful for two reasons. First, it enables us to learn models that are nonlinear as a function of $x$ using convex optimization techniques that are guaranteed to converge efficiently. This is possible because we consider $\phi$ fixed and optimize only $\alpha$; that is, the optimization algorithm can view the decision function as being linear in a different space. Second, the kernel function $k$ often admits an implementation that is significantly more computationally efficient than naively constructing two $\phi(x)$ vectors and explicitly taking their dot product. In some cases, $\phi(x)$ can even be infinite dimensional, which would result in an infinite computational cost for the naive, explicit approach. In many cases, $k(x, x')$ is a nonlinear, tractable function of $x$ even when $\phi(x)$ is intractable. As an example of an infinite-dimensional feature space with a tractable kernel, we construct a feature mapping $\phi(x)$ over the non-negative integers $x$. Suppose that this mapping returns a vector containing $x$ ones followed by infinitely many zeros. We can write a kernel function $k(x, x^{(i)}) = \min(x, x^{(i)})$ that is exactly equivalent to the corresponding infinite-dimensional dot product.
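A small sketch of this construction: the infinite feature vector of $x$ ones can only be materialized up to a finite truncation, yet the kernel $\min(x, x')$ reproduces its dot product exactly and at constant cost:

```python
def phi_truncated(x, dim):
    """Truncated version of the infinite feature map: x ones followed by zeros."""
    return [1] * x + [0] * (dim - x)

def min_kernel(x, x_prime):
    """Equivalent to the dot product of the (infinite) feature vectors."""
    return min(x, x_prime)

x, x_prime = 3, 5
explicit = sum(a * b for a, b in zip(phi_truncated(x, 10), phi_truncated(x_prime, 10)))
print(explicit, min_kernel(x, x_prime))  # both print 3
```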
Gaussian Kernel
The most commonly used kernel is the Gaussian kernel
$$k(u, v) = \mathcal{N}(u - v;\, 0,\, \sigma^2 I),$$
where $\mathcal{N}(x; \mu, \Sigma)$ is the standard normal density. This kernel is also known as the radial basis function (RBF) kernel, because its value decreases along lines in $v$ space radiating outward from $u$. The Gaussian kernel corresponds to a dot product in an infinite-dimensional space, but the derivation of this space is less straightforward than in our example of the min kernel over the integers.
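The following sketch evaluates the Gaussian kernel exactly as defined above, as an isotropic Gaussian density evaluated at $u - v$; note that many software implementations drop the normalizing constant, which only rescales the kernel:

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = N(u - v; 0, sigma^2 I): an isotropic Gaussian density at u - v."""
    d = u.shape[0]
    sq_dist = np.sum((u - v) ** 2)
    norm_const = (2 * np.pi * sigma ** 2) ** (-d / 2)  # often omitted in practice
    return norm_const * np.exp(-sq_dist / (2 * sigma ** 2))

u = np.array([0.0, 0.0])
print(gaussian_kernel(u, np.array([0.1, 0.0])))  # near u      -> large response
print(gaussian_kernel(u, np.array([3.0, 4.0])))  # far from u  -> near zero
```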
We can think of the Gaussian kernel as performing a kind of template matching. A training example $x$ associated with training label $y$ becomes a template for class $y$. When a test point $x'$ is near $x$ according to Euclidean distance, the Gaussian kernel has a large response, indicating that $x'$ is very similar to the $x$ template. The model then puts a large weight on the associated training label $y$. Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
Support vector machines are not the only algorithm that can be enhanced using the kernel trick. Many other linear models can be enhanced in this way. The category of algorithms that employ the kernel trick is known as kernel machines, or kernel methods. A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the $i$-th example contributes a term $\alpha_i k(x, x^{(i)})$ to the decision function. Support vector machines are able to mitigate this by learning an $\alpha$ vector that contains mostly zeros. Classifying a new example then requires evaluating the kernel function only for the training examples that have nonzero $\alpha_i$. These training examples are known as support vectors.
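As a usage sketch (assuming scikit-learn is available), a fitted SVC exposes the support vectors it retained, so prediction only requires kernel evaluations against that typically small subset of the training set:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a simple nonlinear labeling rule

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
# Only the training examples with nonzero alpha are stored as support vectors.
print(clf.support_vectors_.shape[0], "support vectors out of", X.shape[0], "training examples")
print(clf.predict([[0.5, 0.5], [-0.5, 0.5]]))
```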