Naive Bayes Classifier

In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.

Introduction

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.

Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests.

An advantage of naive Bayes is that it only requires a small number of training data to estimate the parameters necessary for classification.

Probabilistic model

Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector $x = (x_1, \dots, x_n)$ epresenting some n features (independent variables), it assigns to this instance probabilities

$p(C_k|x_1,\dots,x_n)$ for each of K possible outcomes or classes $C_k$ .

The problem with the above formulation is that if the number of features $n$ is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as

$p(C_k|x)=\frac{p(C_k)p(c|C_k)}{p(x)}$

In plain English, using Bayesian probability terminology, the above equation can be written as

$posterior=\frac{prior \times likelihood}{evidence}$

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features $x_i$ are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

$p(C_k,x_1,\dots,x_n)$

which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

$p(C_k,x_1,\dots,x_n)=p(x_1,\dots,x_n,C_k)=p(x_1|x_2,\dots,x_n,C_k)p(x_2,\dots,x_n,C_k)=p(x_1|x_2,\dots,x_n,C_k)p(x_2|x_3,\dots,x_n,C_k)p(x_3,\dots,x_n,C_k)=\dots=p(x_1|x_2,\dots,x_n,C_k)p(x_2|x_3,\dots,x_n,C_k)\dots p(x_{n-1}|x_n,C_k)p(x_n|C_k)p(C_k)$

Now the "naive" conditional independence assumptions come into play: assume that each feature $x_i$ is conditionally independent of every other feature $x_j$ for $j\neq i$ , given the category $C_k$ . This means that

$p(x_i|x_{i+1},\dots,x_n,C_k)=p(x_i|C_k)$

Thus, the joint model can be expressed as

$p(C_k|x_1,\dots,x_n)\propto p(C_k,x_1,\dots,x_n)=p(C_k)p(x_1|C_k)p(x_2|C_k)p(x_3|C_k)\dots=p(C_k)\prod_{i=1}^{n}p(x_i|C_k)$

This means that under the above independence assumptions, the conditional distribution over the class variable $C$ is:

$p(C_k|x_1,\dots,x_n)=\frac{1}{Z}p(C_k)\prod_{i=1}^{n}p(x_i|C_k)$

where the evidence

$Z=p(x)=\sum_{k}p(C_k)p(x|C_k)$

is a scaling factor dependent only on $x_1,\dots,x_n$ , that is, a constant if the values of the feature variables are known.

Constructing a classifier from the probability model

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class labe

$\widehat{y} = argmax(p(C_k)\prod_{i=1}^{n}p(x_i|C_k)$

Page structure

Concept map →