Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
As shown in picture, each node of the decision tree is associated with a region in the input space, and internal nodes break that region into one subregion for each child of the node (typically using an axis-aligned cut). Space is thus subdivided into nonoverlapping regions, with a one-to-one correspondence between leaf nodes and input regions. Each leaf node usually maps every point in its input region to the same output. Decision trees are usually trained with specialized algorithms. The learning algorithm can be considered nonparametric if it is allowed to learn a tree of arbitrary size, though decision trees are usually regularized with size constraints that turn them into parametric models in practice. Decision trees as they are typically used, with axis-aligned splits and constant outputs within each node struggle to solve some problems that are easy even for logistic regression. For example, if we have a two-class problem, and the positive class occurs wherever , the decision boundary is not axis aligned. The decision tree will thus need to approximate the decision boundary with many nodes, implementing a step function that constantly walks back and forth across the true decision function with axis-aligned steps.
Using Entropy to Find Optimal Splits
The real problem here is not in using a decision tree, but in constructing one from data alone. At any step in the process we outlined in the example above, we need to determine which feature is the right one to split the data on. That is, we need to choose the labels for the interior nodes in so that the resulting data subsets are as homogeneous as possible. In particular, it would be nice to have a quantitative way to measure the quality of a split. Then at each step we could simply choose the feature whose split yields the highest value under this measurement.
While we won’t derive such a measurement, we will use one that has an extensive history of applications: Shannon entropy.
Let be a discrete probability distribution . Then the Shannon entropy of , denoted is
Where the logarithms are taken in base 2.
In English, there are n possible outcomes numbered 1 to , and the probability that an instance drawn from results in the outcome is . Then Shannon’s entropy function computes a numerical quantity describing how “dispersed” the outcomes are.
While there are many other useful interpretations of Shannon entropy, we only need it to describe how well the data is split into its classes. For our purposes, the probability distribution will simply be the observed proportions of data with respect to their class labels. In the case of Arya’s horse riding, the initial distribution would be (1/2, 1/2), giving an entropy of 1.
Let’s verify that Shannon’s entropy function makes sense for our problem. Specifically, the best scenario for splitting the data on a feature is a perfect split; that is, each subset only has data from one class. On the other hand, the worst case would be where each subset is uniformly distributed across all classes (if there are classes, then each subset has of its data from each class).
Indeed, if we adopt the convention that , then the entropy of consists of a single term . It is clear that this does not depend on the position of the 1 within the probability distribution. On the other hand, the entropy of is
A well-known property of the entropy function tells us that this is in fact the maximum value for this function.
Summarizing this, in the best case entropy is minimized after the split, and in the worst case entropy is maximized. But we can simply look at the entropy of each subset after splitting. We need a sensible way to combine these entropies and to compare them with the entropy of the data before splitting. In particular, we would quantify the “decrease” in entropy caused by a split, and maximize that quantity.
Let be a data set and a feature with values , and let denote Shannon’s entropy function. Moreover, let denote the subset of for which the feature has the value . The gain of a split along the feature , denoted is
That is, we are taking the difference of the entropy before the split, and subtracting off the entropies of each part after splitting, with an appropriate weight depending on the size of each piece. Indeed, if the entropy grows after the split (that is if the data becomes more mixed), then this number will be small. On the other hand if the split separates the classes nicely, each subset will have small entropy, and hence the value will be large.
So now the algorithm for building trees is apparent: at each stage, simply pick the feature for which the gain function is maximized, and split the data on that feature. Create a child node for each of the subsets in the split, and connect them via edges with labels corresponding to the chosen feature value for that piece.