Clustering (pt II)
Clustering performance evaluation
Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision and recall of a supervised classification algorithm. In particular any evaluation metric should not take the absolute values of the cluster labels into account but rather if this clustering define separations of the data similar to some ground truth set of classes or satisfying some assumption such that members belong to the same class are more similar that members of different classes according to some similarity metric.
Adjusted Rand index
Given the knowledge of the ground truth class assignments labels_true
and our clustering algorithm assignments of the same samples labels_pred
, the adjusted Rand index is a function that measures the similarity of the two assignments, ignoring permutations and with chance normalization:
>>> from sklearn import metrics >>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]>>> metrics.adjusted_rand_score(labels_true, labels_pred) 0.24...
One can permute 0 and 1 in the predicted labels, rename 2 to 3, and get the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...
Furthermore, adjusted_rand_score
is symmetric: swapping the argument does not change the score. It can thus be used as a consensus measure:
>>> metrics.adjusted_rand_score(labels_pred, labels_true)
0.24...
Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
1.0
Bad (e.g. independent labelings) have negative or close to 0.0 scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.12...
Advantages
 Random (uniform) label assignments have a ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Rand index or the Vmeasure for instance).
 Bounded range [1, 1]: negative values are bad (independent labelings), similar clusterings have a positive ARI, 1.0 is the perfect match score.
 No assumption is made on the cluster structure: can be used to compare clustering algorithms such as kmeans which assumes isotropic blob shapes with results of spectral clustering algorithms which can find cluster with “folded” shapes.
Drawbacks
 Contrary to inertia, ARI requires knowledge of the ground truth classes while is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection (TODO).
Mathematical formulation
If C is a ground truth class assignment and K the clustering, let us define and as:
 , the number of pairs of elements that are in the same set in C and in the same set in K
 , the number of pairs of elements that are in different sets in C and in different sets in K
The raw (unadjusted) Rand index is then given by:
Where is the total number of possible pairs in the dataset (without ordering).
However the RI score does not guarantee that random label assignments will get a value close to zero (esp. if the number of clusters is in the same order of magnitude as the number of samples).
To counter this effect we can discount the expected RI of random labelings by defining the adjusted Rand index as follows:
Mutual Information based scores
Given the knowledge of the ground truth class assignments labels_true
and our clustering algorithm assignments of the same samples labels_pred
, the Mutual Information is a function that measures the agreement of the two assignments, ignoring permutations. Two different normalized versions of this measure are available, Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI). NMI is often used in the literature, while AMI was proposed more recently and is normalized against chance:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.22504...
One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.22504...
All, mutual_info_score
, adjusted_mutual_info_score
and normalized_mutual_info_score
are symmetric: swapping the argument does not change the score. Thus they can be used as a consensus measure:
>>> metrics.adjusted_mutual_info_score(labels_pred, labels_true)
0.22504...
Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.adjusted_mu
tual_info_score(labels_true, labels_pred)
1.0
>>> metrics.normalized_mutual_info_score(labels_true, labels_pred)
1.0
This is not true for mutual_info_score
, which is therefore harder to judge:
>>> metrics.mutual_info_score(labels_true, labels_pred)
0.69...
Bad (e.g. independent labelings) have nonpositive scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.10526...
Advantages

Random (uniform) label assignments have a AMI score close to 0.0 for any value of
n_clusters
andn_samples
(which is not the case for raw Mutual Information or the Vmeasure for instance).  Upper bound of 1: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, an AMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).
Drawbacks

Contrary to inertia, MIbased measures require the knowledge of the ground truth classes while almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However MIbased measures can also be useful in purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.
 NMI and MI are not adjusted against chance.
Mathematical formulation
Assume two label assignments (of the same N objects), and . Their entropy is the amount of uncertainty for a partition set, defined by:
where is the probability that an object picked at random from falls into class . Likewise for :
With . The mutual information (MI) between and is calculated by:
where is the probability that an object picked at random falls into both classes and .
It also can be expressed in set cardinality formulation:
The normalized mutual information is defined as
This value of the mutual information and also the normalized variant is not adjusted for chance and will tend to increase as the number of different labels (clusters) increases, regardless of the actual amount of “mutual information” between the label assignments.
The expected value for the mutual information can be calculated using the following equation. In this equation, (the number of elements in ) and (the number of elements in ).
Using the expected value, the adjusted mutual information can then be calculated using a similar form to that of the adjusted Rand index:
For normalized mutual information and adjusted mutual information, the normalizing value is typically some generalized mean of the entropies of each clustering. Various generalized means exist, and no firm rules exist for preferring one over the others. The decision is largely a fieldbyfield basis; for instance, in community detection, the arithmetic mean is most common. Each normalizing method provides “qualitatively similar behaviours” . In our implementation, this is controlled by the average_method
parameter.
Vinh et al. (2010) named variants of NMI and AMI by their averaging method. Their ‘sqrt’ and ‘sum’ averages are the geometric and arithmetic means; we use these more broadly common names.
Homogeneity, completeness and Vmeasure
Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metric using conditional entropy analysis.
In particular Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assignment:
 homogeneity: each cluster contains only members of a single class.
 completeness: all members of a given class are assigned to the same cluster.
We can turn those concept as scores homogeneity_score
and completeness_score
. Both are bounded below by 0.0 and above by 1.0 (higher is better):
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.homogeneity_score(labels_true, labels_pred)
0.66...
>>> metrics.completeness_score(labels_true, labels_pred)
0.42...
Their harmonic mean called Vmeasure is computed by v_measure_score
:
>>> metrics.v_measure_score(labels_true, labels_pred)
0.51...
The Vmeasure is actually equivalent to the mutual information (NMI) discussed above, with the aggregation function being the arithmetic mean.
Homogeneity, completeness and Vmeasure can be computed at once using homogeneity_completeness_v_measure
as follows:
>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
...
(0.66..., 0.42..., 0.51...)
The following clustering assignment is slightly better, since it is homogeneous but not complete:
>>> labels_pred = [0, 0, 0, 1, 2, 2]
>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
...
(1.0, 0.68..., 0.81...)
Note:
v_measure_score
is symmetric: it can be used to evaluate the agreement of two independent assignments on the same dataset.
This is not the case for completeness_score and homogeneity_score: both are bound by the relationship:homogeneity_score(a, b) == completeness_score(b, a)
Advantages
 Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.
 Intuitive interpretation: clustering with bad Vmeasure can be qualitatively analyzed in terms of homogeneity and completeness to better feel what ‘kind’ of mistakes is done by the assignment.
 No assumption is made on the cluster structure: can be used to compare clustering algorithms such as kmeans which assumes isotropic blob shapes with results of spectral clustering algorithms which can find cluster with “folded” shapes.
Drawbacks

The previously introduced metrics are not normalized with regards to random labeling: this means that depending on the number of samples, clusters and ground truth classes, a completely random labeling will not always yield the same values for homogeneity, completeness and hence vmeasure. In particular random labeling won’t yield zero scores especially when the number of clusters is large.
This problem can safely be ignored when the number of samples is more than a thousand and the number of clusters is less than 10. For smaller sample sizes or larger number of clusters it is safer to use an adjusted index such as the Adjusted Rand Index (ARI).
 These metrics require the knowledge of the ground truth classes while almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
Mathematical formulation
Homogeneity and completeness scores are formally given by:
where is the conditional entropy of the classes given the cluster assignments and is given by:
and is the entropy of the classes and is given by:
with the total number of samples, and the number of samples respectively belonging to class and cluster , and finally the number of samples from class assigned to cluster .
The conditional entropy of clusters given class and the entropy of clusters are defined in a symmetric manner.
Rosenberg and Hirschberg further define Vmeasure as the harmonic mean of homogeneity and completeness:
FowlkesMallows scores
The FowlkesMallows index (sklearn.metrics.fowlkes_mallows_score
) can be used when the ground truth class assignments of the samples is known. The FowlkesMallows score FMI is defined as the geometric mean of the pairwise precision and recall:
Where TP
is the number of True Positive (i.e. the number of pair of points that belong to the same clusters in both the true labels and the predicted labels), FP
is the number of False Positive (i.e. the number of pair of points that belong to the same clusters in the true labels and not in the predicted labels) and FN
is the number of False Negative (i.e the number of pair of points that belongs in the same clusters in the predicted labels and not in the true labels).
The score ranges from 0 to 1. A high value indicates a good similarity between two clusters.
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>>
>>> metrics.fowlkes_
mallows_score(labels_true, labels_pred)
0.47140...
One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.47140...
Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
1.0
Bad (e.g. independent labelings) have zero scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.0
Advantages
 Random (uniform) label assignments have a FMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the Vmeasure for instance).
 Upperbounded at 1: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, values of exactly 0 indicate purely independent label assignments and a FMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).
 No assumption is made on the cluster structure: can be used to compare clustering algorithms such as kmeans which assumes isotropic blob shapes with results of spectral clustering algorithms which can find cluster with “folded” shapes.
Drawbacks
 Contrary to inertia, FMIbased measures require the knowledge of the ground truth classes while almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
Silhouette Coefficient
If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score
) is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
 a: The mean distance between a sample and all other points in the same class.
 b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then given as:
The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> dataset = datasets.load_iris()
>>> X = dataset.data
>>> y = dataset.target
In normal usage, the Silhouette Coefficient is applied to the results of a cluster analysis.
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.silhouette_score(X, labels, metric='euclidean')
...
0.55...
Advantages
 The score is bounded between 1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
 The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
Drawbacks
 The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.
CalinskiHarabaz Index
If the ground truth labels are not known, the CalinskiHarabaz index (sklearn.metrics.calinski_harabaz_score
)  also known as the Variance Ratio Criterion  can be used to evaluate the model, where a higher CalinskiHarabaz score relates to a model with better defined clusters.
For clusters, the CalinskiHarabaz score is given as the ratio of the betweenclusters dispersion mean and the withincluster dispersion:
where is the between group dispersion matrix and is the withincluster dispersion matrix defined by:
with be the number of points in our data, be the set of points in cluster , be the center of cluster , be the center of , be the number of points in cluster .
>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> dataset = datasets.load_iris()
>>> X = dataset.data
>>> y = dataset.target
In normal usage, the CalinskiHarabaz index is applied to the results of a cluster analysis.
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabaz_score(X, labels)
561.62...
Advantages
 The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
 The score is fast to compute
Drawbacks
 The CalinskiHarabaz index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.
DaviesBouldin Index
If the ground truth labels are not known, the DaviesBouldin index (sklearn.metrics.davies_bouldin_score
) can be used to evaluate the model, where a lower DaviesBouldin index relates to a model with better separation between the clusters.
The index is defined as the average similarity between each cluster for and its most similar one . In the context of this index, similarity is defined as a measure that trades off:
 , the average distance between each point of cluster and the centroid of that cluster – also know as cluster diameter.
 , the distance between cluster centroids and .
A simple choice to construct so that it is nonnegative and symmetric is:
Then the DaviesBouldin index is defined as:
Zero is the lowest possible score. Values closer to zero indicate a better partition.
In normal usage, the DaviesBouldin index is applied to the results of a cluster analysis as follows:
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> from sklearn.c
luster import KMeans
>>> from sklearn.metrics import davies_bouldin_score
>>> kmeans = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans.labels_
>>> davies_bouldin_score(X, labels)
0.6619...
Advantages
 The computation of DaviesBouldin is simpler than that of Silhouette scores.
 The index is computed only quantities and features inherent to the dataset.
Drawbacks
 The DaviesBoulding index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained from DBSCAN.
 The usage of centroid distance limits the distance metric to Euclidean space.
 A good value reported by this method does not imply the best information retrieval.
Contingency Matrix
Contingency matrix (sklearn.metrics.cluster.contingency_matrix
) reports the intersection cardinality for every true/predicted cluster pair. The contingency matrix provides sufficient statistics for all clustering metrics where the samples are independent and identically distributed and one doesn’t need to account for some instances not being clustered.
Here is an example:
>>> from sklearn.metrics.cluster import contingency_matrix
>>> x = ["a", "a", "a", "b", "b", "b"]
>>> y = [0, 0, 1, 1, 2, 2]
>>> contingency_matrix(x, y)
array([ [2, 1, 0],
[0, 1, 2] ])
The first row of output array indicates that there are three samples whose true cluster is “a”. Of them, two are in predicted cluster 0, one is in 1, and none is in 2. And the second row indicates that there are three samples whose true cluster is “b”. Of them, none is in predicted cluster 0, one is in 1 and two are in 2.
A confusion matrix for classification is a square contingency matrix where the order of rows and columns correspond to a list of classes.
Advantages
 Allows to examine the spread of each true cluster across predicted clusters and vice versa.
 The contingency table calculated is typically utilized in the calculation of a similarity statistic (like the others listed in this document) between the two clusterings.
Drawbacks
 Contingency matrix is easy to interpret for a small number of clusters, but becomes very hard to interpret for a large number of clusters.
 It doesn’t give a single metric to use as an objective for clustering optimisation.