# Clustering performance evaluation

Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision and recall of a supervised classification algorithm. In particular any evaluation metric should not take the absolute values of the cluster labels into account but rather if this clustering define separations of the data similar to some ground truth set of classes or satisfying some assumption such that members belong to the same class are more similar that members of different classes according to some similarity metric.

Given the knowledge of the ground truth class assignments labels_true and our clustering algorithm assignments of the same samples labels_pred, the adjusted Rand index is a function that measures the similarity of the two assignments, ignoring permutations and with chance normalization:

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]

>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...


One can permute 0 and 1 in the predicted labels, rename 2 to 3, and get the same score:

>>> labels_pred = [1, 1, 0, 0, 3, 3]
0.24...


Furthermore, adjusted_rand_score is symmetric: swapping the argument does not change the score. It can thus be used as a consensus measure:

>>> metrics.adjusted_rand_score(labels_pred, labels_true)
0.24...


Perfect labeling is scored 1.0:

 >>> labels_pred = labels_true[:]
1.0


Bad (e.g. independent labelings) have negative or close to 0.0 scores:

>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
-0.12...


• Random (uniform) label assignments have a ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Rand index or the V-measure for instance).
• Bounded range [-1, 1]: negative values are bad (independent labelings), similar clusterings have a positive ARI, 1.0 is the perfect match score.
• No assumption is made on the cluster structure: can be used to compare clustering algorithms such as k-means which assumes isotropic blob shapes with results of spectral clustering algorithms which can find cluster with “folded” shapes.

### Drawbacks

• Contrary to inertia, ARI requires knowledge of the ground truth classes while is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).

However ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection (TODO).

### Mathematical formulation

If C is a ground truth class assignment and K the clustering, let us define $a$ and $b$ as:

• $a$, the number of pairs of elements that are in the same set in C and in the same set in K
• $b$, the number of pairs of elements that are in different sets in C and in different sets in K

The raw (unadjusted) Rand index is then given by:

$RI=\frac{a+b}{C_2^{n_{samples}}}$

Where $C_2^{n_{samples}}$ is the total number of possible pairs in the dataset (without ordering).

However the RI score does not guarantee that random label assignments will get a value close to zero (esp. if the number of clusters is in the same order of magnitude as the number of samples).

To counter this effect we can discount the expected RI $E[RI]$ of random labelings by defining the adjusted Rand index as follows:

$ARI=\frac{RI-E[RI]}{\max(RI)-E[RI]}$

## Mutual Information based scores

Given the knowledge of the ground truth class assignments labels_true and our clustering algorithm assignments of the same samples labels_pred, the Mutual Information is a function that measures the agreement of the two assignments, ignoring permutations. Two different normalized versions of this measure are available, Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI). NMI is often used in the literature, while AMI was proposed more recently and is normalized against chance:

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]

0.22504...


One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get the same score:

>>> labels_pred = [1, 1, 0, 0, 3, 3]
0.22504...


All, mutual_info_score, adjusted_mutual_info_score and normalized_mutual_info_score are symmetric: swapping the argument does not change the score. Thus they can be used as a consensus measure:

>>> metrics.adjusted_mutual_info_score(labels_pred, labels_true)
0.22504...


Perfect labeling is scored 1.0:

>>> labels_pred = labels_true[:]
1.0

>>> metrics.normalized_mutual_info_score(labels_true, labels_pred)
1.0

This is not true for mutual_info_score, which is therefore harder to judge:

>>> metrics.mutual_info_score(labels_true, labels_pred)
0.69...


Bad (e.g. independent labelings) have non-positive scores:

>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
-0.10526...


• Random (uniform) label assignments have a AMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure for instance).
• Upper bound of 1: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, an AMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).

### Drawbacks

• Contrary to inertia, MI-based measures require the knowledge of the ground truth classes while almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However MI-based measures can also be useful in purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.
• NMI and MI are not adjusted against chance.

### Mathematical formulation

Assume two label assignments (of the same N objects), $U$ and $V$. Their entropy is the amount of uncertainty for a partition set, defined by:

$H(U)=-\sum_{i=1}^{|V|} P'(j)\log(P(i))$

where $P(i)=|U_i|/N$ is the probability that an object picked at random from $U$ falls into class $U_i$. Likewise for $V$:

$H(V)=-\sum_{j=1}^{|V|}P'(j)\log(P'(j))$

With $P'(j)=|V_j|/N$. The mutual information (MI) between $U$ and $V$ is calculated by:

$MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i,j) \log(\frac{P(i,j)}{P(i)P'(j)})$

where $P(i,j)=|U_i \cap V_j|/N$ is the probability that an object picked at random falls into both classes $U_i$ and $V_j$.

It also can be expressed in set cardinality formulation:

$MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N} \log (\frac{N|U_i \cap V_j|}{|U_i||V_j|})$

The normalized mutual information is defined as

$NMI(U,V)=\frac{MI(U,V)}{mean(H(U),H(V))}$

This value of the mutual information and also the normalized variant is not adjusted for chance and will tend to increase as the number of different labels (clusters) increases, regardless of the actual amount of “mutual information” between the label assignments.

The expected value for the mutual information can be calculated using the following equation. In this equation, $a_i=|U_i|$ (the number of elements in $U_i$) and $b_j=|V_j|$ (the number of elements in $V_j$).

$E[MI(U,V)]=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \sum_{n_{ij}=(a_i+b_j-N)}^{\min(a_i,b_i)} \frac{n_{ij}}{N} \log(\frac{N \cdot n_{ij}}{a_i b_j}) \frac{a_i!b_j!(N-a_i)!(N-b_j)!}{N!n_{ij}!(a_i-n_{ij})!(b_j-n_{ij})!(N-a_i-b_j+n_{ij})!}$

Using the expected value, the adjusted mutual information can then be calculated using a similar form to that of the adjusted Rand index:

$AMI=\frac{MI-E[MI]}{mean(H(U),H(V))-E[MI]}$

For normalized mutual information and adjusted mutual information, the normalizing value is typically some generalized mean of the entropies of each clustering. Various generalized means exist, and no firm rules exist for preferring one over the others. The decision is largely a field-by-field basis; for instance, in community detection, the arithmetic mean is most common. Each normalizing method provides “qualitatively similar behaviours” . In our implementation, this is controlled by the average_method parameter.

Vinh et al. (2010) named variants of NMI and AMI by their averaging method. Their ‘sqrt’ and ‘sum’ averages are the geometric and arithmetic means; we use these more broadly common names.

## Homogeneity, completeness and V-measure

Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metric using conditional entropy analysis.

In particular Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assignment:

• homogeneity: each cluster contains only members of a single class.
• completeness: all members of a given class are assigned to the same cluster.

We can turn those concept as scores homogeneity_score and completeness_score. Both are bounded below by 0.0 and above by 1.0 (higher is better):

 >>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]

>>> metrics.homogeneity_score(labels_true, labels_pred)
0.66...

>>> metrics.completeness_score(labels_true, labels_pred)
0.42...


Their harmonic mean called V-measure is computed by v_measure_score:

>>> metrics.v_measure_score(labels_true, labels_pred)
0.51...


The V-measure is actually equivalent to the mutual information (NMI) discussed above, with the aggregation function being the arithmetic mean.

Homogeneity, completeness and V-measure can be computed at once using homogeneity_completeness_v_measure as follows:

>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
...
(0.66..., 0.42..., 0.51...)


The following clustering assignment is slightly better, since it is homogeneous but not complete:

>>> labels_pred = [0, 0, 0, 1, 2, 2]
>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
...
(1.0, 0.68..., 0.81...)


Note: v_measure_score is symmetric: it can be used to evaluate the agreement of two independent assignments on the same dataset.
This is not the case for completeness_score and homogeneity_score: both are bound by the relationship:

	homogeneity_score(a, b) == completeness_score(b, a)

• Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.
• Intuitive interpretation: clustering with bad V-measure can be qualitatively analyzed in terms of homogeneity and completeness to better feel what ‘kind’ of mistakes is done by the assignment.
• No assumption is made on the cluster structure: can be used to compare clustering algorithms such as k-means which assumes isotropic blob shapes with results of spectral clustering algorithms which can find cluster with “folded” shapes.

### Drawbacks

• The previously introduced metrics are not normalized with regards to random labeling: this means that depending on the number of samples, clusters and ground truth classes, a completely random labeling will not always yield the same values for homogeneity, completeness and hence v-measure. In particular random labeling won’t yield zero scores especially when the number of clusters is large.
This problem can safely be ignored when the number of samples is more than a thousand and the number of clusters is less than 10. For smaller sample sizes or larger number of clusters it is safer to use an adjusted index such as the Adjusted Rand Index (ARI).
• These metrics require the knowledge of the ground truth classes while almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).

### Mathematical formulation

Homogeneity and completeness scores are formally given by:

$h=1 -\frac{H(C|K)}{H(C)}$

$c=1- \frac{H(K|C)}{H(K)}$

where $H(C|K)$ is the conditional entropy of the classes given the cluster assignments and is given by:

$H(C|K)=-\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{n_{c,k}}{n} \cdot \log(\frac{n_{c,k}}{n})$

and $H(C)$ is the entropy of the classes and is given by:

$H(C)=-\sum_{c=1}^{|C|} \frac{n_c}{n} \cdot \log(\frac{n_c}{n})$

with $n$ the total number of samples,$n_c$ and $n_k$ the number of samples respectively belonging to class $c$ and cluster $k$, and finally $n_{c,k}$ the number of samples from class $c$ assigned to cluster $k$.

The conditional entropy of clusters given class $H(K|C)$ and the entropy of clusters $H(K)$ are defined in a symmetric manner.

Rosenberg and Hirschberg further define V-measure as the harmonic mean of homogeneity and completeness:

$\nu = 2 \cdot \frac{h \cdot c}{h+c}$

### Fowlkes-Mallows scores

The Fowlkes-Mallows index (sklearn.metrics.fowlkes_mallows_score) can be used when the ground truth class assignments of the samples is known. The Fowlkes-Mallows score FMI is defined as the geometric mean of the pairwise precision and recall:

$FMI=\frac{TP}{\sqrt{(TP+FP)(TP+FN)}}$

Where TP is the number of True Positive (i.e. the number of pair of points that belong to the same clusters in both the true labels and the predicted labels), FP is the number of False Positive (i.e. the number of pair of points that belong to the same clusters in the true labels and not in the predicted labels) and FN is the number of False Negative (i.e the number of pair of points that belongs in the same clusters in the predicted labels and not in the true labels).

The score ranges from 0 to 1. A high value indicates a good similarity between two clusters.

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>>
>>> metrics.fowlkes_ mallows_score(labels_true, labels_pred)
0.47140...


One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get the same score:

 >>> labels_pred = [1, 1, 0, 0, 3, 3]

>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.47140...


Perfect labeling is scored 1.0:

>>> labels_pred = labels_true[:]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
1.0


Bad (e.g. independent labelings) have zero scores:

>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.0


• Random (uniform) label assignments have a FMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure for instance).
• Upper-bounded at 1: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, values of exactly 0 indicate purely independent label assignments and a FMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).
• No assumption is made on the cluster structure: can be used to compare clustering algorithms such as k-means which assumes isotropic blob shapes with results of spectral clustering algorithms which can find cluster with “folded” shapes.

### Drawbacks

• Contrary to inertia, FMI-based measures require the knowledge of the ground truth classes while almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).

## Silhouette Coefficient

If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

• a: The mean distance between a sample and all other points in the same class.
• b: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:

$s=\frac{b-a}{\max(a,b)}$

The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.

>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> X = dataset.data
>>> y = dataset.target


In normal usage, the Silhouette Coefficient is applied to the results of a cluster analysis.

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.silhouette_score(X, labels, metric='euclidean')
...
0.55...

• The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
• The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

### Drawbacks

• The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

## Calinski-Harabaz Index

If the ground truth labels are not known, the Calinski-Harabaz index (sklearn.metrics.calinski_harabaz_score) - also known as the Variance Ratio Criterion - can be used to evaluate the model, where a higher Calinski-Harabaz score relates to a model with better defined clusters.

For $k$ clusters, the Calinski-Harabaz score $s$ is given as the ratio of the between-clusters dispersion mean and the within-cluster dispersion:

$s(k)= \frac{Tr(B_k)}{Tr(W_k)} \times \frac{N-k}{k-1}$

where $B_k$ is the between group dispersion matrix and $W_k$ is the within-cluster dispersion matrix defined by:

$W_k=\sum_{q=1}^{k} \sum_{x \in C_q} (x-c_q)(x-c_q)^T$

$B_k=\sum_{q} n_q(c_q-c)(c_q-c)^T$

with $N$ be the number of points in our data,$C_q$ be the set of points in cluster $q$, $c_q$ be the center of cluster $q$, $c$ be the center of $E$, $n_q$ be the number of points in cluster $q$.

>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> X = dataset.data
>>> y = dataset.target


In normal usage, the Calinski-Harabaz index is applied to the results of a cluster analysis.

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabaz_score(X, labels)
561.62... 

• The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
• The score is fast to compute

### Drawbacks

• The Calinski-Harabaz index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

## Davies-Bouldin Index

If the ground truth labels are not known, the Davies-Bouldin index (sklearn.metrics.davies_bouldin_score) can be used to evaluate the model, where a lower Davies-Bouldin index relates to a model with better separation between the clusters.

The index is defined as the average similarity between each cluster $C_i$ for $i=1,\dots,k$ and its most similar one $C_j$. In the context of this index, similarity is defined as a measure $R_{ij}$ that trades off:

• $s_i$, the average distance between each point of cluster $i$ and the centroid of that cluster – also know as cluster diameter.
• $d_{ij}$, the distance between cluster centroids $i$ and $j$.

A simple choice to construct $R_{ij}$ so that it is nonnegative and symmetric is:

$R_{ij}=\frac{s_i+s_j}{d_{ij}}$

Then the Davies-Bouldin index is defined as:

$DB=\frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}$

Zero is the lowest possible score. Values closer to zero indicate a better partition.

In normal usage, the Davies-Bouldin index is applied to the results of a cluster analysis as follows:

>>> from sklearn import datasets
>>> X = iris.data
>>> from sklearn.c luster import KMeans
>>> from sklearn.metrics import davies_bouldin_score
>>> kmeans = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans.labels_
>>> davies_bouldin_score(X, labels)
0.6619...


• The computation of Davies-Bouldin is simpler than that of Silhouette scores.
• The index is computed only quantities and features inherent to the dataset.

### Drawbacks

• The Davies-Boulding index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained from DBSCAN.
• The usage of centroid distance limits the distance metric to Euclidean space.
• A good value reported by this method does not imply the best information retrieval.

## Contingency Matrix

Contingency matrix (sklearn.metrics.cluster.contingency_matrix) reports the intersection cardinality for every true/predicted cluster pair. The contingency matrix provides sufficient statistics for all clustering metrics where the samples are independent and identically distributed and one doesn’t need to account for some instances not being clustered.

Here is an example:

>>> from sklearn.metrics.cluster import contingency_matrix
>>> x = ["a", "a", "a", "b", "b", "b"]
>>> y = [0, 0, 1, 1, 2, 2]
>>> contingency_matrix(x, y)
array([ [2, 1, 0],
[0, 1, 2] ])


The first row of output array indicates that there are three samples whose true cluster is “a”. Of them, two are in predicted cluster 0, one is in 1, and none is in 2. And the second row indicates that there are three samples whose true cluster is “b”. Of them, none is in predicted cluster 0, one is in 1 and two are in 2.

A confusion matrix for classification is a square contingency matrix where the order of rows and columns correspond to a list of classes.