Manifold learning
Manifold learning is an approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.
Introduction
High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.
The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.
To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant Analysis, and others. These algorithms define specific rubrics to choose an “interesting” linear projection of the data. These methods can be powerful, but often miss important nonlinear structure in the data.
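As a rough illustration of the two linear approaches described so far, the sketch below projects a toy 3-D dataset to 2-D with a random projection and with PCA. The dataset (make_s_curve) and all parameter values are assumptions chosen for the example, not recommendations.

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

# Toy data: 500 points on a 2-D sheet bent into an S shape in 3-D.
X, _ = make_s_curve(n_samples=500, random_state=0)

# Random linear projection: the projection directions are drawn at random.
random_proj = GaussianRandomProjection(n_components=2, random_state=0)
X_random = random_proj.fit_transform(X)

# PCA: the linear projection that retains the most variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_random.shape, X_pca.shape)
print("fraction of variance kept by PCA:", pca.explained_variance_ratio_.sum())
```

Both results are linear views of the data, so neither can unroll the S shape; that is the gap manifold learning tries to fill.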
Manifold learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to nonlinear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.
The manifold learning implementations available in scikit-learn are summarized below.
Isomap
One of the earliest approaches to manifold learning is the Isomap algorithm, short for Isometric Mapping. Isomap can be viewed as an extension of Multidimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points. Isomap can be performed with the object Isomap.
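A minimal usage sketch, assuming a toy S-curve dataset and illustrative values for n_neighbors and n_components:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Toy data: a 2-D manifold embedded in 3-D (an assumption for this example).
X, _ = make_s_curve(n_samples=400, random_state=0)

iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)

print(X_iso.shape)            # one 2-D coordinate per input point
print(iso.dist_matrix_.shape) # graph (geodesic) distances between all points
```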
Complexity
The Isomap algorithm comprises three stages:
 Nearest neighbor search. Isomap uses sklearn.neighbors.BallTree for efficient neighbor search. The cost is approximately $O[D \log(k) N \log(N)]$, for $k$ nearest neighbors of $N$ points in $D$ dimensions.
 Shortest-path graph search. The most efficient known algorithms for this are Dijkstra’s algorithm, which is approximately $O[N^2(k + \log(N))]$, or the Floyd-Warshall algorithm, which is $O[N^3]$. The algorithm can be selected by the user with the path_method keyword of Isomap. If unspecified, the code attempts to choose the best algorithm for the input data.
 Partial eigenvalue decomposition. The embedding is encoded in the eigenvectors corresponding to the $d$ largest eigenvalues of the $N \times N$ Isomap kernel. For a dense solver, the cost is approximately $O[d N^2]$. This cost can often be improved using the ARPACK solver. The eigensolver can be specified by the user with the eigen_solver keyword of Isomap. If unspecified, the code attempts to choose the best algorithm for the input data.
The overall complexity of Isomap is $O[D \log(k) N \log(N)] + O[N^2(k + \log(N))] + O[d N^2]$.
$N$ : number of training data points
$D$ : input dimension
$k$ : number of nearest neighbors
$d$ : output dimension
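The two user-selectable stages can be set explicitly. The sketch below, on an assumed toy dataset, fits Isomap once with Dijkstra plus a dense eigensolver and once with Floyd-Warshall plus ARPACK, then checks that both shortest-path methods produce the same graph distances.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

X, _ = make_s_curve(n_samples=200, random_state=0)

# "D" selects Dijkstra, "FW" selects Floyd-Warshall.
iso_d = Isomap(n_neighbors=10, n_components=2,
               path_method="D", eigen_solver="dense").fit(X)
iso_fw = Isomap(n_neighbors=10, n_components=2,
                path_method="FW", eigen_solver="arpack").fit(X)

# Both algorithms solve the same shortest-path problem.
same_paths = np.allclose(iso_d.dist_matrix_, iso_fw.dist_matrix_)
print("identical graph distances:", same_paths)
```

For small inputs like this one the choice hardly matters; the cost estimates above suggest it becomes relevant as the number of points grows.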
Locally Linear Embedding
Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best nonlinear embedding.
Locally linear embedding can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding.
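Both entry points can be sketched side by side. The swiss-roll data and the parameter values here are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding, locally_linear_embedding

X, _ = make_swiss_roll(n_samples=400, random_state=0)

# Function form: returns the embedding and its reconstruction error.
Y_func, err_func = locally_linear_embedding(
    X, n_neighbors=12, n_components=2, random_state=0
)

# Estimator form: the same algorithm behind the fit/transform interface.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y_obj = lle.fit_transform(X)

print(Y_func.shape, Y_obj.shape)
print(err_func, lle.reconstruction_error_)
```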
Complexity
The standard LLE algorithm comprises three stages:
 Nearest Neighbors Search. See discussion under Isomap above.
 Weight Matrix Construction. $O[D N k^3]$. The construction of the LLE weight matrix involves the solution of a $k \times k$ linear equation for each of the $N$ local neighborhoods.
 Partial Eigenvalue Decomposition. See discussion under Isomap above.
The overall complexity of standard LLE is $O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2]$.
$N$ : number of training data points
$D$ : input dimension
$k$ : number of nearest neighbors
$d$ : output dimension
Modified Locally Linear Embedding
One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the number of input dimensions, the matrix defining each local neighborhood is rank-deficient. To address this, standard LLE applies an arbitrary regularization parameter $r$, which is chosen relative to the trace of the local weight matrix. Though it can be shown formally that as $r \to 0$, the solution converges to the desired embedding, there is no guarantee that the optimal solution will be found for $r > 0$. This problem manifests itself in embeddings which distort the underlying geometry of the manifold.
One method to address the regularization problem is to use multiple weight vectors in each neighborhood. This is the essence of modified locally linear embedding (MLLE). MLLE can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'modified'. It requires n_neighbors > n_components.
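A short sketch of selecting MLLE through the method keyword, with the stated neighbor requirement checked up front. Dataset and parameter values are assumptions.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=400, random_state=0)

n_neighbors, n_components = 12, 2
# Requirement stated above for MLLE.
assert n_neighbors > n_components

mlle = LocallyLinearEmbedding(
    n_neighbors=n_neighbors,
    n_components=n_components,
    method="modified",
    random_state=0,
)
Y_mlle = mlle.fit_transform(X)
print(Y_mlle.shape)
```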
Complexity
The MLLE algorithm comprises three stages:
 Nearest Neighbors Search. Same as standard LLE.
 Weight Matrix Construction. Approximately $O[D N k^3] + O[N (k-D) k^2]$. The first term is exactly equivalent to that of standard LLE. The second term has to do with constructing the weight matrix from multiple weights. In practice, the added cost of constructing the MLLE weight matrix is relatively small compared to the cost of stages 1 and 3.
 Partial Eigenvalue Decomposition. Same as standard LLE.
The overall complexity of MLLE is $O[D \log(k) N \log(N)] + O[D N k^3] + O[N (k-D) k^2] + O[d N^2]$.
$N$ : number of training data points
$D$ : input dimension
$k$ : number of nearest neighbors
$d$ : output dimension
Hessian Eigenmapping
Hessian Eigenmapping (also known as Hessian-based LLE: HLLE) is another method of solving the regularization problem of LLE. It revolves around a Hessian-based quadratic form at each neighborhood which is used to recover the locally linear structure. Though other implementations note its poor scaling with data size, sklearn implements some algorithmic improvements which make its cost comparable to that of other LLE variants for small output dimension. HLLE can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'hessian'. It requires n_neighbors > n_components * (n_components + 3) / 2.
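The neighbor requirement is easy to get wrong, so the sketch below computes the smallest admissible n_neighbors before fitting. The helper min_hessian_neighbors is hypothetical (it is not part of scikit-learn), and the dense eigensolver is chosen only because the assumed toy dataset is small.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding


def min_hessian_neighbors(n_components):
    """Hypothetical helper: smallest n_neighbors satisfying
    n_neighbors > n_components * (n_components + 3) / 2."""
    # n * (n + 3) is always even, so integer division is exact.
    return n_components * (n_components + 3) // 2 + 1


X, _ = make_s_curve(n_samples=300, random_state=0)

n_components = 2
n_neighbors = max(10, min_hessian_neighbors(n_components))

hlle = LocallyLinearEmbedding(
    n_neighbors=n_neighbors,
    n_components=n_components,
    method="hessian",
    eigen_solver="dense",  # small N, so the dense solver is affordable
)
Y_hlle = hlle.fit_transform(X)
print(min_hessian_neighbors(n_components), Y_hlle.shape)
```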
Complexity
The HLLE algorithm comprises three stages:
 Nearest Neighbors Search. Same as standard LLE.
 Weight Matrix Construction. Approximately $O[D N k^3] + O[N d^6]$. The first term reflects a similar cost to that of standard LLE. The second term comes from a QR decomposition of the local Hessian estimator.
 Partial Eigenvalue Decomposition. Same as standard LLE.
The overall complexity of standard HLLE is $O[D \log(k) N \log(N)] + O[D N k^3] + O[N d^6] + O[d N^2]$.
$N$ : number of training data points
$D$ : input dimension
$k$ : number of nearest neighbors
$d$ : output dimension
Spectral Embedding
Spectral Embedding is an approach to calculating a nonlinear embedding. Scikit-learn implements Laplacian Eigenmaps, which finds a low-dimensional representation of the data using a spectral decomposition of the graph Laplacian. The graph generated can be considered as a discrete approximation of the low-dimensional manifold in the high-dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances. Spectral embedding can be performed with the function spectral_embedding or its object-oriented counterpart SpectralEmbedding.
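A minimal sketch of the estimator interface, assuming a nearest-neighbors affinity and illustrative parameter values:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import SpectralEmbedding

X, _ = make_s_curve(n_samples=400, random_state=0)

se = SpectralEmbedding(
    n_components=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=0,
)
Y_se = se.fit_transform(X)

print(Y_se.shape)
print(se.affinity_matrix_.shape)  # the graph the embedding was computed from
```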
Complexity
The Spectral Embedding (Laplacian Eigenmaps) algorithm comprises three stages:
 Weighted Graph Construction. Transform the raw input data into a graph representation using an affinity (adjacency) matrix.
 Graph Laplacian Construction. The unnormalized graph Laplacian is constructed as $L = D - A$, where $A$ is the affinity matrix and $D$ is the diagonal degree matrix.
 Partial Eigenvalue Decomposition. Eigenvalue decomposition is done on the graph Laplacian.
The overall complexity of spectral embedding is $O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2]$.
$N$ : number of training data points
$D$ : input dimension
$k$ : number of nearest neighbors
$d$ : output dimension
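The three stages can be traced by hand on a small dataset. This is a teaching sketch with dense NumPy arrays and an unnormalized Laplacian, not scikit-learn's internal implementation; the symmetrization step and all parameter values are assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.datasets import make_s_curve
from sklearn.neighbors import kneighbors_graph

X, _ = make_s_curve(n_samples=100, random_state=0)

# Stage 1: k-nearest-neighbor graph, symmetrized so that A == A.T.
A = kneighbors_graph(X, n_neighbors=8, mode="connectivity", include_self=False)
A = (0.5 * (A + A.T)).toarray()

# Stage 2: unnormalized graph Laplacian, degree matrix minus affinity matrix.
degree = np.diag(A.sum(axis=1))
L = degree - A

# Stage 3: eigen-decomposition (dense here because N is small).
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Assuming the graph is connected, the first eigenvector is constant,
# so the 2-D embedding uses the next two.
embedding = eigenvectors[:, 1:3]

# Cross-check the hand-built Laplacian against SciPy's helper.
matches_scipy = np.allclose(L, laplacian(A, normed=False))
print(embedding.shape, matches_scipy)
```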
Local Tangent Space Alignment
Though not technically a variant of LLE, Local Tangent Space Alignment (LTSA) is algorithmically similar enough to LLE that it can be put in this category. Rather than focusing on preserving neighborhood distances as in LLE, LTSA seeks to characterize the local geometry at each neighborhood via its tangent space, and performs a global optimization to align these local tangent spaces to learn the embedding. LTSA can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'ltsa'.
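Because LTSA shares the LLE interface, switching to it is a one-keyword change. The sketch uses the function form on assumed toy data, with the dense eigensolver picked only because the dataset is small.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import locally_linear_embedding

X, _ = make_s_curve(n_samples=300, random_state=0)

Y_ltsa, err_ltsa = locally_linear_embedding(
    X,
    n_neighbors=12,
    n_components=2,
    method="ltsa",
    eigen_solver="dense",
)
print(Y_ltsa.shape, err_ltsa)
```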
Complexity
The LTSA algorithm comprises three stages:
 Nearest Neighbors Search. Same as standard LLE
 Weight Matrix Construction. Approximately $O[D N k^3] + O[k^2 d]$. The first term reflects a similar cost to that of standard LLE.
 Partial Eigenvalue Decomposition. Same as standard LLE.
The overall complexity of standard LTSA is $O[D \log(k) N \log(N)] + O[D N k^3] + O[k^2 d] + O[d N^2]$.
$N$ : number of training data points
$D$ : input dimension
$k$ : number of nearest neighbors
$d$ : output dimension
Multidimensional Scaling (MDS)
Multidimensional scaling (MDS) seeks a low-dimensional representation of the data in which the distances closely match the distances in the original high-dimensional space.
In general, MDS is a technique used for analyzing similarity or dissimilarity data. It attempts to model similarity or dissimilarity data as distances in a geometric space. The data can be ratings of similarity between objects, interaction frequencies of molecules, or trade indices between countries.
There exist two types of MDS algorithm: metric and nonmetric. In scikit-learn, the class MDS implements both. In metric MDS, the input similarity matrix arises from a metric (and thus respects the triangle inequality), and the distances between two output points are then set to be as close as possible to the similarity or dissimilarity data. In the nonmetric version, the algorithm tries to preserve the order of the distances, and hence seeks a monotonic relationship between the distances in the embedded space and the similarities/dissimilarities.
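Let $S$ be the similarity matrix, and $X$ the coordinates of the $n$ input points. Disparities $\hat{d}_{ij}$ are transformations of the similarities chosen in some optimal way. The objective, called the stress, is then defined by
$$\mathrm{Stress}(X) = \sum_{i < j} \left( d_{ij}(X) - \hat{d}_{ij} \right)^2$$
where $d_{ij}(X)$ is the distance between points $i$ and $j$ in the embedding.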
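To make the objective concrete, the sketch below evaluates a stress value by hand, assuming the usual raw-stress form: the sum over pairs $i < j$ of the squared difference between embedded distance and disparity. The function raw_stress is a hypothetical helper written for this example.

```python
import numpy as np
from scipy.spatial.distance import pdist


def raw_stress(X_embedded, disparities):
    """Sum over pairs i < j of (d_ij(X) - dhat_ij)^2.

    `disparities` must be in the condensed pair order used by pdist.
    """
    d = pdist(X_embedded)
    return float(np.sum((d - disparities) ** 2))


# Three points on a line at 0, 1 and 3: pairwise distances are 1, 3 and 2.
X = np.array([[0.0], [1.0], [3.0]])
exact = pdist(X)

stress_perfect = raw_stress(X, exact)      # disparities match the distances
stress_off = raw_stress(X, exact + 1.0)    # every pair is off by exactly 1
print(stress_perfect, stress_off)
```

With three pairs each off by 1, the second value is the sum of three unit squares.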
Metric MDS
In the simplest metric MDS model, called absolute MDS, disparities are defined by $\hat{d}_{ij} = S_{ij}$. With absolute MDS, the value $S_{ij}$ should then correspond exactly to the distance between points $i$ and $j$ in the embedding space.
Most commonly, disparities are set to $\hat{d}_{ij} = b S_{ij}$.
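A minimal sketch of running MDS with its default settings, which build Euclidean dissimilarities from the input features. The dataset and the values of n_init and max_iter are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import MDS

X, _ = make_s_curve(n_samples=100, random_state=0)

mds = MDS(n_components=2, n_init=1, max_iter=300, random_state=0)
Y_mds = mds.fit_transform(X)

print(Y_mds.shape)
print("final stress:", mds.stress_)
```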
Nonmetric MDS
Nonmetric MDS focuses on the ordination of the data. If $S_{ij} < S_{jk}$, then the embedding should enforce $d_{ij} < d_{jk}$. A simple algorithm to enforce that is to use a monotonic regression of $d_{ij}$ on $S_{ij}$, yielding disparities $\hat{d}_{ij}$ in the same order as $S_{ij}$.
A trivial solution to this problem is to set all the points on the origin. In order to avoid that, the disparities are normalized.
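One step of this idea can be sketched with isotonic (monotonic) regression. The example treats the input values as dissimilarities, uses a random "current" embedding, and applies one possible normalization (sum of squared disparities equal to the number of pairs); all of these are assumptions for illustration rather than a description of scikit-learn's internals.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
X_high = rng.rand(20, 5)  # original data
X_low = rng.rand(20, 2)   # some current 2-D embedding (arbitrary here)

S = pdist(X_high)  # input dissimilarities, one per pair i < j
d = pdist(X_low)   # distances in the embedding, same pair order

# Monotonic regression of d on S: disparities follow the ordering of S.
disparities = IsotonicRegression().fit_transform(S, d)

# Normalize so that shrinking every point to the origin cannot win.
disparities *= np.sqrt(len(d) / np.sum(disparities ** 2))

order = np.argsort(S)
is_monotone = bool(np.all(np.diff(disparities[order]) >= -1e-12))
print(is_monotone)
```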
t-distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE (TSNE) converts affinities of data points to probabilities. The affinities in the original space are represented by Gaussian joint probabilities and the affinities in the embedded space are represented by Student’s t-distributions. This allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques:
 Revealing the structure at many scales on a single map
 Revealing data that lie in multiple, different, manifolds or clusters
 Reducing the tendency to crowd points together at the center
While Isomap, LLE and variants are best suited to unfold a single continuous low-dimensional manifold, t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples as highlighted on the S-curve example. This ability to group samples based on the local structure might be beneficial to visually disentangle a dataset that comprises several manifolds at once, as is the case in the digits dataset.
The Kullback-Leibler (KL) divergence of the joint probabilities in the original space and the embedded space will be minimized by gradient descent. Note that the KL divergence is not convex, i.e. multiple restarts with different initializations will end up in local minima of the KL divergence. Hence, it is sometimes useful to try different seeds and select the embedding with the lowest KL divergence.
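The restart strategy can be sketched as a short loop that keeps the run with the lowest final KL divergence. The digits subset, the perplexity value, and the choice of three seeds are assumptions made to keep the example quick; init="random" is set so that the seed actually changes the starting point.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Small subset of the digits data to keep the run short.
X = load_digits().data[:150]

kl_by_seed = {}
best_kl, best_embedding = None, None
for seed in (0, 1, 2):
    tsne = TSNE(n_components=2, init="random", perplexity=20, random_state=seed)
    Y = tsne.fit_transform(X)
    kl_by_seed[seed] = tsne.kl_divergence_
    if best_kl is None or tsne.kl_divergence_ < best_kl:
        best_kl, best_embedding = tsne.kl_divergence_, Y

print(kl_by_seed)
print("lowest KL divergence:", best_kl)
```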
The disadvantages to using t-SNE are roughly:
 t-SNE is computationally expensive, and can take several hours on million-sample datasets where PCA will finish in seconds or minutes.
 The Barnes-Hut t-SNE method is limited to two- or three-dimensional embeddings.
 The algorithm is stochastic and multiple restarts with different seeds can yield different embeddings. However, it is perfectly legitimate to pick the embedding with the least error.
 Global structure is not explicitly preserved. This problem is mitigated by initializing points with PCA (using init='pca').
Optimizing t-SNE
The main purpose of t-SNE is visualization of high-dimensional data. Hence, it works best when the data will be embedded in two or three dimensions.
Optimizing the KL divergence can be a little bit tricky sometimes. There are five parameters that control the optimization of t-SNE and therefore possibly the quality of the resulting embedding:
 perplexity
 early exaggeration factor
 learning rate
 maximum number of iterations
 angle (not used in the exact method)
The perplexity is defined as $k = 2^{S}$ where $S$ is the Shannon entropy of the conditional probability distribution. The perplexity of a $k$-sided die is $k$, so that $k$ is effectively the number of nearest neighbors t-SNE considers when generating the conditional probabilities. Larger perplexities lead to more nearest neighbors and less sensitivity to small structure. Conversely, a lower perplexity considers a smaller number of neighbors, and thus ignores more global information in favour of the local neighborhood. As dataset sizes get larger, more points will be required to get a reasonable sample of the local neighborhood, and hence larger perplexities may be required. Similarly, noisier datasets will require larger perplexity values to encompass enough local neighbors to see beyond the background noise.
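The die analogy can be checked numerically. The function below is a small sketch of the definition (2 raised to the Shannon entropy in bits), not the routine t-SNE uses internally.

```python
import numpy as np


def perplexity(p):
    """Perplexity 2**H(p), with H the Shannon entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability outcomes contribute nothing to H
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** entropy


fair_die = np.full(6, 1.0 / 6.0)
loaded_die = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

print(perplexity(fair_die))    # close to 6 for a fair six-sided die
print(perplexity(loaded_die))  # smaller: fewer outcomes are "effectively" in play
```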
The maximum number of iterations is usually high enough and does not need any tuning. The optimization consists of two phases: the early exaggeration phase and the final optimization. During early exaggeration, the joint probabilities in the original space will be artificially increased by multiplication with a given factor. Larger factors result in larger gaps between natural clusters in the data. If the factor is too high, the KL divergence could increase during this phase. Usually it does not have to be tuned. A critical parameter is the learning rate. If it is too low, gradient descent will get stuck in a bad local minimum. If it is too high, the KL divergence will increase during optimization. More tips can be found in Laurens van der Maaten’s FAQ (see references). The last parameter, angle, is a trade-off between performance and accuracy. Larger angles imply that we can approximate larger regions by a single point, leading to better speed but less accurate results.
“How to Use t-SNE Effectively” provides a good discussion of the effects of the various parameters, as well as interactive plots to explore the effects of different parameters.
Barnes-Hut t-SNE
The Barnes-Hut t-SNE that has been implemented here is usually much slower than other manifold learning algorithms. The optimization is quite difficult and the computation of the gradient is $O[d N \log(N)]$, where $d$ is the number of output dimensions and $N$ is the number of samples. The Barnes-Hut method improves on the exact method, where t-SNE complexity is $O[d N^2]$, but has several other notable differences:
 The Barnes-Hut implementation only works when the target dimensionality is 3 or less. The 2D case is typical when building visualizations.
 Barnes-Hut only works with dense input data. Sparse data matrices can only be embedded with the exact method or can be approximated by a dense low-rank projection, for instance using sklearn.decomposition.TruncatedSVD.
 Barnes-Hut is an approximation of the exact method. The approximation is parameterized with the angle parameter, therefore the angle parameter is unused when method="exact".
 Barnes-Hut is significantly more scalable. Barnes-Hut can be used to embed hundreds of thousands of data points while the exact method can handle thousands of samples before becoming computationally intractable.
For visualization purposes (which is the main use case of t-SNE), using the Barnes-Hut method is strongly recommended. The exact t-SNE method is useful for checking the theoretical properties of the embedding, possibly in higher-dimensional space, but is limited to small datasets due to computational constraints.
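The sketch below illustrates two of the points above: reducing sparse input to a dense low-rank matrix with TruncatedSVD before t-SNE, and switching between the Barnes-Hut and exact methods. The synthetic sparse matrix and every parameter value are assumptions for the example.

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Synthetic sparse input: 120 samples, 200 features, 5% non-zero entries.
X_sparse = sp.random(120, 200, density=0.05, format="csr", random_state=0)

# Dense low-rank approximation of the sparse data.
X_dense = TruncatedSVD(n_components=20, random_state=0).fit_transform(X_sparse)

# Barnes-Hut: approximate gradient, controlled by `angle`.
Y_bh = TSNE(n_components=2, method="barnes_hut", angle=0.5,
            perplexity=15, init="random", random_state=0).fit_transform(X_dense)

# Exact: `angle` is not used; only practical for small datasets like this one.
Y_exact = TSNE(n_components=2, method="exact",
               perplexity=15, init="random", random_state=0).fit_transform(X_dense)

print(X_dense.shape, Y_bh.shape, Y_exact.shape)
```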
Also note that the digits labels roughly match the natural grouping found by t-SNE, while the linear 2D projection of the PCA model yields a representation where label regions largely overlap. This is a strong clue that this data can be well separated by nonlinear methods that focus on the local structure (e.g. an SVM with a Gaussian RBF kernel). However, failing to visualize well-separated, homogeneously labeled groups with t-SNE in 2D does not necessarily imply that the data cannot be correctly classified by a supervised model. It might be the case that 2 dimensions are not enough to accurately represent the internal structure of the data.
Tips on practical use
 Make sure the same scale is used over all features. Because manifold learning methods are based on a nearest-neighbor search, the algorithm may perform poorly otherwise. See StandardScaler for convenient ways of scaling heterogeneous data.
 The reconstruction error computed by each routine can be used to choose the optimal output dimension. For a $d$-dimensional manifold embedded in a $D$-dimensional parameter space, the reconstruction error will decrease as n_components is increased until n_components == d.
 Note that noisy data can “short-circuit” the manifold, in essence acting as a bridge between parts of the manifold that would otherwise be well-separated. Manifold learning on noisy and/or incomplete data is an active area of research.
 Certain input configurations can lead to singular weight matrices, for example when more than two points in the dataset are identical, or when the data is split into disjoint groups. In this case, eigen_solver='arpack' will fail to find the null space. The easiest way to address this is to use eigen_solver='dense', which will work on a singular matrix, though it may be very slow depending on the number of input points. Alternatively, one can attempt to understand the source of the singularity: if it is due to disjoint sets, increasing n_neighbors may help. If it is due to identical points in the dataset, removing these points may help.
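The scaling tip can be applied by chaining a scaler and the embedding in a pipeline. In this sketch one feature of an assumed toy dataset is deliberately inflated so that scaling matters; the dense eigensolver is used because the dataset is small.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_s_curve(n_samples=300, random_state=0)
X[:, 1] *= 1000.0  # put one feature on a very different scale

pipe = make_pipeline(
    StandardScaler(),  # bring every feature to zero mean and unit variance
    LocallyLinearEmbedding(n_neighbors=12, n_components=2, eigen_solver="dense"),
)
Y_scaled = pipe.fit_transform(X)
print(Y_scaled.shape)
```

Without the scaler, the neighbor search would be dominated by the inflated feature.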