UMAP is a fairly flexible non-linear dimension reduction algorithm. It seeks to learn the manifold structure of your data and find a low dimensional embedding that preserves the essential topological structure of that manifold. In this notebook we will generate some visualisable 4-dimensional data, demonstrate how to use UMAP to provide a 2-dimensional representation of it, and then look at how various UMAP parameters can impact the resulting embedding. This documentation is based on the work of Philippe Rivière for visionscarto.net.
To start we’ll need some basic libraries. First `numpy` will be needed for basic array manipulation. Since we will be visualising the results we will need `matplotlib` and `seaborn`. Finally we will need `umap` for doing the dimension reduction itself.
    import numpy as np
    import matplotlib.pyplot as plt
    # Imported for its side effect of registering the '3d' projection
    from mpl_toolkits.mplot3d import Axes3D
    import seaborn as sns
    import umap
    %matplotlib inline

    sns.set(style='white', context='poster', rc={'figure.figsize':(14,10)})
Next we will need some data to embed into a lower dimensional representation. To make the 4-dimensional data “visualisable” we will generate data uniformly at random from a 4-dimensional cube such that we can interpret a sample as a tuple of (R,G,B,a) values specifying a color (and translucency). Thus when we plot low dimensional representations each point can be colored according to its 4-dimensional value. For this we can use `numpy`. We will fix a random seed for the sake of consistency.
    np.random.seed(42)
    data = np.random.rand(800, 4)
Now we need to find a low dimensional representation of the data. As in the Basic Usage documentation, we can do this by using the `fit_transform()` method on a UMAP object.
    fit = umap.UMAP()
    %time u = fit.fit_transform(data)

    CPU times: user 7.73 s, sys: 211 ms, total: 7.94 s
    Wall time: 6.8 s
The resulting value `u` is a 2-dimensional representation of the data. We can visualise the result by using `matplotlib` to draw a scatter plot of `u`. We can color each point of the scatter plot by the associated 4-dimensional color from the source data.
    plt.scatter(u[:,0], u[:,1], c=data)
    plt.title('UMAP embedding of random colours');
![Basic UMAP Parameters — umap 0.5 documentation (1) Basic UMAP Parameters — umap 0.5 documentation (1)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_8_1.png)
As you can see the result is that the data is placed in 2-dimensional space such that points that were nearby in 4-dimensional space (i.e. are similar colors) are kept close together. Since we drew a random selection of points in the color cube there is a certain amount of induced structure from where the random points happened to clump up in color space.
UMAP has several hyperparameters that can have a significant impact on the resulting embedding. In this notebook we will be covering the four major ones:
- `n_neighbors`
- `min_dist`
- `n_components`
- `metric`
Each of these parameters has a distinct effect, and we will look at each in turn. To make exploration simpler we will first write a short utility function that can fit the data with UMAP given a set of parameter choices, and plot the result.
    def draw_umap(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean', title=''):
        """Fit UMAP to the global `data` with the given parameters and plot the embedding."""
        fit = umap.UMAP(
            n_neighbors=n_neighbors,
            min_dist=min_dist,
            n_components=n_components,
            metric=metric
        )
        u = fit.fit_transform(data)
        fig = plt.figure()
        if n_components == 1:
            # Use each point's index as its y-coordinate so points do not overlap
            ax = fig.add_subplot(111)
            ax.scatter(u[:,0], range(len(u)), c=data)
        if n_components == 2:
            ax = fig.add_subplot(111)
            ax.scatter(u[:,0], u[:,1], c=data)
        if n_components == 3:
            ax = fig.add_subplot(111, projection='3d')
            ax.scatter(u[:,0], u[:,1], u[:,2], c=data, s=100)
        plt.title(title, fontsize=18)
n_neighbors
This parameter controls how UMAP balances local versus global structure in the data. It does this by constraining the size of the local neighborhood UMAP will look at when attempting to learn the manifold structure of the data. This means that low values of `n_neighbors` will force UMAP to concentrate on very local structure (potentially to the detriment of the big picture), while large values will push UMAP to look at larger neighborhoods of each point when estimating the manifold structure of the data, losing fine detail structure for the sake of getting the broader structure of the data.
We can see that in practice by fitting our dataset with UMAP using a range of `n_neighbors` values. The default value of `n_neighbors` for UMAP (as used above) is 15, but we will look at values ranging from 2 (a very local view of the manifold) up to 200 (a quarter of the data).
    for n in (2, 5, 10, 20, 50, 100, 200):
        draw_umap(n_neighbors=n, title='n_neighbors = {}'.format(n))
![Basic UMAP Parameters — umap 0.5 documentation (2) Basic UMAP Parameters — umap 0.5 documentation (2)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_13_1.png)
![Basic UMAP Parameters — umap 0.5 documentation (3) Basic UMAP Parameters — umap 0.5 documentation (3)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_13_2.png)
![Basic UMAP Parameters — umap 0.5 documentation (4) Basic UMAP Parameters — umap 0.5 documentation (4)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_13_3.png)
![Basic UMAP Parameters — umap 0.5 documentation (5) Basic UMAP Parameters — umap 0.5 documentation (5)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_13_4.png)
![Basic UMAP Parameters — umap 0.5 documentation (6) Basic UMAP Parameters — umap 0.5 documentation (6)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_13_5.png)
![Basic UMAP Parameters — umap 0.5 documentation (7) Basic UMAP Parameters — umap 0.5 documentation (7)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_13_6.png)
![Basic UMAP Parameters — umap 0.5 documentation (8) Basic UMAP Parameters — umap 0.5 documentation (8)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_13_7.png)
With a value of `n_neighbors=2` we see that UMAP merely glues together small chains, but due to the narrow/local view, fails to see how those connect together. It also leaves many different components (and even singleton points). This represents the fact that from a fine detail point of view the data is very disconnected and scattered throughout the space.
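To get a feel for why a very small neighborhood leaves the data in pieces, we can count the connected components of a plain k-nearest-neighbor graph. This is a rough illustration only: it uses the standard library on a smaller stand-in sample, the helper `knn_component_count` is our own, and it is not how UMAP builds its graph internally. Here `k` counts neighbors other than the point itself.

```python
import math
import random


def knn_component_count(points, k):
    """Count connected components of the undirected k-nearest-neighbor graph."""
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, p in enumerate(points):
        # Indices of all other points, sorted by Euclidean distance to p
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            parent[find(i)] = find(j)  # join i with each of its k neighbors

    return len({find(i) for i in range(n)})


random.seed(42)
# Stand-in for the notebook's data: 200 uniform samples from the 4-d unit cube
points = [[random.random() for _ in range(4)] for _ in range(200)]

for k in (1, 2, 5, 10):
    print("k =", k, "components =", knn_component_count(points, k))
```

Since every larger neighborhood contains the smaller ones, the component count can only stay the same or fall as `k` grows, which mirrors the gluing-together effect seen in the plots.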
As `n_neighbors` is increased UMAP manages to see more of the overall structure of the data, gluing more components together, and better covering the broader structure of the data. By the stage of `n_neighbors=20` we have a fairly good overall view of the data showing how the various colors interrelate to each other over the whole dataset.
As `n_neighbors` increases further, more and more focus is placed on the overall structure of the data. With `n_neighbors=200` this results in a plot where the overall structure (blues, greens, and reds; high luminance versus low) is well captured, but at the loss of some of the finer local structure (individual colors are no longer necessarily immediately near their closest color match).
This effect well exemplifies the local/global tradeoff provided by `n_neighbors`.
min_dist
The `min_dist` parameter controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low dimensional representation. This means that low values of `min_dist` will result in clumpier embeddings. This can be useful if you are interested in clustering, or in finer topological structure. Larger values of `min_dist` will prevent UMAP from packing points together and will focus on the preservation of the broad topological structure instead.
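One way to picture this is through the similarity UMAP aims for between two embedded points as a function of their distance. The sketch below assumes the target curve is flat at 1 up to `min_dist` and decays exponentially beyond it, with a scale parameter `spread` fixed at 1.0; to the best of our understanding this matches the curve the reference implementation fits its low dimensional similarity to, but treat it as an illustration rather than the library's code. The function name `target_similarity` is our own.

```python
import math


def target_similarity(d, min_dist=0.1, spread=1.0):
    """Assumed target similarity for two embedded points at distance d."""
    if d <= min_dist:
        return 1.0  # within min_dist, points already count as fully "together"
    return math.exp(-(d - min_dist) / spread)


for min_dist in (0.0, 0.1, 0.5, 0.99):
    row = [round(target_similarity(d, min_dist), 3) for d in (0.05, 0.25, 1.0, 2.0)]
    print("min_dist =", min_dist, row)
```

Under this assumption a larger `min_dist` widens the flat region, so there is nothing to gain from squeezing points closer than that distance and the embedding spreads out.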
The default value for `min_dist` (as used above) is 0.1. We will look at a range of values from 0.0 through to 0.99.
    for d in (0.0, 0.1, 0.25, 0.5, 0.8, 0.99):
        draw_umap(min_dist=d, title='min_dist = {}'.format(d))
![Basic UMAP Parameters — umap 0.5 documentation (9) Basic UMAP Parameters — umap 0.5 documentation (9)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_16_1.png)
![Basic UMAP Parameters — umap 0.5 documentation (10) Basic UMAP Parameters — umap 0.5 documentation (10)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_16_2.png)
![Basic UMAP Parameters — umap 0.5 documentation (11) Basic UMAP Parameters — umap 0.5 documentation (11)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_16_3.png)
![Basic UMAP Parameters — umap 0.5 documentation (12) Basic UMAP Parameters — umap 0.5 documentation (12)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_16_4.png)
![Basic UMAP Parameters — umap 0.5 documentation (13) Basic UMAP Parameters — umap 0.5 documentation (13)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_16_5.png)
![Basic UMAP Parameters — umap 0.5 documentation (14) Basic UMAP Parameters — umap 0.5 documentation (14)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_16_6.png)
Here we see that with `min_dist=0.0` UMAP manages to find small connected components, clumps and strings in the data, and emphasises these features in the resulting embedding. As `min_dist` is increased these structures are pushed apart into softer more general features, providing a better overarching view of the data at the loss of the more detailed topological structure.
n_components
As is standard for many `scikit-learn` dimension reduction algorithms, UMAP provides an `n_components` parameter that allows the user to determine the dimensionality of the reduced dimension space we will be embedding the data into. Unlike some other visualisation algorithms such as t-SNE, UMAP scales well in the embedding dimension, so you can use it for more than just visualisation in 2 or 3 dimensions.
For the purposes of this demonstration (so that we can see the effects of the parameter) we will only be looking at 1-dimensional and 3-dimensional embeddings, which we have some hope of visualizing.
First of all we will set `n_components` to 1, forcing UMAP to embed the data in a line. For visualisation purposes we will spread the data along the y-axis, using each point's index in the (randomly ordered) dataset, to provide some separation between points.
    draw_umap(n_components=1, title='n_components = 1')
![Basic UMAP Parameters — umap 0.5 documentation (15) Basic UMAP Parameters — umap 0.5 documentation (15)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_19_1.png)
Now we will try `n_components=3`. For visualisation we will make use of `matplotlib`’s basic 3-dimensional plotting.
    draw_umap(n_components=3, title='n_components = 3')

    /opt/anaconda3/envs/umap_dev/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning: invalid value encountered in sqrt
      return distances if squared else np.sqrt(distances, out=distances)
![Basic UMAP Parameters — umap 0.5 documentation (16) Basic UMAP Parameters — umap 0.5 documentation (16)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_21_1.png)
Here we can see that with more dimensions in which to work UMAP has an easier time separating out the colors in a way that respects the topological structure of the data.
As mentioned, there is really no requirement to stop at `n_components=3`. If you are interested in (density based) clustering, or other machine learning techniques, it can be beneficial to pick a larger embedding dimension (say 10, or 50) closer to the dimension of the underlying manifold on which your data lies.
metric
The final UMAP parameter we will be considering in this notebook is the `metric` parameter. This controls how distance is computed in the ambient space of the input data. By default UMAP supports a wide variety of metrics, including:
- Minkowski style metrics
    - `euclidean`
    - `manhattan`
    - `chebyshev`
    - `minkowski`
- Miscellaneous spatial metrics
    - `canberra`
    - `braycurtis`
    - `haversine`
- Normalized spatial metrics
    - `mahalanobis`
    - `wminkowski`
    - `seuclidean`
- Angular and correlation metrics
    - `cosine`
    - `correlation`
- Metrics for binary data
    - `hamming`
    - `jaccard`
    - `dice`
    - `russellrao`
    - `kulsinski`
    - `rogerstanimoto`
    - `sokalmichener`
    - `sokalsneath`
    - `yule`
Any of which can be specified by setting `metric='<metric name>'`; for example to use cosine distance as the metric you would use `metric='cosine'`.
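The choice of metric matters because it changes which points count as neighbors. As a small standalone illustration (plain Python, not UMAP's own implementations; the three sample vectors are made up for the example), compare euclidean and cosine distance on 4-dimensional points:

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


dim = (0.1, 0.2, 0.1, 0.2)
bright = (0.4, 0.8, 0.4, 0.8)   # same direction as dim, four times the length
other = (0.2, 0.1, 0.2, 0.1)    # similar length to dim, different direction

print("euclidean:", euclidean(dim, bright), euclidean(dim, other))
print("cosine:   ", cosine_distance(dim, bright), cosine_distance(dim, other))
```

Under euclidean distance `dim` is closer to `other`, while under cosine distance it is closer to `bright`, so the two metrics would hand UMAP different neighborhoods for the same data.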
UMAP offers more than this however – it supports custom user defined metrics as long as those metrics can be compiled in `nopython` mode by numba. For this notebook we will be looking at such custom metrics. To define such metrics we’ll need numba …
    import numba
For our first custom metric we’ll define the distance to be the absolute value of the difference in the red channel.
    @numba.njit()
    def red_channel_dist(a,b):
        return np.abs(a[0] - b[0])
To get more adventurous it will be useful to have some colorspace conversion – to keep things simple we’ll just use HSL formulas to extract the hue, saturation, and lightness from an (R,G,B) tuple.
    @numba.njit()
    def hue(r, g, b):
        cmax = max(r, g, b)
        cmin = min(r, g, b)
        delta = cmax - cmin
        if delta == 0:
            # Grey: hue is undefined, return 0 rather than divide by zero
            return 0.0
        if cmax == r:
            return ((g - b) / delta) % 6
        elif cmax == g:
            return ((b - r) / delta) + 2
        else:
            return ((r - g) / delta) + 4

    @numba.njit()
    def lightness(r, g, b):
        cmax = max(r, g, b)
        cmin = min(r, g, b)
        return (cmax + cmin) / 2.0

    @numba.njit()
    def saturation(r, g, b):
        cmax = max(r, g, b)
        cmin = min(r, g, b)
        chroma = cmax - cmin
        light = lightness(r, g, b)
        if light == 0 or light == 1:
            # Pure black or white: the denominator below would be zero
            return 0.0
        else:
            return chroma / (1 - abs(2*light - 1))
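As a sanity check on these formulas, we can compare plain-Python copies of the three helpers against the standard library's `colorsys.rgb_to_hls`. This is only a sketch: the copies below drop the numba decorator, `colorsys` reports hue on a 0 to 1 scale rather than 0 to 6, and it returns values in H, L, S order. The three test colors are arbitrary non-grey choices.

```python
import colorsys


def hue(r, g, b):
    cmax, cmin = max(r, g, b), min(r, g, b)
    delta = cmax - cmin  # assumed non-zero here (no grey inputs)
    if cmax == r:
        return ((g - b) / delta) % 6
    elif cmax == g:
        return ((b - r) / delta) + 2
    return ((r - g) / delta) + 4


def lightness(r, g, b):
    return (max(r, g, b) + min(r, g, b)) / 2.0


def saturation(r, g, b):
    chroma = max(r, g, b) - min(r, g, b)
    return chroma / (1 - abs(2 * lightness(r, g, b) - 1))


for rgb in [(0.2, 0.6, 0.4), (0.9, 0.1, 0.3), (0.1, 0.2, 0.8)]:
    h_ref, l_ref, s_ref = colorsys.rgb_to_hls(*rgb)  # note the H, L, S order
    print(rgb)
    print("  formulas:", hue(*rgb) / 6.0, lightness(*rgb), saturation(*rgb))
    print("  colorsys:", h_ref, l_ref, s_ref)
```

The two rows printed for each color should agree up to floating point rounding.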
With that in hand we can define three extra distances. The first simply measures the difference in hue, the second measures the squared euclidean distance in a combined saturation and lightness space, while the third combines the saturation, lightness, and hue differences into a distance over the full HSL space.
    @numba.njit()
    def hue_dist(a, b):
        # Hue difference wrapped onto the 6-unit hue circle
        diff = (hue(a[0], a[1], a[2]) - hue(b[0], b[1], b[2])) % 6
        if diff < 0:
            return diff + 6
        else:
            return diff

    @numba.njit()
    def sl_dist(a, b):
        a_sat = saturation(a[0], a[1], a[2])
        b_sat = saturation(b[0], b[1], b[2])
        a_light = lightness(a[0], a[1], a[2])
        b_light = lightness(b[0], b[1], b[2])
        # Squared distance in (saturation, lightness) space
        return (a_sat - b_sat)**2 + (a_light - b_light)**2

    @numba.njit()
    def hsl_dist(a, b):
        a_sat = saturation(a[0], a[1], a[2])
        b_sat = saturation(b[0], b[1], b[2])
        a_light = lightness(a[0], a[1], a[2])
        b_light = lightness(b[0], b[1], b[2])
        a_hue = hue(a[0], a[1], a[2])
        b_hue = hue(b[0], b[1], b[2])
        # Squared saturation/lightness terms plus the wrapped hue difference scaled to [0, 1)
        return (a_sat - b_sat)**2 + (a_light - b_light)**2 + (((a_hue - b_hue) % 6) / 6.0)
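Because `hue_dist` takes the hue difference modulo 6, its value depends on the order of its arguments. If a symmetric circular distance is wanted, one option is to take the shorter way round the hue circle. The sketch below works on hue values directly; `directed_hue_gap` and `circular_hue_gap` are illustrative names of our own, and the symmetric variant is not what produced the plots in this notebook.

```python
def directed_hue_gap(h_a, h_b):
    """Hue difference wrapped into [0, 6), as computed inside hue_dist."""
    return (h_a - h_b) % 6


def circular_hue_gap(h_a, h_b):
    """Hypothetical symmetric variant: the shorter arc between two hues."""
    diff = (h_a - h_b) % 6
    return min(diff, 6 - diff)


print(directed_hue_gap(1.0, 2.0), directed_hue_gap(2.0, 1.0))  # 5.0 1.0
print(circular_hue_gap(1.0, 2.0), circular_hue_gap(2.0, 1.0))  # 1.0 1.0
```

To use such a variant with UMAP it would still need to be wrapped in a function of two data points and compiled with `@numba.njit()`, in the same way as the distances above.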
With such custom metrics in hand we can get UMAP to embed the data using those metrics to measure the distance between our input data points. Note that `numba` provides significant flexibility in what we can do in defining distance functions. Despite this we retain the high performance we expect from UMAP even using such custom functions.
    for m in ("euclidean", red_channel_dist, sl_dist, hue_dist, hsl_dist):
        # Built-in metrics are named by string; custom ones are functions
        name = m if isinstance(m, str) else m.__name__
        draw_umap(n_components=2, metric=m, title='metric = {}'.format(name))
![Basic UMAP Parameters — umap 0.5 documentation (17) Basic UMAP Parameters — umap 0.5 documentation (17)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_32_1.png)
![Basic UMAP Parameters — umap 0.5 documentation (18) Basic UMAP Parameters — umap 0.5 documentation (18)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_32_2.png)
![Basic UMAP Parameters — umap 0.5 documentation (19) Basic UMAP Parameters — umap 0.5 documentation (19)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_32_3.png)
![Basic UMAP Parameters — umap 0.5 documentation (20) Basic UMAP Parameters — umap 0.5 documentation (20)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_32_4.png)
![Basic UMAP Parameters — umap 0.5 documentation (21) Basic UMAP Parameters — umap 0.5 documentation (21)](https://i0.wp.com/umap-learn.readthedocs.io/_images/parameters_32_5.png)
And here we can see the effects of the metrics quite clearly. The pure red channel correctly sees the data as living on a one dimensional manifold, the hue metric interprets the data as living in a circle, and the HSL metric fattens out the circle according to the saturation and lightness. This provides a reasonable demonstration of the power and flexibility of UMAP in understanding the underlying topology of data, and finding a suitable low dimensional representation of that topology.