Basic UMAP Parameters — umap 0.5 documentation (2024)

UMAP is a fairly flexible non-linear dimension reduction algorithm. It seeks to learn the manifold structure of your data and find a low dimensional embedding that preserves the essential topological structure of that manifold. In this notebook we will generate some visualisable 4-dimensional data, demonstrate how to use UMAP to provide a 2-dimensional representation of it, and then look at how various UMAP parameters can impact the resulting embedding. This documentation is based on the work of Philippe Rivière for visionscarto.net.

To start we’ll need some basic libraries. First numpy will be needed for basic array manipulation. Since we will be visualising the results we will need matplotlib and seaborn. Finally we will need umap for doing the dimension reduction itself.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import umap
%matplotlib inline
sns.set(style='white', context='poster', rc={'figure.figsize':(14,10)})

Next we will need some data to embed into a lower dimensional representation. To make the 4-dimensional data “visualisable” we will generate data uniformly at random from a 4-dimensional cube such that we can interpret a sample as a tuple of (R,G,B,a) values specifying a color (and translucency). Thus when we plot low dimensional representations each point can be colored according to its 4-dimensional value. For this we can use numpy. We will fix a random seed for the sake of consistency.

np.random.seed(42)
data = np.random.rand(800, 4)
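As a quick sanity check (a sketch that is not part of the original notebook), we can confirm that every row of the sampled array is a usable (R,G,B,a) tuple: four components, each in the half-open interval [0, 1), which is the range matplotlib expects when an array of colors is passed as c.

```python
import numpy as np

np.random.seed(42)
data = np.random.rand(800, 4)

# Each row is one sample: (red, green, blue, alpha).
print(data.shape)

# np.random.rand draws from the half-open interval [0, 1).
in_range = bool(data.min() >= 0.0 and data.max() < 1.0)
print(in_range)
```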

Now we need to find a low dimensional representation of the data. As in the Basic Usage documentation, we can do this by using the fit_transform() method on a UMAP object.

fit = umap.UMAP()
%time u = fit.fit_transform(data)
CPU times: user 7.73 s, sys: 211 ms, total: 7.94 s
Wall time: 6.8 s

The resulting value u is a 2-dimensional representation of the data. We can visualise the result by using matplotlib to draw a scatter plot of u. We can color each point of the scatter plot by the associated 4-dimensional color from the source data.

plt.scatter(u[:,0], u[:,1], c=data)
plt.title('UMAP embedding of random colours');

As you can see the result is that the data is placed in 2-dimensional space such that points that were nearby in 4-dimensional space (i.e. are similar colors) are kept close together. Since we drew a random selection of points in the color cube there is a certain amount of induced structure from where the random points happened to clump up in color space.

UMAP has several hyperparameters that can have a significant impact on the resulting embedding. In this notebook we will be covering the four major ones:

  • n_neighbors

  • min_dist

  • n_components

  • metric

Each of these parameters has a distinct effect, and we will look at each in turn. To make exploration simpler we will first write a short utility function that can fit the data with UMAP given a set of parameter choices, and plot the result.

def draw_umap(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean', title=''):
    fit = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        n_components=n_components,
        metric=metric
    )
    u = fit.fit_transform(data)
    fig = plt.figure()
    if n_components == 1:
        ax = fig.add_subplot(111)
        ax.scatter(u[:,0], range(len(u)), c=data)
    if n_components == 2:
        ax = fig.add_subplot(111)
        ax.scatter(u[:,0], u[:,1], c=data)
    if n_components == 3:
        ax = fig.add_subplot(111, projection='3d')
        ax.scatter(u[:,0], u[:,1], u[:,2], c=data, s=100)
    plt.title(title, fontsize=18)

n_neighbors

This parameter controls how UMAP balances local versus global structure in the data. It does this by constraining the size of the local neighborhood UMAP will look at when attempting to learn the manifold structure of the data. This means that low values of n_neighbors will force UMAP to concentrate on very local structure (potentially to the detriment of the big picture), while large values will push UMAP to look at larger neighborhoods of each point when estimating the manifold structure of the data, losing fine detail structure for the sake of getting the broader structure of the data.

We can see that in practice by fitting our dataset with UMAP using a range of n_neighbors values. The default value of n_neighbors for UMAP (as used above) is 15, but we will look at values ranging from 2 (a very local view of the manifold) up to 200 (a quarter of the data).

for n in (2, 5, 10, 20, 50, 100, 200):
    draw_umap(n_neighbors=n, title='n_neighbors = {}'.format(n))

With a value of n_neighbors=2 we see that UMAP merely glues together small chains, but due to the narrow/local view, fails to see how those connect together. It also leaves many different components (and even singleton points). This represents the fact that from a fine detail point of view the data is very disconnected and scattered throughout the space.
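The fragmentation at small neighborhood sizes can be illustrated without UMAP at all. The following is a minimal sketch, assuming a brute-force exact k-nearest-neighbour graph in which a point never counts as its own neighbour; UMAP itself uses approximate neighbour search and weighted (fuzzy) edges, and its exact n_neighbors convention may differ, so the numbers are only indicative. The helper name knn_components is hypothetical.

```python
import numpy as np

def knn_components(points, k):
    """Count connected components of the undirected k-nearest-neighbour graph."""
    n = len(points)
    # Brute-force pairwise squared euclidean distances (fine for 800 points).
    diff = points[:, None, :] - points[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # assumption: a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]

    # Union-find over the edges i -- j for every j among i's k nearest points.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbours[i]:
            ri, rj = find(i), find(int(j))
            if ri != rj:
                parent[ri] = rj
    return len({find(i) for i in range(n)})

rng = np.random.RandomState(42)
points = rng.rand(800, 4)

counts = {k: knn_components(points, k) for k in (1, 2, 5, 15)}
for k, c in counts.items():
    print('k = {:2d}: {} connected components'.format(k, c))
```

Because the k nearest neighbours of a point are always a subset of its k+1 nearest neighbours, the component count can only stay the same or fall as k grows, which mirrors the gluing-together behaviour described in this section.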

As n_neighbors is increased UMAP manages to see more of the overall structure of the data, gluing more components together, and better covering the broader structure of the data. By the stage of n_neighbors=20 we have a fairly good overall view of the data showing how the various colors interrelate to each other over the whole dataset.

As n_neighbors increases further, more and more focus is placed on the overall structure of the data. With n_neighbors=200 this results in a plot where the overall structure (blues, greens, and reds; high luminance versus low) is well captured, but at the loss of some of the finer local structure (individual colors are no longer necessarily immediately near their closest color match).

This effect well exemplifies the local/global tradeoff provided by n_neighbors.

min_dist

The min_dist parameter controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low dimensional representation. This means that low values of min_dist will result in clumpier embeddings. This can be useful if you are interested in clustering, or in finer topological structure. Larger values of min_dist will prevent UMAP from packing points together and will focus on the preservation of the broad topological structure instead.
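One way to picture this is through the target curve that min_dist defines. As far as we understand the reference implementation, min_dist (together with a spread parameter, which defaults to 1.0) sets how strongly two embedded points should count as connected at a given distance: fully connected up to min_dist, then decaying exponentially. The sketch below writes that curve in plain numpy. The function name target_membership is hypothetical, and the smooth curve that UMAP fits to this target is omitted, so treat it as an illustration rather than the library's code.

```python
import numpy as np

def target_membership(dist, min_dist=0.1, spread=1.0):
    """Assumed target connection strength at embedding distance `dist`."""
    dist = np.asarray(dist, dtype=float)
    # Plateau of 1.0 up to min_dist, exponential decay beyond it.
    return np.where(dist <= min_dist, 1.0, np.exp(-(dist - min_dist) / spread))

distances = np.linspace(0.0, 3.0, 7)
for md in (0.0, 0.1, 0.5, 0.99):
    print('min_dist = {:4.2f}:'.format(md),
          np.round(target_membership(distances, min_dist=md), 3))
```

Under this reading, a larger min_dist widens the plateau on which points already count as fully connected, so nothing is gained by pulling them any closer, and the embedding spreads out.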

The default value for min_dist (as used above) is 0.1. We will look at a range of values from 0.0 through to 0.99.

for d in (0.0, 0.1, 0.25, 0.5, 0.8, 0.99):
    draw_umap(min_dist=d, title='min_dist = {}'.format(d))

Here we see that with min_dist=0.0 UMAP manages to find small connected components, clumps and strings in the data, and emphasises these features in the resulting embedding. As min_dist is increased these structures are pushed apart into softer more general features, providing a better overarching view of the data at the loss of the more detailed topological structure.

n_components

As is standard for many scikit-learn dimension reduction algorithms, UMAP provides an n_components parameter that allows the user to determine the dimensionality of the reduced dimension space we will be embedding the data into. Unlike some other visualisation algorithms such as t-SNE, UMAP scales well in the embedding dimension, so you can use it for more than just visualisation in 2 or 3 dimensions.

For the purposes of this demonstration (so that we can see the effects of the parameter) we will only be looking at 1-dimensional and 3-dimensional embeddings, which we have some hope of visualising.

First of all we will set n_components to 1, forcing UMAP to embed the data in a line. For visualisation purposes we will spread the points along the y-axis by their index in the dataset (effectively a random position, since the data was sampled at random) to provide some separation between points.

draw_umap(n_components=1, title='n_components = 1')

Now we will try n_components=3. For visualisation we will make use of matplotlib’s basic 3-dimensional plotting.

draw_umap(n_components=3, title='n_components = 3')
/opt/anaconda3/envs/umap_dev/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning: invalid value encountered in sqrt
  return distances if squared else np.sqrt(distances, out=distances)

Here we can see that with more dimensions in which to work UMAP has an easier time separating out the colors in a way that respects the topological structure of the data.

As mentioned, there is really no requirement to stop at n_components=3. If you are interested in (density based) clustering, or other machine learning techniques, it can be beneficial to pick a larger embedding dimension (say 10, or 50) closer to the dimension of the underlying manifold on which your data lies.

metric

The final UMAP parameter we will be considering in this notebook is the metric parameter. This controls how distance is computed in the ambient space of the input data. By default UMAP supports a wide variety of metrics, including:

Minkowski style metrics

  • euclidean

  • manhattan

  • chebyshev

  • minkowski

Miscellaneous spatial metrics

  • canberra

  • braycurtis

  • haversine

Normalized spatial metrics

  • mahalanobis

  • wminkowski

  • seuclidean

Angular and correlation metrics

  • cosine

  • correlation

Metrics for binary data

  • hamming

  • jaccard

  • dice

  • russellrao

  • kulsinski

  • rogerstanimoto

  • sokalmichener

  • sokalsneath

  • yule

Any of which can be specified by setting metric='<metric name>'; for example to use cosine distance as the metric you would use metric='cosine'.
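To give a feel for how differently these metrics can score the same pair of points, here is a small sketch that computes four of the named distances by hand with numpy, using their standard textbook definitions. The two vectors are arbitrary illustrative values, not taken from the dataset above.

```python
import numpy as np

# Two arbitrary 4-dimensional points (illustrative values only).
a = np.array([0.2, 0.4, 0.6, 0.8])
b = np.array([0.1, 0.4, 0.9, 0.3])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of per-coordinate differences
chebyshev = np.max(np.abs(a - b))           # largest per-coordinate difference
# Cosine distance compares direction only and ignores vector length.
cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for name, value in [('euclidean', euclidean), ('manhattan', manhattan),
                    ('chebyshev', chebyshev), ('cosine', cosine)]:
    print('{:10s} {:.4f}'.format(name, value))
```

For any pair of points the chebyshev distance is at most the euclidean distance, which in turn is at most the manhattan distance, so the choice of metric changes which points count as near neighbours.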

UMAP offers more than this however – it supports custom user defined metrics as long as those metrics can be compiled in nopython mode by numba. For this notebook we will be looking at such custom metrics. To define such metrics we’ll need numba …

import numba

For our first custom metric we’ll define the distance to be the absolute value of difference in the red channel.

@numba.njit()
def red_channel_dist(a,b):
    return np.abs(a[0] - b[0])

To get more adventurous it will be useful to have some colorspace conversion – to keep things simple we’ll just use HSL formulas to extract the hue, saturation, and lightness from an (R,G,B) tuple.

@numba.njit()
def hue(r, g, b):
    cmax = max(r, g, b)
    cmin = min(r, g, b)
    delta = cmax - cmin
    if cmax == r:
        return ((g - b) / delta) % 6
    elif cmax == g:
        return ((b - r) / delta) + 2
    else:
        return ((r - g) / delta) + 4

@numba.njit()
def lightness(r, g, b):
    cmax = max(r, g, b)
    cmin = min(r, g, b)
    return (cmax + cmin) / 2.0

@numba.njit()
def saturation(r, g, b):
    cmax = max(r, g, b)
    cmin = min(r, g, b)
    chroma = cmax - cmin
    light = lightness(r, g, b)
    if light == 1:
        return 0
    else:
        return chroma / (1 - abs(2*light - 1))
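As a cross-check on these formulas, the sketch below restates them as plain Python (without the numba decorator, so that it runs on its own) and compares them with the standard library's colorsys.rgb_to_hls on random colors. Note that colorsys returns values in H, L, S order and scales hue to [0, 1), whereas the hue function here ranges over [0, 6).

```python
import colorsys
import random

# Plain-Python restatements of the formulas above (no numba needed here).
def hue(r, g, b):
    cmax, cmin = max(r, g, b), min(r, g, b)
    delta = cmax - cmin
    if cmax == r:
        return ((g - b) / delta) % 6
    elif cmax == g:
        return ((b - r) / delta) + 2
    else:
        return ((r - g) / delta) + 4

def lightness(r, g, b):
    return (max(r, g, b) + min(r, g, b)) / 2.0

def saturation(r, g, b):
    chroma = max(r, g, b) - min(r, g, b)
    light = lightness(r, g, b)
    if light == 1:
        return 0
    return chroma / (1 - abs(2 * light - 1))

random.seed(42)
max_err = 0.0
for _ in range(1000):
    r, g, b = random.random(), random.random(), random.random()
    h_ref, l_ref, s_ref = colorsys.rgb_to_hls(r, g, b)  # order is H, L, S
    max_err = max(max_err,
                  abs(hue(r, g, b) / 6.0 - h_ref),
                  abs(lightness(r, g, b) - l_ref),
                  abs(saturation(r, g, b) - s_ref))

print('largest deviation from colorsys:', max_err)
```

One caveat: for an exactly grey color (r == g == b) the hue formula divides by zero, whereas colorsys returns a hue of 0 in that case. With uniformly random floating point data this practically never occurs, but real image data would need a guard.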

With that in hand we can define three extra distances. The first simply measures the difference in hue, the second measures the squared euclidean distance in a combined saturation and lightness space, while the third measures distance in the full HSL space.

@numba.njit()
def hue_dist(a, b):
    diff = (hue(a[0], a[1], a[2]) - hue(b[0], b[1], b[2])) % 6
    if diff < 0:
        return diff + 6
    else:
        return diff

@numba.njit()
def sl_dist(a, b):
    a_sat = saturation(a[0], a[1], a[2])
    b_sat = saturation(b[0], b[1], b[2])
    a_light = lightness(a[0], a[1], a[2])
    b_light = lightness(b[0], b[1], b[2])
    return (a_sat - b_sat)**2 + (a_light - b_light)**2

@numba.njit()
def hsl_dist(a, b):
    a_sat = saturation(a[0], a[1], a[2])
    b_sat = saturation(b[0], b[1], b[2])
    a_light = lightness(a[0], a[1], a[2])
    b_light = lightness(b[0], b[1], b[2])
    a_hue = hue(a[0], a[1], a[2])
    b_hue = hue(b[0], b[1], b[2])
    return (a_sat - b_sat)**2 + (a_light - b_light)**2 + (((a_hue - b_hue) % 6) / 6.0)
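One property of hue_dist is worth noting: because it takes (hue(a) - hue(b)) modulo 6, the result depends on the order of the arguments. The sketch below contrasts that arithmetic with a symmetric alternative that measures the shortest way around the hue circle. Both functions take precomputed hue values in [0, 6) and are written in plain Python for illustration. The name hue_dist_circular is hypothetical, and it would need the @numba.njit() decorator (and the hue extraction step) before it could be passed to UMAP as a metric.

```python
def hue_dist_directed(h_a, h_b):
    # Same arithmetic as hue_dist above, applied to precomputed hues.
    return (h_a - h_b) % 6

def hue_dist_circular(h_a, h_b):
    # Hypothetical symmetric variant: shortest arc between two hues.
    diff = abs(h_a - h_b) % 6
    return min(diff, 6 - diff)

# Two hues on either side of the wrap-around point.
print(hue_dist_directed(0.5, 5.5), hue_dist_directed(5.5, 0.5))   # 1.0 5.0
print(hue_dist_circular(0.5, 5.5), hue_dist_circular(5.5, 0.5))   # 1.0 1.0
```

Whether the asymmetry matters in practice depends on how the nearest neighbour search calls the metric; the symmetric form is simply the safer choice if a true circular distance is what you intend.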

With such custom metrics in hand we can get UMAP to embed the data using those metrics to measure the distance between our input data points. Note that numba provides significant flexibility in what we can do in defining distance functions. Despite this we retain the high performance we expect from UMAP even using such custom functions.

for m in ("euclidean", red_channel_dist, sl_dist, hue_dist, hsl_dist):
    name = m if type(m) is str else m.__name__
    draw_umap(n_components=2, metric=m, title='metric = {}'.format(name))

And here we can see the effects of the metrics quite clearly. The pure red channel correctly sees the data as living on a one dimensional manifold, the hue metric interprets the data as living in a circle, and the HSL metric fattens out the circle according to the saturation and lightness. This provides a reasonable demonstration of the power and flexibility of UMAP in understanding the underlying topology of data, and finding a suitable low dimensional representation of that topology.
