Confusing output when using output_metric="cosine"

Open billypeanut opened this issue 4 years ago • 5 comments

Hi there,

I'm getting some strange results when requesting the output metric of my results be "cosine", and I'm unsure whether it's my misinterpretation of what this is doing (probable) or a mistake in the code (less probable). I assume that the output_metric argument means that this metric is the measure of distance to use to interpret the output points.

As a test, I input a pre-calculated distance matrix: [[0,1,2,1], [1,0,1,2], [2,1,0,1], [1,2,1,0]] which could describe 4 points, A,B,C,D, equally spaced around an origin point, measured with a cosine distance metric. I would therefore have expected the output points from UMAP to look something like that, and to have distance properties that closely match the input distance matrix. Instead, I get 4 points all along the same line, which if I normalise, end up having almost the exact same coordinates, and therefore have a cosine distance of 0 between them all.

Am I right in what I'm expecting as an output, or am I misinterpreting things? One thing that seems to be missing is that a Euclidean space is unbounded, but cosine distance must be between 0 and 2, so I don't know how UMAP determines things to be as-far-apart-as-possible in a cosine distance metric space. Perhaps it's expecting points with an output distance of 2 in the cosine space to have a distance of 'Inf' in the input distance matrix and since the furthest distance in my matrix is 2, that's considered to be quite close? I can't see if there are any other input arguments to tell UMAP that 2 is actually the farthest apart two points can be - since it's a pre-calculated matrix, I can't give any more information about the distance matrix.
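For reference, the test matrix above can be reproduced with a small sketch that places four points at 0, 90, 180 and 270 degrees on the unit circle and computes their pairwise cosine distances. The helper name `cosine_distance` is made up for illustration and is plain Python, not anything from UMAP.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - (u . v) / (|u| |v|) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Four points A, B, C, D equally spaced around the origin on the unit circle.
angles_deg = [0.0, 90.0, 180.0, 270.0]
points = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles_deg]

# Pairwise cosine distances; rounding hides floating-point noise and
# adding 0.0 turns a possible -0.0 into 0.0 for nicer printing.
matrix = [[round(cosine_distance(p, q), 6) + 0.0 for q in points] for p in points]
for row in matrix:
    print(row)
```

Neighbouring points come out at distance 1 and opposite points at distance 2, which is the largest value cosine distance can take.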

Thanks, BP

billypeanut avatar May 17 '21 11:05 billypeanut

Bounded metrics as output metrics are often a bad idea. You would probably be better off doing something akin to the example of hyperbolic space in the docs: find a space with more amenable properties that has a convenient isometry to the one you want and use that instead.

lmcinnes avatar May 17 '21 14:05 lmcinnes

Thanks for the super fast response. I did also try the Haversine output_metric, which was quite nice because it forces things to the sphere which gives me nice cosine properties. I think I need to think about the problem some more. Thanks again! BP

billypeanut avatar May 17 '21 15:05 billypeanut

@lmcinnes I just ran into the same issue. output_metric="haversine" like in the docs works fine, but output_metric="cosine" gives nonsense results. Can you clarify what you mean by "bounded metrics"? Isn't haversine also bounded? Also, the points generated by output_metric="cosine" aren't even constrained to the unit sphere, so I am highly suspicious of the results.

For example, the following code:

import numpy as np
import plotly.graph_objs as go
import umap

# 1024 random directions, normalized so they lie on the unit sphere
data = np.random.randn(1024, 3)
data = data / np.linalg.norm(data, axis=-1, keepdims=True)

# cosine distance for both the input and the output space
mapper = umap.UMAP(n_components=3, metric="cosine", output_metric="cosine")
mapper = mapper.fit(data)
x, y, z = mapper.embedding_.T

# 3D scatter plot of the raw embedding coordinates
fig = go.Figure()
fig.add_trace(go.Scatter3d(x=x, y=y, z=z, mode="markers"))
fig.show()

It attempts to map 3d input data (uniformly distributed on a unit sphere) to a 3d output, using cosine distance for both the input and the output metrics. However, the resulting points all lie on a single line with values around the 70-105 range, which makes zero sense:

Image

I tried scaling the input metric and adjusting min_dist and spread, but I wasn't able to make it produce any even remotely sensible looking results.

ruro avatar Apr 24 '25 11:04 ruro

I think output_metric="cosine" will be quirky especially if you don't start with a good initialization (and spectral will likely not be great). I also think it is worth looking at the Haversine example -- the data wraps around and the raw coordinates don't produce sensible looking results -- you need an appropriate transform to get things working better. That may or may not help.
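As a rough sketch of the kind of transform meant here: a 2D haversine embedding can be read as a pair of angles and mapped onto the unit sphere in 3D. This assumes the first coordinate is treated as the polar angle and the second as the azimuth, which is how the docs example appears to do it; the function name `sphere_to_xyz` and the sample values are hypothetical.

```python
import math

def sphere_to_xyz(theta, phi):
    """Map a polar angle theta and an azimuth phi (radians) to a unit 3D vector."""
    x = math.sin(theta) * math.cos(phi)
    y = math.sin(theta) * math.sin(phi)
    z = math.cos(theta)
    return (x, y, z)

# Hypothetical raw embedding rows. The second row differs from the first only
# by a full turn in phi, so it wraps around to the same point on the sphere.
raw_embedding = [(0.3, 1.0), (0.3, 1.0 + 2 * math.pi), (2.5, -4.0)]
xyz = [sphere_to_xyz(theta, phi) for theta, phi in raw_embedding]
for point in xyz:
    print(point)
```

The raw coordinates of the first two rows look far apart, but after the transform they coincide, which is why plotting the raw values directly can be misleading.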

lmcinnes avatar Apr 24 '25 18:04 lmcinnes

Well, I understand that the Haversine metric essentially operates on the 2d reparametrization instead of the full 3d vectors; however, I don't understand how that is relevant.

Cosine distance is bounded in range [0; 2], Haversine distance is bounded in range [0; pi]. Cosine distance has the invariant/assumption that the vectors it operates on have unit length. Haversine distance has the invariant/assumption that the latitude and longitude are periodic.

In fact, I think that cosine distance and haversine distance should be equivalent (up to a monotonic distance transform). The fact that Haversine distance produces perfectly reasonable/sensible results (after reparametrizing back to 3d) while the Cosine distance dumps all of the points onto a single line/point (even taking into account normalizing/projecting the results onto the unit sphere) seems extremely suspicious to me.
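To make the claimed equivalence concrete, here is a minimal numerical check, assuming unit-sphere points given as (latitude, longitude) in radians. On the unit sphere the haversine distance is the angle between two points, so the cosine distance should equal 1 - cos(haversine), which is monotonically increasing on [0, pi]. All function names below are illustrative and not part of UMAP.

```python
import math

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in radians between two points on the unit sphere."""
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.
    return 2 * math.asin(math.sqrt(min(1.0, a)))

def latlon_to_xyz(lat, lon):
    """Convert latitude/longitude (radians) to a unit 3D vector."""
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def cosine_distance_unit(u, v):
    """Cosine distance for vectors that are already unit length."""
    return 1.0 - sum(a * b for a, b in zip(u, v))

# A few arbitrary (lat, lon) pairs in radians, chosen only for illustration.
pairs = [((0.1, 0.2), (0.5, -1.0)),
         ((-1.2, 2.0), (0.7, 3.0)),
         ((0.0, 0.0), (0.0, 1.5))]

for (lat1, lon1), (lat2, lon2) in pairs:
    h = haversine(lat1, lon1, lat2, lon2)
    c = cosine_distance_unit(latlon_to_xyz(lat1, lon1), latlon_to_xyz(lat2, lon2))
    print(h, c, 1.0 - math.cos(h))
```

The last two printed values agree up to floating-point error for each pair, so the two metrics order pairs of points identically even though they differ in scale.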

ruro avatar Apr 24 '25 20:04 ruro