Confusing output when using output_metric="cosine"
Hi there,
I'm getting some strange results when I set the output metric to "cosine", and I'm unsure whether I'm misinterpreting what it does (probable) or there is a mistake in the code (less probable). I assume that the output_metric argument specifies the distance measure used to interpret the output points.
As a test, I input a pre-calculated distance matrix: [[0,1,2,1], [1,0,1,2], [2,1,0,1], [1,2,1,0]], which could describe 4 points, A, B, C, D, equally spaced around an origin point, measured with a cosine distance metric. I would therefore have expected the output points from UMAP to look something like that, and to have pairwise distances that closely match the input distance matrix. Instead, I get 4 points all along the same line which, if I normalise them, end up having almost exactly the same coordinates, and therefore a cosine distance of 0 between them all.
Am I right in what I'm expecting as an output, or am I misinterpreting things? One thing that seems to be missing is that distances in a Euclidean space are unbounded, but cosine distance must be between 0 and 2, so I don't know how UMAP decides that things are as far apart as possible in a cosine distance metric space. Perhaps it expects points with an output distance of 2 in the cosine space to have a distance of 'Inf' in the input distance matrix, and since the furthest distance in my matrix is 2, that's considered to be quite close? I can't see any other input arguments to tell UMAP that 2 is actually the farthest apart two points can be, and since it's a pre-calculated matrix, I can't give any more information about the distances.
Thanks, BP
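As a sanity check on the test matrix in the post above, here is a minimal sketch (NumPy only; the point layout and the helper name cosine_distance_matrix are assumptions for illustration, not part of UMAP). It places four unit vectors at 90 degree intervals around the origin and computes their pairwise cosine distances. It also shows that two points on the same ray from the origin have cosine distance 0 regardless of how far apart they are.

```python
import numpy as np

def cosine_distance_matrix(x):
    """Pairwise cosine distance: 1 - (x_i . x_j) / (|x_i| |x_j|)."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

# Four unit vectors A, B, C, D at 0, 90, 180 and 270 degrees around the origin.
angles = np.deg2rad([0, 90, 180, 270])
points = np.column_stack([np.cos(angles), np.sin(angles)])
dist = cosine_distance_matrix(points)
print(np.round(dist, 6))

# Two points on the same ray from the origin: far apart in Euclidean terms,
# but their cosine distance is zero because only the direction matters.
same_ray = np.array([[1.0, 2.0], [3.0, 6.0]])
ray_dist = cosine_distance_matrix(same_ray)
print(ray_dist[0, 1])
```

Up to floating-point rounding, dist should reproduce the matrix from the post, and ray_dist[0, 1] should be zero.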
Bounded metrics as output metrics are often a bad idea. You would probably be better off doing something akin to the example of hyperbolic space in the docs: find a space with more amenable properties that has a convenient isometry to the one you want and use that instead.
Thanks for the super fast response. I did also try the Haversine output_metric, which was quite nice because it forces things to the sphere which gives me nice cosine properties. I think I need to think about the problem some more. Thanks again! BP
@lmcinnes I just ran into the same issue. output_metric="haversine", as in the docs, works fine, but output_metric="cosine" gives nonsense results. Can you clarify what you mean by "bounded metrics"? Isn't haversine also bounded? Also, the points generated by output_metric="cosine" aren't even constrained to the unit sphere, so I am highly suspicious of the results.
For example, the following code:
import numpy as np
import plotly.graph_objs as go
import umap
# Sample 1024 points uniformly on the unit sphere in 3d.
data = np.random.randn(1024, 3)
data = data / np.linalg.norm(data, axis=-1, keepdims=True)
# Use cosine distance for both the input and the output space.
mapper = umap.UMAP(n_components=3, metric="cosine", output_metric="cosine")
mapper = mapper.fit(data)
# Plot the raw 3d embedding coordinates.
x, y, z = mapper.embedding_.T
fig = go.Figure()
fig.add_trace(go.Scatter3d(x=x, y=y, z=z, mode="markers"))
fig.show()
It attempts to map 3d input data (uniformly distributed on a unit sphere) to a 3d output, using cosine distance for both the input and the output metrics. However, the resulting points all lie on a single line, with coordinate values roughly in the 70-105 range, which makes no sense.
I tried scaling the input metric and adjusting min_dist and spread, but I wasn't able to get anything that looked even remotely sensible.
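To put a number on "all lie on a single line", something like the sketch below could be run on an embedding. It is NumPy only, and the helper collinearity_report and the synthetic arrays are hypothetical stand-ins, not UMAP output. It reports the fraction of variance along the first principal axis and the largest pairwise cosine distance.

```python
import numpy as np

def collinearity_report(points):
    """Return (variance fraction along the first principal axis,
    maximum pairwise cosine distance) for an (n, d) point cloud."""
    centered = points - points.mean(axis=0)
    variance = np.linalg.svd(centered, compute_uv=False) ** 2
    top_fraction = float(variance[0] / variance.sum())
    unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    max_cos_dist = float(np.max(1.0 - unit @ unit.T))
    return top_fraction, max_cos_dist

rng = np.random.default_rng(0)

# Synthetic stand-in for a collapsed embedding: points strung along one
# direction with coordinates roughly between 70 and 105, plus small noise.
t = rng.uniform(70.0, 105.0, size=(1024, 1))
collapsed = t * np.ones((1, 3)) + rng.normal(scale=0.01, size=(1024, 3))

# Reference cloud: points uniformly distributed on the unit sphere.
sphere = rng.normal(size=(1024, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)

print("collapsed:", collinearity_report(collapsed))
print("sphere:   ", collinearity_report(sphere))
```

Calling collinearity_report(mapper.embedding_) on the fitted model from the snippet above would give the same two numbers for the real embedding. A variance fraction near 1 together with a maximum cosine distance near 0 would confirm the collapse.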
I think output_metric="cosine" will be quirky especially if you don't start with a good initialization (and spectral will likely not be great). I also think it is worth looking at the Haversine example -- the data wraps around and the raw coordinates don't produce sensible looking results -- you need an appropriate transform to get things working better. That may or may not help.
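For reference, the kind of transform meant here can be sketched as follows. The helper name angles_to_sphere is hypothetical, and the column convention (first column used as the polar angle, second as the azimuth) is an assumption modelled on the sphere example in the UMAP docs, so check it against the docs before relying on it. The raw haversine embedding is a pair of periodic angles, and the transform maps them onto the unit sphere in 3d.

```python
import numpy as np

def angles_to_sphere(embedding):
    """Map an (n, 2) array of angles to (n, 3) points on the unit sphere.

    Assumed convention: column 0 is the polar angle, column 1 the azimuth.
    """
    theta, phi = embedding[:, 0], embedding[:, 1]
    return np.column_stack([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])

# Hypothetical usage with UMAP (not executed here):
#   angles = umap.UMAP(output_metric="haversine").fit(data).embedding_
#   xyz = angles_to_sphere(angles)

# Demo with arbitrary angles, including values far outside [0, 2*pi]:
# the raw coordinates wrap around, but the mapped points stay on the sphere.
demo_angles = np.random.default_rng(0).uniform(-10.0, 10.0, size=(5, 2))
xyz = angles_to_sphere(demo_angles)
print(np.linalg.norm(xyz, axis=1))
```

Because the mapping is periodic in both angles, raw coordinates that look far apart can land close together on the sphere, which is why plotting the untransformed embedding can be misleading.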
Well, I understand that the Haversine metric essentially operates on a 2d reparametrization instead of the full 3d vectors, but I don't understand how that is relevant.
Cosine distance is bounded in range [0; 2], Haversine distance is bounded in range [0; pi]. Cosine distance has the invariant/assumption that the vectors it operates on have unit length. Haversine distance has the invariant/assumption that the latitude and longitude are periodic.
In fact, I think that cosine distance and haversine distance should be equivalent (up to a monotonic distance transform). The fact that Haversine distance produces perfectly reasonable/sensible results (after reparametrizing back to 3d) while the Cosine distance dumps all of the points onto a single line/point (even taking into account normalizing/projecting the results onto the unit sphere) seems extremely suspicious to me.
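The claimed equivalence can be checked numerically for unit vectors. The sketch below is NumPy only, and its helper functions are written for illustration rather than taken from UMAP. It compares the haversine (great-circle) distance d between random points on the unit sphere with their cosine distance, and confirms the monotonic relation cosine = 1 - cos(d) for d in [0, pi]. This only covers the distances themselves. It says nothing about how UMAP optimises under either metric, and cosine distance ignores vector length, so nothing in the metric itself pins an embedding to the unit sphere.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance on the unit sphere, angles in radians."""
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))

def to_unit_vector(lat, lon):
    """Convert latitude/longitude in radians to 3d unit vectors."""
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)

rng = np.random.default_rng(0)
lat = rng.uniform(-np.pi / 2, np.pi / 2, size=(2, 1000))
lon = rng.uniform(-np.pi, np.pi, size=(2, 1000))

u = to_unit_vector(lat[0], lon[0])
v = to_unit_vector(lat[1], lon[1])

cosine_dist = 1.0 - np.sum(u * v, axis=-1)  # u and v already have unit length
great_circle = haversine(lat[0], lon[0], lat[1], lon[1])

# For unit vectors: cosine distance = 1 - cos(great-circle distance).
print(np.max(np.abs(cosine_dist - (1.0 - np.cos(great_circle)))))
```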