
Transform gives very different results for the same data points depending on whether other rows are present

Open nbeuchat opened this issue 3 years ago • 9 comments

Hi there!

I have an issue when running umap.transform(X) on an already fitted UMAP object. The same data points give completely different embeddings depending on whether they are in a larger X with other rows. For some reason, the rows do not appear to be independent when going through the transform, which results in embeddings that are off.

What could be the reason for it? Is there a parameter that I could use to prevent this at transform time?

When I run the transform twice on the same X and compute the difference of the resulting embeddings, I get an average absolute difference of 0.0146, which is fine (a 0.5% difference).

But when I run the transform on a larger X and compare its output for the first two rows against the transform of just those two rows, the results are completely different: an average absolute difference of 2.958 (an 82% difference!).

Here is the code that I have:

n_samples = 2
X_full.shape  # (3, 512) -> the same happens with much larger N
X_limited = X_full[:n_samples, :]  # (2, 512)

# `umap` here is the already-fitted UMAP object
y_full = umap.transform(X_full)
y_full_repeat = umap.transform(X_full)
y_limited = umap.transform(X_limited)

# Difference between transforms of the same data points (the first two rows)
diff_repeat = abs(y_full[:n_samples] - y_full_repeat[:n_samples]).mean()  # 0.0146295605
diff_limited = abs(y_limited - y_full[:n_samples]).mean()  # 2.9586139

The problem is that I am using this in a scikit-learn pipeline, and if I predict multiple samples at once, I get different results because the transformed embeddings are off.

Thanks a lot for your help! I hope my explanations were clear enough. Cheers, Nicolas

nbeuchat avatar Jan 30 '21 18:01 nbeuchat

The transform is stochastic, so unfortunately there is no way to remedy this. If you really need a consistent (and presumably fast) transform I would recommend looking at the ParametricUMAP option, which has a neural network learn the transform function.
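For reference, a minimal sketch of that route (assuming a umap-learn install with the TensorFlow dependency that ParametricUMAP needs; `X_train` and `X_new` are placeholder arrays):

import numpy as np
from umap.parametric_umap import ParametricUMAP

X_train = np.random.rand(1000, 512).astype("float32")
X_new = np.random.rand(3, 512).astype("float32")

embedder = ParametricUMAP(n_components=2).fit(X_train)

# transform() is a forward pass through the learned network, so each row
# is embedded independently and repeated calls agree exactly
y_all = embedder.transform(X_new)
y_two = embedder.transform(X_new[:2, :])
print(np.abs(y_all[:2] - y_two).max())  # ~0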

lmcinnes avatar Jan 30 '21 20:01 lmcinnes

Thanks for your feedback, I'll look into the parametric UMAP.

I understand that it's stochastic and that the output will always be slightly different on each run. But why would the transform of a row be completely different only when certain other rows are present in the input matrix? Shouldn't each row be independent?

nbeuchat avatar Jan 31 '21 00:01 nbeuchat

It handles things in batches, so permuting or adding rows will change things. It shouldn't be completely different (that may be an issue), but it will certainly change.
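A quick way to see this batch effect, as a sketch (with `reducer` standing in for the fitted UMAP object and `X` for the input array):

import numpy as np

perm = np.random.permutation(len(X))
y_orig = reducer.transform(X)
y_perm = reducer.transform(X[perm])[np.argsort(perm)]  # undo the permutation

# Nonzero differences reflect batching plus the stochastic optimization,
# not any change in the underlying data
print(np.abs(y_orig - y_perm).mean())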

lmcinnes avatar Jan 31 '21 01:01 lmcinnes

Understood, that makes sense. In my case, though, it does return something completely different (roughly a factor of 2 on each element, though not exactly).

X_sent_umap = umap.transform(X_sent)  # input: 3x512
X_sent_umapr = umap.transform(X_sent)  # Just repeating again with the same 3x512 input
X_sent_umap_2 = umap.transform(X_sent[:2,:])  # Remove the last row. Input: 2x512

Gives: [screenshot of the three outputs: X_sent_umap and X_sent_umapr agree closely, while X_sent_umap_2 differs by roughly a factor of 2]

I'd be happy to share a pickled version of the model with some example inputs if that helps!

nbeuchat avatar Jan 31 '21 02:01 nbeuchat

Okay, that means there is likely something different happening in the nearest neighbor search -- or, simply, the point is effectively torn between a few options. That is, its embedding is unstable because exactly where it falls in the high-dimensional space places it between two or more masses; it is a lonely point somewhere in between, so where it should go is less clear.
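One way to probe that diagnosis, as a sketch (hypothetical names: `reducer` is the fitted UMAP model, `X_train` the data it was fitted on, and `x` the unstable row):

import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=15).fit(X_train)
dist, idx = nn.kneighbors(x.reshape(1, -1))

# reducer.embedding_ holds the embedded training points; if the nearest
# neighbors split into well-separated groups there, the transform has
# several plausible targets for this point
print(reducer.embedding_[idx[0]])
print(dist[0])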

lmcinnes avatar Jan 31 '21 21:01 lmcinnes

I am also seeing this issue, the differences are significant.

mtngld avatar Mar 10 '21 22:03 mtngld

Perhaps there should be a paragraph about this behavior on https://umap-learn.readthedocs.io/en/latest/transform.html?

jondo avatar Apr 09 '21 15:04 jondo

A PR would be welcome.

lmcinnes avatar Apr 09 '21 20:04 lmcinnes

I know this issue hasn't been touched in a while, but I wanted to add that the stochastic nature of umap can come up even for points that are not near a boundary. In my case, I was running some tests and noticed that I would occasionally get wildly different values for the same input to the same model.

See below:

[[15.7598877  18.45818901]
 [15.74145603 18.47720909]
 [15.62852764 18.38959503]
 [19.18797684 16.44251251]
 [15.7022028  18.43085289]
 [15.63010788 18.44755936]
 [15.78700447 18.50206375]
 [15.70940495 18.47913742]
 [15.79792881 18.44288445]
 [15.76255035 18.47006416]
 [15.68696022 18.37179184]
 [19.04182816 16.05806732]
 [15.83024693 18.52859116]
 [15.72168922 18.53590965]
 [15.67156887 18.51717758]
 [15.91786957 18.3213253 ]
 [15.63369083 18.69378471]
 [15.7531805  18.52090454]
 [15.9455471  18.71170616]
 [15.7203846  18.43262863]]

That's the output of a umap model given an array of 20 copies of the same input. While most of the outputs hover around [~15, ~18], two of the results are at [~19, ~16].
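A minimal sketch reproducing that check (with `reducer` standing in for the fitted UMAP model and `x` for a single input row):

import numpy as np

batch = np.tile(x, (20, 1))  # 20 identical rows
y = reducer.transform(batch)

print(y)
print(y.std(axis=0))  # a large spread flags the instability described above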

@lmcinnes does the umap algorithm imply that even a very high-confidence data point could, in some instantiation, get a totally wrong result in a big enough batch?

theahura avatar May 31 '22 01:05 theahura