Random state doesn't work with metric='precomputed'

Open Odessit007 opened this issue 1 year ago • 2 comments

Hi. I'm facing an unexpected reproducibility issue with my dataset. I have a fairly large matrix and am doing hyperparameter tuning that includes UMAP. To reduce run time, I cached the data preprocessing and distance computations. When I applied the "best" set of hyperparameters in my notebook, I got different results even though random_state was set. My UMAP version is 0.5.3.

I'm sharing code to reproduce the problem, along with a small subset of my data that shows the same reproducibility issue as the full dataset.

import pickle
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import umap


with open('example.pkl', 'rb') as fin:
    B = pickle.load(fin)


# No issue when passing data directly
embedding1 = umap.UMAP(
        n_neighbors=10,
        random_state=42
    ).fit_transform(B)

embedding2 = umap.UMAP(
        n_neighbors=10,
        random_state=42
    ).fit_transform(B)

print(np.max(np.abs(embedding1 - embedding2)))


# But if I use metric='precomputed' and pass a distance matrix, reproducibility is lost.
D = euclidean_distances(B)

embedding3 = umap.UMAP(
        n_neighbors=10,
        random_state=42,
        metric='precomputed'
    ).fit_transform(D)

embedding4 = umap.UMAP(
        n_neighbors=10,
        random_state=42,
        metric='precomputed'
    ).fit_transform(D)

print(np.max(np.abs(embedding3 - embedding4)))
example.pkl.zip
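
When comparing two runs like the ones above, it can help to report exact equality, approximate equality, and the largest deviation together, since a single printed number does not say whether the difference is floating-point noise or a genuinely different layout. The helper below is only a sketch: `compare_embeddings` and its `atol` default are hypothetical and not part of umap.

```python
import numpy as np


def compare_embeddings(a, b, atol=1e-6):
    """Summarize how far two embeddings of the same data differ.

    `atol` is an arbitrary illustrative tolerance, not a umap default.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return {
        # Bitwise-equal coordinates, which is what a fixed seed should give.
        "identical": bool(np.array_equal(a, b)),
        # Equal up to a small tolerance (floating-point noise only).
        "close": bool(np.allclose(a, b, atol=atol)),
        # Largest coordinate-wise deviation, as printed in the snippet above.
        "max_abs_diff": float(np.max(np.abs(a - b))),
    }
```

It could be called as `compare_embeddings(embedding3, embedding4)` in place of the final `print`.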

Odessit007, Jun 08 '23 22:06

Interestingly, if I use random data of the same size, the problem doesn't happen.

B2 = np.random.normal(size=B.shape)
D2 = euclidean_distances(B2)

embedding5 = umap.UMAP(
        n_neighbors=10,
        random_state=42,
        metric='precomputed'
    ).fit_transform(D2)

embedding6 = umap.UMAP(
        n_neighbors=10,
        random_state=42,
        metric='precomputed'
    ).fit_transform(D2)

print(np.max(np.abs(embedding5 - embedding6)))
Odessit007, Jun 08 '23 22:06

I have the same problem with my own data. The peculiar thing is that I cannot create a minimal reproducible example. It seems to be something very specific to my data, because I cannot recreate the error with simulated data. The pairwise distance matrix I use does not contain any NaN values, and all values are between 0 and 1.

Reducing the matrix size appears to restore reproducibility (see the code snippets below), but again, this size dependence does not show up with simulated data.
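
The properties mentioned above (no NaN values, values in a known range) can be checked together with a few others that a precomputed distance matrix is normally expected to have. The sketch below is an assumption about what is worth checking, not something umap runs itself; `check_precomputed_distances` is a hypothetical helper name.

```python
import numpy as np


def check_precomputed_distances(D, atol=1e-8):
    """Report basic sanity properties of a square pairwise distance matrix."""
    D = np.asarray(D, dtype=float)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {D.shape}")
    off_diag = ~np.eye(D.shape[0], dtype=bool)
    return {
        "has_nan": bool(np.isnan(D).any()),
        "min": float(np.nanmin(D)),
        "max": float(np.nanmax(D)),
        "symmetric": bool(np.allclose(D, D.T, atol=atol, equal_nan=True)),
        "zero_diagonal": bool(np.allclose(np.diag(D), 0.0, atol=atol)),
        # Zeros off the diagonal mean two different rows are at distance 0,
        # i.e. duplicate points. In a symmetric matrix each pair counts twice.
        "n_zero_off_diagonal": int(np.count_nonzero(D[off_diag] == 0.0)),
    }
```

Running it on the real matrix (here `pdm_hamming`) and on a simulated one may show where the two differ, for example in the number of duplicate points.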

mapper1 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming)
mapper2 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming)

np.all(mapper1.embedding_ == mapper2.embedding_)

returns False

but

mapper1 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming[:1000, :1000])
mapper2 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming[:1000, :1000])

np.all(mapper1.embedding_ == mapper2.embedding_)

returns True

and so does

mapper1 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming[1000:, 1000:])
mapper2 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming[1000:, 1000:])

np.all(mapper1.embedding_ == mapper2.embedding_)
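
One difference between real and simulated data that might be worth ruling out is tied distances: Hamming distances take only a few distinct values, so many points can sit at exactly the same distance from a given row, whereas random Gaussian data almost never produces ties. Nothing in this thread confirms that ties cause the problem; the sketch below is only a diagnostic for that hypothesis. `count_boundary_ties` is a hypothetical helper, and it assumes that the self-distance on the diagonal counts as one of the `n_neighbors` nearest entries.

```python
import numpy as np


def count_boundary_ties(D, n_neighbors=15):
    """Count rows whose last kept neighbour ties with the first dropped one.

    For such rows the choice of the n_neighbors nearest points is ambiguous.
    """
    D = np.asarray(D, dtype=float)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {D.shape}")
    if not 1 <= n_neighbors < D.shape[0]:
        raise ValueError("n_neighbors must be between 1 and n_samples - 1")
    sorted_d = np.sort(D, axis=1)
    # Column n_neighbors - 1 is the last distance kept,
    # column n_neighbors is the first distance left out.
    ties = sorted_d[:, n_neighbors - 1] == sorted_d[:, n_neighbors]
    return int(np.count_nonzero(ties))
```

Comparing `count_boundary_ties(pdm_hamming)` with the counts for the two 1000-row sub-blocks and for a simulated matrix would show whether ties track the cases that fail to reproduce.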

Any help would be greatly appreciated :)

tlkaufmann, Jun 21 '23 15:06