Random state doesn't work with metric='precomputed'
Hi. I'm facing an unexpected reproducibility issue with my dataset.
I have quite a large matrix and am doing some hyperparameter tuning that includes UMAP. To reduce run time, I cached the data preprocessing and the distance computations. When I tried to apply the "best" set of hyperparameters in my notebook, I got different results despite random_state being set. My UMAP version is 0.5.3.
Below is the code to reproduce it, along with a small section of my data that shows the same reproducibility issue as the full dataset.
import pickle
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import umap
with open('example.pkl', 'rb') as fin:
    B = pickle.load(fin)
# No issue when passing data directly
embedding1 = umap.UMAP(
    n_neighbors=10,
    random_state=42
).fit_transform(B)
embedding2 = umap.UMAP(
    n_neighbors=10,
    random_state=42
).fit_transform(B)
print(np.max(np.abs(embedding1 - embedding2)))
# But if I use metric='precomputed' and pass distance matrix, the reproducibility is lost.
D = euclidean_distances(B)
embedding3 = umap.UMAP(
    n_neighbors=10,
    random_state=42,
    metric='precomputed'
).fit_transform(D)
embedding4 = umap.UMAP(
    n_neighbors=10,
    random_state=42,
    metric='precomputed'
).fit_transform(D)
print(np.max(np.abs(embedding3 - embedding4)))
Interestingly, if I use random data of the same size, the problem doesn't happen.
B2 = np.random.normal(size=B.shape)
D2 = euclidean_distances(B2)
embedding5 = umap.UMAP(
    n_neighbors=10,
    random_state=42,
    metric='precomputed'
).fit_transform(D2)
embedding6 = umap.UMAP(
    n_neighbors=10,
    random_state=42,
    metric='precomputed'
).fit_transform(D2)
print(np.max(np.abs(embedding5 - embedding6)))
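To compare more than two runs without repeating the same block, a small helper along these lines might be convenient. This is only a sketch: `max_run_deviation` and `make_model` are hypothetical names, not part of UMAP, and the only assumption about the model is that it exposes `fit_transform`, as `umap.UMAP` does.

```python
import numpy as np


def max_run_deviation(make_model, data, n_runs=3):
    """Fit a fresh model n_runs times on the same data and return the
    largest absolute difference between the first embedding and any
    later one. 0.0 means every run matched the first exactly."""
    # make_model is a zero-argument callable returning a new, unfitted model.
    reference = np.asarray(make_model().fit_transform(data))
    worst = 0.0
    for _ in range(n_runs - 1):
        other = np.asarray(make_model().fit_transform(data))
        worst = max(worst, float(np.max(np.abs(reference - other))))
    return worst
```

With the variables above it could be called as `max_run_deviation(lambda: umap.UMAP(n_neighbors=10, random_state=42, metric='precomputed'), D)`.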
I have the same problem with my own data. The peculiar thing is that I cannot create a minimal reproducible example. It seems to be something very specific to my data, because I cannot recreate the error with simulated data. The pairwise distance matrix I use does not contain any NaN values, and all values are between 0 and 1.
It seems that if I reduce the size it works (see code snippets below) but this size issue again does not appear for simulated data.
mapper1 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming)
mapper2 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming)
np.all(mapper1.embedding_ == mapper2.embedding_)
returns False
but
mapper1 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming[:1000, :1000])
mapper2 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming[:1000, :1000])
np.all(mapper1.embedding_ == mapper2.embedding_)
returns True
and so does
mapper1 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming[1000:, 1000:])
mapper2 = umap.UMAP(metric='precomputed', random_state=42).fit(pdm_hamming[1000:, 1000:])
np.all(mapper1.embedding_ == mapper2.embedding_)
Any help would be greatly appreciated :)
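Since the problem shows up with real data but not with simulated data, it may be worth summarising how the two distance matrices differ. The sketch below is a guess at properties worth ruling out, not a diagnosis: real data (Hamming distances especially) can contain exact duplicates and tied distances, which continuous random data almost never does, and nothing in this thread establishes that ties are the cause. `describe_distance_matrix` is a hypothetical helper that uses only NumPy and assumes a square matrix with at least two rows.

```python
import numpy as np


def describe_distance_matrix(D, n_neighbors=15):
    """Summarise basic properties of a square precomputed distance matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    sorted_rows = np.sort(D, axis=1)
    # Compare the k-th and (k+1)-th smallest entry of each row (the row's
    # own zero self-distance is included in the count). A tie here means
    # the k nearest neighbours of that point are not uniquely defined.
    k = min(n_neighbors, n - 1)
    boundary_ties = sorted_rows[:, k - 1] == sorted_rows[:, k]
    return {
        "all_finite": bool(np.isfinite(D).all()),
        "symmetric": bool(np.allclose(D, D.T)),
        "zero_diagonal": bool(np.all(np.diag(D) == 0)),
        "min": float(D.min()),
        "max": float(D.max()),
        # Zero distances between different points, counted once per
        # ordered pair, i.e. exact duplicates.
        "zero_off_diagonal_entries": int(np.count_nonzero(D[off_diag] == 0)),
        "rows_with_boundary_tie": int(np.count_nonzero(boundary_ties)),
    }
```

Running it on both `pdm_hamming` and a simulated matrix of the same shape, with the `n_neighbors` value actually passed to UMAP, would show whether the real matrix differs in any of these respects.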