Reproducibility depends on `n_neighbors`
I encountered an issue with the reproducibility of UMAP v0.5.3 (installed via conda). Here's the code:
import umap
import numpy as np

random_data = np.random.random((100, 1)).astype(np.float32)
for n_neighbors in range(2, 10):
    mapper_1 = umap.UMAP(n_neighbors=n_neighbors, random_state=1337)
    mapper_2 = umap.UMAP(n_neighbors=n_neighbors, random_state=1337)
    embedding_1 = mapper_1.fit_transform(random_data)
    embedding_2 = mapper_2.fit_transform(random_data)
    distance = np.linalg.norm(embedding_1 - embedding_2)
    print(f"{n_neighbors=}: {np.allclose(embedding_1, embedding_2)=}, {distance=}")
I get non-reproducible results for at least `n_neighbors=2` and `n_neighbors=3`. However, even this behavior is not reproducible:
# output of first run of above code snippet
n_neighbors=2: np.allclose(embedding_1, embedding_2)=False, distance=107.652954
n_neighbors=3: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=4: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=5: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=6: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=7: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=8: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=9: np.allclose(embedding_1, embedding_2)=True, distance=0.0
# output of another run, i.e., with new random_data
n_neighbors=2: np.allclose(embedding_1, embedding_2)=False, distance=130.44937
n_neighbors=3: np.allclose(embedding_1, embedding_2)=False, distance=97.9669
n_neighbors=4: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=5: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=6: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=7: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=8: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=9: np.allclose(embedding_1, embedding_2)=True, distance=0.0
I'm glad for any help. Cheers! :slightly_smiling_face:
There is a strange issue in the spectral initialization when multiple connected components of the graph occur. I did spend some time on this and may have tracked down the issue (I don't recall exactly), but there was some subtlety to it, so it may persist. You can work around it by using `init="random"` if necessary.
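To illustrate the connected-components point, here is a minimal stdlib-only sketch. It assumes a plain symmetric k-nearest-neighbor graph on 1-D data, which is a simplification and not UMAP's actual graph construction, and `knn_components` is a hypothetical helper, not part of the umap API. The `k` here counts other points only, so it may not map one-to-one onto UMAP's `n_neighbors`.

```python
import random


def knn_components(points, k):
    """Count connected components of a symmetric k-NN graph on 1-D points.

    Simplified illustration only: each point is linked to its k nearest
    other points. This is NOT how UMAP builds its fuzzy graph.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per component

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Other points, ordered by distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: abs(points[i] - points[j]),
        )
        for j in others[:k]:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j  # merge the two components

    return len({find(i) for i in range(n)})


rng = random.Random(1337)
data = [rng.random() for _ in range(100)]  # 100 random 1-D points, as in the report
for k in range(1, 10):
    print(f"k={k}: components={knn_components(data, k)}")
```

In this simplified model the component count can only stay the same or drop as `k` grows, which fits the observation that only the smallest `n_neighbors` values misbehave. If a small `k` yields more than one component on your data, `init="random"` avoids the spectral initialization altogether.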
Not sure if the issue still exists, but that workaround did serve as a fix when I ran into a problem similar to the OP's. Thanks!
@lmcinnes I believe that this issue should be mentioned in the UMAP Reproducibility documentation. I struggled with UMAP not working until I realised it was due to the `n_neighbors` value. This is not clearly documented at present, which leads to confusion for users.