
Reproducibility depends on `n_neighbors`

Open dericdesta opened this issue 2 years ago • 3 comments

I encountered an issue with the reproducibility of UMAP v0.5.3 (installed via conda). Here's the code:

import umap
import numpy as np

random_data = np.random.random((100, 1)).astype(np.float32)

for n_neighbors in range(2, 10):
    mapper_1 = umap.UMAP(n_neighbors=n_neighbors, random_state=1337)
    mapper_2 = umap.UMAP(n_neighbors=n_neighbors, random_state=1337)

    embedding_1 = mapper_1.fit_transform(random_data)
    embedding_2 = mapper_2.fit_transform(random_data)

    distance = np.linalg.norm(embedding_1 - embedding_2)

    print(f"{n_neighbors=}: {np.allclose(embedding_1, embedding_2)=}, {distance=}")

I get non-reproducible results for at least n_neighbors=2 and n_neighbors=3. However, even this behavior is not reproducible:

# output of first run of above code snippet
n_neighbors=2: np.allclose(embedding_1, embedding_2)=False, distance=107.652954
n_neighbors=3: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=4: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=5: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=6: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=7: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=8: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=9: np.allclose(embedding_1, embedding_2)=True, distance=0.0

# output of another run, i.e., with new random_data
n_neighbors=2: np.allclose(embedding_1, embedding_2)=False, distance=130.44937
n_neighbors=3: np.allclose(embedding_1, embedding_2)=False, distance=97.9669
n_neighbors=4: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=5: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=6: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=7: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=8: np.allclose(embedding_1, embedding_2)=True, distance=0.0
n_neighbors=9: np.allclose(embedding_1, embedding_2)=True, distance=0.0

I'd be grateful for any help. Cheers! :slightly_smiling_face:

dericdesta avatar Jul 20 '22 13:07 dericdesta

There is a strange issue in the spectral initialization when the graph has multiple connected components. I did spend some time on this and may have tracked down the issue (I don't recall exactly), but there was some subtlety to it, so it may persist. You can work around it by using init="random" if necessary.

lmcinnes avatar Jul 20 '22 14:07 lmcinnes
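
To check whether a given dataset falls into the multiple-connected-components case described above, one option is to count the components of a plain nearest-neighbor graph. The sketch below is not UMAP's own graph construction: it assumes an exact, symmetrized k-NN graph is a reasonable stand-in, and `count_knn_components` is a hypothetical helper name. Here `k` excludes the point itself, so how it lines up with UMAP's `n_neighbors` is an assumption to verify.

```python
import numpy as np


def count_knn_components(points, k):
    """Count connected components of an exact, symmetrized k-NN graph.

    Assumption: this brute-force graph only approximates the graph UMAP builds.
    """
    n = len(points)
    # Pairwise Euclidean distances; a point is never its own neighbor here.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Union-find over the undirected edges (i, j).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbors[i]:
            root_i, root_j = find(i), find(int(j))
            if root_i != root_j:
                parent[root_i] = root_j

    return len({find(i) for i in range(n)})


rng = np.random.default_rng(0)
random_data = rng.random((100, 1)).astype(np.float32)

for k in range(1, 9):
    print(f"k={k}: {count_knn_components(random_data, k)} component(s)")
```

If the values of `k` that give more than one component roughly match the `n_neighbors` values that fail to reproduce, that would be consistent with the spectral-initialization explanation, though it does not prove it.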

Not sure if the issue still exists, but that workaround did serve as a fix when I ran into a similar problem to the OP's. Thanks!

Mirage-deux avatar Aug 24 '22 00:08 Mirage-deux

@lmcinnes I believe this issue should be mentioned in the UMAP Reproducibility documentation. I struggled with UMAP not giving reproducible results until I realised it was due to the n_neighbors value. This is not clearly covered in the existing documentation, which leads to confusion for users.

Hamza-nabil avatar Apr 05 '23 05:04 Hamza-nabil