Slow fit for small datasets using UMAP or densMAP
Is it normal for UMAP or densMAP to take 5-40 s to fit a small dataset of precomputed distances (e.g. a 10x10 distance matrix)? My "test" scripts for a separate project, which include a couple of UMAP calls, end up taking about as long as if I had used a 10000 x 10000 distance matrix.
Reproducible script:
"""Time various calls to UMAP."""
import umap
import numpy as np
from time import time
from scipy.spatial.distance import pdist, squareform
class Timer(object):
    """
    Simple timer class.

    https://stackoverflow.com/a/5849861/13697228

    Usage
    -----
    with Timer("description"):
        # do stuff
    """

    def __init__(self, name=None):
        """Record name."""
        self.name = name

    def __enter__(self):
        """Enter the timer."""
        self.tstart = time()

    def __exit__(self, type, value, traceback):
        """Exit the timer."""
        if self.name:
            print("[%s]" % self.name)
        print("Elapsed: {}\n".format(round(time() - self.tstart, 5)))
X = np.random.rand(10, 5)
eucl_dm = squareform(pdist(X))
# fmt: off
# Earth Mover's Distance matrix (not related to X)
emd_dm = np.array([[ 0. , 27. , 27. , 6.86666667, 15. ,
19.65 , 10.8 , 15.6 , 9.71428571, 19.6 ],
[27. , 0. , 14.5 , 26.66666667, 20.6 ,
20.25 , 35. , 19. , 19.42857143, 25. ],
[27. , 14.5 , 0. , 23.33333333, 14.2 ,
11.25 , 24. , 18.6 , 18.42857143, 25. ],
[ 6.86666667, 26.66666667, 23.33333333, 0. , 12.66666667,
18.41666667, 13. , 13.26666667, 9.57142857, 16.33333333],
[15. , 20.6 , 14.2 , 12.66666667, 0. ,
8.15 , 18.8 , 19.4 , 16.42857143, 24.6 ],
[19.65 , 20.25 , 11.25 , 18.41666667, 8.15 ,
0. , 19.75 , 23.85 , 17.53571429, 29.25 ],
[10.8 , 35. , 24. , 13. , 18.8 ,
19.75 , 0. , 24.6 , 17.71428571, 29. ],
[15.6 , 19. , 18.6 , 13.26666667, 19.4 ,
23.85 , 24.6 , 0. , 9.02857143, 6.8 ],
[ 9.71428571, 19.42857143, 18.42857143, 9.57142857, 16.42857143,
17.53571429, 17.71428571, 9.02857143, 0. , 13.42857143],
[19.6 , 25. , 25. , 16.33333333, 24.6 ,
29.25 , 29. , 6.8 , 13.42857143, 0. ]]) #
# fmt: on
with Timer("Euclidean UMAP"):
    umap.UMAP().fit(X)

with Timer("Precomputed Euclidean UMAP"):
    umap.UMAP(metric="precomputed").fit(eucl_dm)

with Timer("Precomputed EMD UMAP"):
    umap.UMAP(metric="precomputed").fit(emd_dm)

with Timer("Precomputed EMD densMAP"):
    umap_trans = umap.UMAP(
        densmap=True,
        output_dens=True,
        dens_lambda=1.0,
        min_dist=0.01,
        n_components=2,
        metric="precomputed",
    ).fit(emd_dm)

with Timer("Precomputed EMD densMAP"):
    std_trans = umap.UMAP(
        densmap=True,
        output_dens=True,
        dens_lambda=1.0,
        metric="precomputed",
    ).fit(emd_dm)

with Timer("Precomputed EMD UMAP"):
    std_trans = umap.UMAP(metric="precomputed").fit(emd_dm)
Output:
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Euclidean UMAP]
Elapsed: 14.15193

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed Euclidean UMAP]
Elapsed: 8.00694

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP]
Elapsed: 4.76215

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP]
Elapsed: 28.30055

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP]
Elapsed: 11.316

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP]
Elapsed: 5.19057
According to the benchmarking results in the docs, very small datasets (e.g. 100 points) seem to be slower than slightly larger datasets (e.g. 800 points), though I haven't been able to reproduce this behavior.
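For reference, this is roughly how I'd try to reproduce that comparison; a minimal sketch, and the sizes and n_neighbors value are just examples rather than the settings used in the docs' benchmarks:

# Rough sketch: time UMAP fits at a couple of dataset sizes to compare
# very small vs. slightly larger inputs (sizes chosen arbitrarily).
import numpy as np
import umap
from time import time

for n in (100, 800):
    X_bench = np.random.rand(n, 5)
    t0 = time()
    umap.UMAP(n_neighbors=15).fit(X_bench)
    print(f"n={n}: {time() - t0:.2f} s")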
Oddly (to me at least), it seems to run ~2x faster in an IPython kernel 🤨
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Euclidean UMAP]
Elapsed: 6.28317

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed Euclidean UMAP]
Elapsed: 3.50397

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP]
Elapsed: 1.83107

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP]
Elapsed: 12.95584

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP]
Elapsed: 4.81335

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP]
Elapsed: 1.92252
Also, setting random_state=42 didn't seem to affect it much.
Perhaps I should use PCA as a fast swap-in for testing and then switch back to UMAP for production. For the densMAP values (i.e. the local radii), I might just need to use a random number generator.
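A rough sketch of what I mean; the function name, the fast_dev_run flag, and using PCA directly on the distance matrix are my own choices for testing, not an equivalent of UMAP/densMAP:

# Hypothetical swap-in: cheap PCA embedding plus random "radii" during tests,
# real densMAP for production runs.
import numpy as np
import umap
from sklearn.decomposition import PCA


def embed(dm, fast_dev_run=False, seed=42):
    """Embed a precomputed distance matrix `dm` into 2D."""
    if fast_dev_run:
        # PCA on the distance matrix as a fast stand-in (not equivalent to UMAP)
        emb = PCA(n_components=2).fit_transform(dm)
        # fake local radii in place of densMAP's density outputs
        rng = np.random.default_rng(seed)
        r_orig, r_emb = rng.random(len(dm)), rng.random(len(dm))
    else:
        # with output_dens=True, fit_transform returns (embedding, r_orig, r_emb)
        emb, r_orig, r_emb = umap.UMAP(
            densmap=True, output_dens=True, metric="precomputed"
        ).fit_transform(dm)
    return emb, r_orig, r_emb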
There is definitely some overhead that is going to be bad for the small cases but largely irrelevant for the large cases. As for actually being slower for very small cases: there's no obvious reason why that should be so, but it may be due to the optimization phase simply not working well for small data. The sampling-based methodology, particularly the negative sampling, works fine for larger datasets, but for very small datasets it starts violating assumptions (namely, that a random pair of samples is, in all likelihood, unrelated). This could conceivably slow things down (and I wouldn't trust the results either). As a rule of thumb, I would be worried whenever n_neighbors / n_samples isn't around 0.1 or less; the smaller the ratio, the better with regard to the optimization assumptions.
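One way to act on that rule of thumb is to cap n_neighbors based on the dataset size before constructing the UMAP object; a minimal sketch with my own helper name and defaults, not anything built into umap-learn:

# Keep n_neighbors / n_samples around 0.1 or less, with a floor of 2
# (UMAP requires n_neighbors >= 2).
def safe_n_neighbors(n_samples, requested=15, max_ratio=0.1):
    capped = max(2, min(requested, int(n_samples * max_ratio)))
    if capped < requested:
        print(f"reducing n_neighbors from {requested} to {capped} "
              f"for n_samples={n_samples}")
    return capped

# e.g. safe_n_neighbors(10) -> 2, safe_n_neighbors(800) -> 15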