[BUG] The UMAP implementation gives much worse results than on CPU
Describe the bug
The GPU-accelerated implementation from cuml can give much worse results than the CPU alternative from the umap package on a simple dataset. By visual inspection, the clusters are less separable and there are many outliers. I wonder whether the gap could be bridged by some non-obvious customization that my example is missing? Any help appreciated 🙏
NOTE: I show a toy example to facilitate debugging. I have also seen a complex NLP pipeline, with UMAP responsible for dimensionality reduction, where switching from umap to cuml cost as much as 8% in terms of the coherence score.
Steps/Code to reproduce bug
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
%matplotlib inline
import umap
import cuml

# Load MNIST (70,000 samples, 784 features)
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
y = y.astype(int)

# Fit the CPU (umap) and GPU (cuml) models with identical parameters
umap_model_1 = umap.UMAP(random_state=42, n_components=2, n_neighbors=12, min_dist=0.0).fit(X)
umap_model_2 = cuml.UMAP(random_state=42, n_components=2, n_neighbors=12, min_dist=0.0).fit(X)
embeds_1 = umap_model_1.transform(X)
embeds_2 = umap_model_2.transform(X)

# Plot the two embeddings side by side, colored by digit label
fig, axs = plt.subplots(1, 2)
axs[0].scatter(embeds_1[:, 0], embeds_1[:, 1], c=y, s=0.1, cmap='Spectral')
axs[1].scatter(embeds_2[:, 0], embeds_2[:, 1], c=y, s=0.1, cmap='Spectral')
plt.show()
Expected behavior
Results should be much closer.
Environment details (please complete the following information):
- Linux Distro/Architecture: [Ubuntu 20.04 x86_64]
- GPU Model/Driver: [L4 / GeForce RTX 3090]
- CUDA: [12.2]
- Docker by NVIDIA: nvcr.io/nvidia/pytorch:23.09-py3
Thanks for the issue @maciejskorski, and thanks for the great and easy-to-repro example/code :). I can confirm it reproduces on totally different hardware; we'll be looking into it alongside a few updates we want to make to UMAP. The discrepancies, along with the points scattered between clusters, are larger than I would've expected. Will update the issue as we progress with findings.
Can we get an update on this issue? I also faced it while trying to use cuml.UMAP to speed up BERTopic. I've noticed that the problem gets much worse when the number of prediction samples increases relative to the number of training samples.
Building on @maciejskorski's example above:
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
%matplotlib inline
import umap
import cuml

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
y = y.astype(int)

# Fit both models on the first 10k samples only
umap_model_1 = umap.UMAP(random_state=42, n_components=2, n_neighbors=12, min_dist=0.0).fit(X[:10000])
umap_model_2 = cuml.UMAP(random_state=42, n_components=2, n_neighbors=12, min_dist=0.0).fit(X[:10000])

# Transform 10k held-out samples vs. 60k held-out samples
embeds_1_small = umap_model_1.transform(X[10000:20000])
embeds_2_small = umap_model_2.transform(X[10000:20000])
embeds_1_large = umap_model_1.transform(X[10000:70000])
embeds_2_large = umap_model_2.transform(X[10000:70000])

fig, axs = plt.subplots(2, 2, figsize=(12, 8))
axs[0, 0].scatter(embeds_1_small[:, 0], embeds_1_small[:, 1], c=y[10000:20000], s=0.1, cmap='Spectral')
axs[0, 0].set_title("CPU, 10k predictions")
axs[0, 1].scatter(embeds_2_small[:, 0], embeds_2_small[:, 1], c=y[10000:20000], s=0.1, cmap='Spectral')
axs[0, 1].set_title("GPU, 10k predictions")
axs[1, 0].scatter(embeds_1_large[:, 0], embeds_1_large[:, 1], c=y[10000:70000], s=0.1, cmap='Spectral')
axs[1, 0].set_title("CPU, 60k predictions")
axs[1, 1].scatter(embeds_2_large[:, 0], embeds_2_large[:, 1], c=y[10000:70000], s=0.1, cmap='Spectral')
axs[1, 1].set_title("GPU, 60k predictions")
plt.show()
Echoing @Bougeant, this has been my experience using cuML UMAP with BERTopic as well, to the point that I never use the cuML implementation of UMAP. It simply has never worked well for any of the pre-trained embedding models I've used; I always get the results @Bougeant shows in the bottom-right figure.
Also echoing this: in my experience the results from cuML's UMAP are often unusable, which is a shame as it's so fast! Funnily enough, though, I've seen the opposite behaviour to @Bougeant: when running transform on ~25k instances, the clusters are fairly well separated when fit is run on a random sample of 1,000 instances, but increasing this to 5,000 or running fit_transform on the full 25k leads to results like the bottom-right figure.
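For concreteness, a minimal sketch of that protocol, where X is a hypothetical stand-in for the ~25k-row embedding matrix and the sample sizes are the ones mentioned above:

import numpy as np
import cuml

# X: ~25k instances (e.g. text embeddings); hypothetical stand-in
rng = np.random.default_rng(42)
idx = rng.choice(len(X), size=1000, replace=False)

# Fit on a random sample of 1,000 instances, then transform all ~25k
model = cuml.UMAP(n_components=2, random_state=42).fit(X[idx])
embeds = model.transform(X)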
Fingers crossed on a fix for this one @dantegd! 🤞🏻
@MaartenGr, as the author of BERTopic, you might be interested in this.
Hey everyone,
Sorry for being late to this discussion. I think one of the problems here might be the assumption that cuML's UMAP will always yield exactly the same results as the CPU-based reference implementation for the exact same parameter settings. We did some parallelism magic on the GPU side to speed up the algorithm, and as a result some of the parameters (such as the number of iterations for the solver) might need to be tweaked a bit.
In addition, the underlying GPU-accelerated spectral embedding initialization primitive has gotten fairly old by this point and hasn't been updated in quite some time, so it has been accumulating little bugs as CUDA versions advance and the code itself becomes more stale. I suggest trying the random initialization, along with adjusting the number of neighbors and the number of iterations, to see if that improves the quality of your embeddings.
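For anyone who wants to try this on the MNIST repro above, a minimal sketch of the suggested workaround; the values for n_neighbors and n_epochs are illustrative starting points, not tuned recommendations:

import cuml

# X: the MNIST array from the repro above
umap_model = cuml.UMAP(
    random_state=42,
    n_components=2,
    min_dist=0.0,
    init="random",   # skip the spectral initialization discussed above
    n_neighbors=30,  # illustrative: raised from the repro's value of 12
    n_epochs=1000,   # illustrative: more solver iterations than the default
).fit(X)
embeds = umap_model.transform(X)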
We have an engineer ramping up to fix the spectral clustering initialization, and they will also be working to improve the end-to-end quality of the results. Again, I apologize sincerely for the delay in replying to this thread.
Thanks for the follow up @cjnolet — happy to hear this is being worked on!
Thanks for the update @cjnolet!
It's quite clear to me that we're not going to get exactly the same results from cuML's UMAP and the CPU-based umap.UMAP. However, while in my example above the CPU and GPU clusters for 10k datapoints are probably of similar quality, the GPU-generated clusters for 60k datapoints are clearly worse (even though, when we zoom into the ±15 range, the clusters look decent for the GPU case as well).
@Bougeant Yes, the GPU should not look worse than the CPU; that's not expected.
What is expected, though, is that the same parameter settings might yield different results, which sometimes means the number of iterations needs to be tweaked.
If you have a moment to try init="random", it would be helpful to know whether that improves anything for you.
I ran into this issue myself trying to reduce 200k text embeddings to 2D. First off, it's impressive that UMAP can run on such a large dataset in only a couple of seconds. :)
But yes, I am also seeing poor performance, such as wide x/y ranges (±20) on the reduced embeddings, even when using init="random" and tweaking other parameters such as n_epochs.
Hi all, we are working on single-cell spatial data with millions of cells, and the embeddings we see from the GPU are not as clear as the ones from the CPU implementation. We have been watching this issue since April and have just checked in to see the updates. It is true that we do not expect to see exactly the same result, but we would expect a similar degree of separation in the resulting point clouds. While the GPU is clearly the path for us to ramp up analysis speed, we would like some clarity on this issue. We are also watching another issue, https://github.com/rapidsai/cuml/issues/5782, which we think is related to the poor performance observed here; there has been no update in that ticket.
The initialization of UMAP is indeed super important for capturing the global structure well (https://www.nature.com/articles/s41587-020-00809-z), and cuML seems to have a bug regarding that (see #5782). The random initialization is not a good option according to these results; it does not seem to capture the global structure well, at least with the default parameters.
I've managed to get better UMAPs by astronomically increasing the number of epochs (to 500,000!), the number of neighbors, and the negative sample rate (which controls the repulsive samples). By doing that I end up with nice results even for large datasets. However, this completely undermines the usefulness of a GPU-based method, as it can take more time than the CPU implementation to attain a similar quality. I think this is a critical limitation of cuML, and a dangerous one for scientific analysis, since libraries for single-cell analysis are now providing the option of using cuML to perform UMAP.
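For reference, those settings translate to something like the following sketch; the exact values for n_neighbors and negative_sample_rate are illustrative, since the comment above does not state them:

import cuml

# Brute-force settings in the spirit of the comment above; extremely slow,
# shown only to indicate which knobs were turned
umap_model = cuml.UMAP(
    n_epochs=500_000,         # the astronomically large epoch count mentioned
    n_neighbors=50,           # illustrative: raised well above the default of 15
    negative_sample_rate=20,  # illustrative: more repulsive (negative) samples
).fit(X)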
Would it be possible (and maybe easier to implement) to let us provide our own initialization as a parameter?
This would allow us to use PCA, for instance, which can lead to higher quality than the current spectral or random initialization. It would also allow us to apply UMAP and then resume it afterwards with more iterations.
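For comparison, the CPU umap package already accepts an array for its init parameter, so the requested behaviour would look roughly like this sketch (CPU-only; not currently part of the cuml.UMAP API):

from sklearn.decomposition import PCA
import umap

# PCA-based initialization, as supported by the CPU umap package;
# the request above is for an equivalent array-valued init in cuml.UMAP
pca_init = PCA(n_components=2).fit_transform(X)
umap_model = umap.UMAP(init=pca_init, n_components=2).fit(X)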