cuml icon indicating copy to clipboard operation
cuml copied to clipboard

[FEA] TSNE and UMAP: Allow input to be precomputed distance matrix

Open RichieHakim opened this issue 2 years ago • 6 comments

Is your feature request related to a problem? Please describe. Yes. Currently, TSNE and UMAP do not actually allow for a precomputed distance matrix to be used. This is despite there being arguments that suggest this functionality exists.

Describe the solution you'd like Implementation of methods to allow for a custom distance matrix to be used as the sole data input to TSNE and/or UMAP. This functionality currently exists in sklearn and would allow for users already using this feature in sklearn's TSNE to directly port their workflow to rapids. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html metricstr or callable, default=’euclidean’: ...If metric is “precomputed”, X is assumed to be a distance matrix. ...

Describe alternatives you've considered What currently exists is the knn_graph argument for the .fit method. The documentation suggests that allowing for a precomputed distance matrix to be used by providing "a sparse array containing the k-nearest neighbors" should be possible, but this method is not functional.

Additional context This is a highly desirable feature. Similar requests here, here, here

Thanks!

RichieHakim avatar Jul 04 '22 21:07 RichieHakim

@RichieHakim did you try using the knn_graph argument and using the fit_transform API with the data you wish to transform? Or is the request that a workaround be avoided, and the API work exactly as sklearn's?

divyegala avatar Jul 27 '22 00:07 divyegala

@divyegala That method doesn't work. Same as here: https://github.com/rapidsai/cuml/issues/4460#issuecomment-1011539376 I believe.

RichieHakim avatar Jul 27 '22 04:07 RichieHakim

Could you be more specific about what doesn't work for you? Is it with a specific metric? Do you have a crash or simply bad looking results?

Here is an example for UMAP :

from cuml.datasets import make_blobs
from cuml.neighbors import NearestNeighbors
from cuml.manifold.umap import UMAP

n_samples=10000
n_features=20
n_neighbors=5

X, _ = make_blobs(n_samples=n_samples, n_features=n_features)
nn = NearestNeighbors(n_neighbors=n_neighbors)
nn.fit(X)
knn_graph = nn.kneighbors_graph(X, mode="distance")

model = UMAP(random_state=42, init='random', n_neighbors=n_neighbors)
y = model.fit_transform(X, knn_graph=knn_graph.tocsr(), convert_dtype=True)

viclafargue avatar Aug 08 '22 15:08 viclafargue

After reading some of your posts again, I think that I could get your point. There is indeed some issue in the way the Python API for UMAP is currently designed. But, this should however not prevent users from using the precomputed KNN graph feature.

Indeed, when using the knn_graph parameter, the Python API currently requires an unnecessary and unused X input. This problem should be fixed from the Python side as the KNN graph is still used as expected.

>>> import cupy as cp
>>> y = model.fit_transform(X, knn_graph=knn_graph, convert_dtype=True)
>>> z = model.fit_transform(cp.random.rand(n_samples, n_features), knn_graph=knn_graph, convert_dtype=True)
>>> cp.allclose(y, z)
array(True)

Created PR https://github.com/rapidsai/cuml/pull/4865

viclafargue avatar Aug 10 '22 09:08 viclafargue

@RichieHakim

Unfortunately, the choice to not support the precomputed metric option was intentional here because a full pairwise distance matrix requires n_samples * n_samples entries and does not scale well in memory constrained environments. Instead, we provide the option to accept a knn graph, which can also be computed using the RAPIDS NearestNeighbors estimator, because that takes the required space down to the order n_sample * n_neighbors. Assuming we can get your code returning the correct results, does the knn_graph argument satisfy your needs? We could make a feature request to update these algorithms so they accept a precomputed pairwise distance matrix but it would be helpful to know more about your use-case here.

As far as the correctness of the existing knn_graph option- it will be helpful if you provide a small code snippet that demonstrates how you used the knn_graph argument so that we can determine if this is a legitimate bug or if an update to the documentation might be needed. The argument should work, but it is possible it might return unexpected results if not used as intended.

That method doesn't work. Same as here: https://github.com/rapidsai/cuml/issues/4460#issuecomment-1011539376 I believe.

HDBSCAN does not accept a precomputed pairwise distance matrix nor does it accept a precomputed knn graph. Unfortunately, accepting the precomputed knn graph in HDBSCAN is a lot more involved than UMAP and TSNE.

cjnolet avatar Aug 25 '22 18:08 cjnolet

@RichieHakim Looking through some of your other issues, I think I understand now what you are looking for- you already have pairwise distance matrix (and we might not be able to assume here that you have access to the original data), and you'd like to be able to invoke UMAP and TSNE w/ it.

Can you give us an idea of the size of this pairwise distance matrix? While we work to prioritize this feature, would it be helpful if you were able to convert your pairwise distance matrix into a valid knn graph by using something like cupy to do a sort/argsort and then use the knn_graph argument from there?

cjnolet avatar Aug 25 '22 18:08 cjnolet

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Sep 24 '22 19:09 github-actions[bot]