cuml
cuml copied to clipboard
[FEA] TSNE and UMAP: Allow input to be precomputed distance matrix
Is your feature request related to a problem? Please describe. Yes. Currently, TSNE and UMAP do not actually allow for a precomputed distance matrix to be used. This is despite there being arguments that suggest this functionality exists.
Describe the solution you'd like
Implementation of methods to allow for a custom distance matrix to be used as the sole data input to TSNE and/or UMAP. This functionality currently exists in sklearn and would allow for users already using this feature in sklearn's TSNE to directly port their workflow to rapids.
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
metricstr or callable, default=’euclidean’: ...If metric is “precomputed”, X is assumed to be a distance matrix. ...
Describe alternatives you've considered
What currently exists is the knn_graph
argument for the .fit
method. The documentation suggests that allowing for a precomputed distance matrix to be used by providing "a sparse array containing the k-nearest neighbors" should be possible, but this method is not functional.
Additional context This is a highly desirable feature. Similar requests here, here, here
Thanks!
@RichieHakim did you try using the knn_graph
argument and using the fit_transform
API with the data you wish to transform? Or is the request that a workaround be avoided, and the API work exactly as sklearn's?
@divyegala That method doesn't work. Same as here: https://github.com/rapidsai/cuml/issues/4460#issuecomment-1011539376 I believe.
Could you be more specific about what doesn't work for you? Is it with a specific metric? Do you have a crash or simply bad looking results?
Here is an example for UMAP :
from cuml.datasets import make_blobs
from cuml.neighbors import NearestNeighbors
from cuml.manifold.umap import UMAP
n_samples=10000
n_features=20
n_neighbors=5
X, _ = make_blobs(n_samples=n_samples, n_features=n_features)
nn = NearestNeighbors(n_neighbors=n_neighbors)
nn.fit(X)
knn_graph = nn.kneighbors_graph(X, mode="distance")
model = UMAP(random_state=42, init='random', n_neighbors=n_neighbors)
y = model.fit_transform(X, knn_graph=knn_graph.tocsr(), convert_dtype=True)
After reading some of your posts again, I think that I could get your point. There is indeed some issue in the way the Python API for UMAP is currently designed. But, this should however not prevent users from using the precomputed KNN graph feature.
Indeed, when using the knn_graph
parameter, the Python API currently requires an unnecessary and unused X input. This problem should be fixed from the Python side as the KNN graph is still used as expected.
>>> import cupy as cp
>>> y = model.fit_transform(X, knn_graph=knn_graph, convert_dtype=True)
>>> z = model.fit_transform(cp.random.rand(n_samples, n_features), knn_graph=knn_graph, convert_dtype=True)
>>> cp.allclose(y, z)
array(True)
Created PR https://github.com/rapidsai/cuml/pull/4865
@RichieHakim
Unfortunately, the choice to not support the precomputed
metric option was intentional here because a full pairwise distance matrix requires n_samples * n_samples
entries and does not scale well in memory constrained environments. Instead, we provide the option to accept a knn graph, which can also be computed using the RAPIDS NearestNeighbors
estimator, because that takes the required space down to the order n_sample * n_neighbors
. Assuming we can get your code returning the correct results, does the knn_graph
argument satisfy your needs? We could make a feature request to update these algorithms so they accept a precomputed pairwise distance matrix but it would be helpful to know more about your use-case here.
As far as the correctness of the existing knn_graph
option- it will be helpful if you provide a small code snippet that demonstrates how you used the knn_graph
argument so that we can determine if this is a legitimate bug or if an update to the documentation might be needed. The argument should work, but it is possible it might return unexpected results if not used as intended.
That method doesn't work. Same as here: https://github.com/rapidsai/cuml/issues/4460#issuecomment-1011539376 I believe.
HDBSCAN does not accept a precomputed
pairwise distance matrix nor does it accept a precomputed knn graph. Unfortunately, accepting the precomputed knn graph in HDBSCAN is a lot more involved than UMAP and TSNE.
@RichieHakim Looking through some of your other issues, I think I understand now what you are looking for- you already have pairwise distance matrix (and we might not be able to assume here that you have access to the original data), and you'd like to be able to invoke UMAP and TSNE w/ it.
Can you give us an idea of the size of this pairwise distance matrix? While we work to prioritize this feature, would it be helpful if you were able to convert your pairwise distance matrix into a valid knn graph by using something like cupy to do a sort/argsort and then use the knn_graph
argument from there?
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.