cuml icon indicating copy to clipboard operation
cuml copied to clipboard

[FEA] Support for Dice coefficient as a metric in UMAP

Open beckernick opened this issue 2 years ago • 4 comments

I'd like to be able to use the Dice coefficient for UMAP like I can on the CPU. A small number of users choose the Dice metric based on this Github search.

But, as RAFT supports the Dice coefficient as a metric and cuML's DistanceType enum already supports DiceExpanded, this may be as simple as adding it a supported metric in the UMAP metric mapping dictionary.

import cuml
import umap

X, _ = cuml.datasets.make_blobs()

reducer = umap.UMAP(metric="dice")
print(reducer.fit_transform(X.get())[:5])
/home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.8/site-packages/umap/umap_.py:1802: UserWarning: gradient function is not yet implemented for dice distance metric; inverse_transform will be unavailable
  warn(
[[-1.687048   15.136897  ]
 [-1.1637341  14.760551  ]
 [ 1.4106333  14.12161   ]
 [-0.07474275 13.303681  ]
 [ 1.5589024  16.217863  ]]

reducer = cuml.manifold.umap.UMAP(metric="dice")
print(reducer.fit_transform(X)[:5])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [14], in <cell line: 10>()
      7 print(reducer.fit_transform(X.get())[:5])
      9 reducer = cuml.manifold.umap.UMAP(metric="dice")
---> 10 print(reducer.fit_transform(X)[:5])

File ~/miniconda3/envs/rapids-23.02/lib/python3.8/site-packages/cuml/internals/api_decorators.py:548, in BaseReturnArrayDecorator.__call__.<locals>.inner_set_get(*args, **kwargs)
    545         self.do_getters_with_self_no_input(self_val=self_val)
    547     # Call the function
--> 548     ret_val = func(*args, **kwargs)
    550 return cm.process_return(ret_val)

File ~/miniconda3/envs/rapids-23.02/lib/python3.8/site-packages/cuml/internals/api_decorators.py:817, in enable_device_interop.<locals>.dispatch(self, *args, **kwargs)
    815 if hasattr(self, 'dispatch_func'):
    816     func_name = gpu_func.__name__
--> 817     return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
    818 else:
    819     return gpu_func(self, *args, **kwargs)

File ~/miniconda3/envs/rapids-23.02/lib/python3.8/site-packages/cuml/internals/api_decorators.py:359, in ReturnAnyDecorator.__call__.<locals>.inner(*args, **kwargs)
    356 @wraps(func)
    357 def inner(*args, **kwargs):
    358     with self._recreate_cm(func, args):
--> 359         return func(*args, **kwargs)

File base.pyx:656, in cuml.internals.base.UniversalBase.dispatch_func()

File umap.pyx:659, in cuml.manifold.umap.UMAP.fit_transform()

File ~/miniconda3/envs/rapids-23.02/lib/python3.8/site-packages/cuml/internals/api_decorators.py:408, in BaseReturnAnyDecorator.__call__.<locals>.inner_with_setters(*args, **kwargs)
    401 self_val, input_val, target_val = \
    402     self.get_arg_values(*args, **kwargs)
    404 self.do_setters(self_val=self_val,
    405                 input_val=input_val,
    406                 target_val=target_val)
--> 408 return func(*args, **kwargs)

File ~/miniconda3/envs/rapids-23.02/lib/python3.8/site-packages/cuml/internals/api_decorators.py:817, in enable_device_interop.<locals>.dispatch(self, *args, **kwargs)
    815 if hasattr(self, 'dispatch_func'):
    816     func_name = gpu_func.__name__
--> 817     return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
    818 else:
    819     return gpu_func(self, *args, **kwargs)

File ~/miniconda3/envs/rapids-23.02/lib/python3.8/site-packages/cuml/internals/api_decorators.py:359, in ReturnAnyDecorator.__call__.<locals>.inner(*args, **kwargs)
    356 @wraps(func)
    357 def inner(*args, **kwargs):
    358     with self._recreate_cm(func, args):
--> 359         return func(*args, **kwargs)

File base.pyx:656, in cuml.internals.base.UniversalBase.dispatch_func()

File umap.pyx:569, in cuml.manifold.umap.UMAP.fit()

File umap.pyx:465, in cuml.manifold.umap.UMAP._build_umap_params()

ValueError: Invalid value for metric: dice

(Using the 23.02.00a230112 cuda11_py38_g69db20fd6_77 nightly conda package).

beckernick avatar Jan 12 '23 18:01 beckernick

@beckernick the beauty of dice distance is that it can be computed w/ a dot product and norms. This means it's super easy to compute this on a dense matrix as well if that's also going to be needed.

cjnolet avatar Jan 19 '23 21:01 cjnolet

This seems to work locally for me on dense matrices with the existing machinery (using #5140). As a note, for sparse matrices we need to explicitly coerce the input data to Boolean type for the dice metric calculation to not throw a memory error.

We can file an issue to follow up on that, if desired. This applies to the existing Jaccard metric, too. We should probably catch this error and fail gracefully

beckernick avatar Jan 19 '23 21:01 beckernick

This applies to the existing Jaccard metric, too.

Jaccard doesn't look like it's supported in the dense brute-force API either. I wonder if the test is returning all zeros or something. I'm honestly surprised the C++ is not outright throwing an error.

cjnolet avatar Jan 19 '23 21:01 cjnolet

Hi @beckernick, do you know whether having the dice metric is still useful in UMAP or if we have any specific users that are asking for it? cc @csadorf

aamijar avatar Nov 26 '25 00:11 aamijar