[BUG] silhouette_score / classlabels.py check_labels function throws CUDA_ERROR_INVALID_VALUE with large labels/X sizes.
Describe the bug
silhouette_score raises a CUDA error inside check_labels when the labels / X arrays are large.
Steps/Code to reproduce bug
import numpy as np
from cuml.metrics.cluster import silhouette_score
n = 35000
labels = np.random.choice(np.arange(15000), replace=True, size=n)
X = np.random.normal(size=(n, 2))
silhouette_score(X, labels)
Error:
---------------------------------------------------------------------------
CUDADriverError Traceback (most recent call last)
Cell In [1], line 6
4 labels = np.random.choice(np.arange(15000), replace=True, size=99768)
5 X = np.random.normal(size=(99768, 2))
----> 6 silhouette_score(X, labels)
File silhouette_score.pyx:192, in cuml.metrics.cluster.silhouette_score.cython_silhouette_score()
File silhouette_score.pyx:109, in cuml.metrics.cluster.silhouette_score._silhouette_coeff()
File ~/miniconda3/envs/research/lib/python3.9/site-packages/cuml/internals/api_decorators.py:360, in ReturnAnyDecorator.__call__.<locals>.inner(*args, **kwargs)
357 @wraps(func)
358 def inner(*args, **kwargs):
359 with self._recreate_cm(func, args):
--> 360 return func(*args, **kwargs)
File ~/miniconda3/envs/research/lib/python3.9/site-packages/cuml/prims/label/classlabels.py:199, in check_labels(labels, classes)
197 smem = labels.dtype.itemsize * int(classes.shape[0])
198 validate = _validate_kernel(labels.dtype)
--> 199 validate((math.ceil(labels.shape[0] / 32),), (32, ),
200 (labels, labels.shape[0], classes,
201 classes.shape[0], valid),
202 shared_mem=smem)
204 return valid[0] == 1
File cupy/_core/raw.pyx:89, in cupy._core.raw.RawKernel.__call__()
File cupy/cuda/function.pyx:224, in cupy.cuda.function.Function.__call__()
File cupy/cuda/function.pyx:206, in cupy.cuda.function._launch()
File cupy_backends/cuda/api/driver.pyx:263, in cupy_backends.cuda.api.driver.launchKernel()
File cupy_backends/cuda/api/driver.pyx:60, in cupy_backends.cuda.api.driver.check_status()
CUDADriverError: CUDA_ERROR_INVALID_VALUE: invalid argument
Expected behavior
The CUDA exception appears once labels / X exceed a certain size (vary n in the snippet above to find the threshold). The behaviour is independent of the chunksize passed.
Environment details (please complete the following information):
- Environment location: [Ubuntu AMI]
- Linux Distro/Architecture: [Ubuntu 22.04.1 LTS x86_64]
- GPU Model/Driver: [Tesla T4/Driver Version: 515.65.01]
- CUDA: [CUDA Version: 11.7]
- Method of cuDF & cuML install: [conda]
cuml 22.10.00a220914 cuda11_py39_g9b3d15088_43 rapidsai-nightly
libcuml 22.10.00a220914 cuda11_g9b3d15088_43 rapidsai-nightly
libcumlprims 22.10.00a220804 cuda11_g2adfe69_0 rapidsai-nightly
I think there's an issue with the overuse of shared memory in the code that checks the labels. cc @cjnolet
Any update on this @viclafargue @cjnolet ?
Getting a CUDA_ERROR_ILLEGAL_ADDRESS with a large number of samples (~4.7 million). I have tried reducing chunksize down to just 100 and still get the same error. The same script runs fine with 100,000 samples.
@adaruna3 I ran into a similar issue. Did you get it fixed, or did you just switch to another package such as scikit-learn?
@djz233 @adaruna3 I am seeing the same issue with my samples with large cluster labels. I have switched to sklearn for now. Any ideas or updates on potential fixes?
Not sure if this is related, but I can see smem at 49524 bytes while the max shared memory size is 49152 bytes. @viclafargue
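To make the numbers above concrete, here is a rough sketch of the arithmetic. It only mirrors the line from the traceback (`smem = labels.dtype.itemsize * int(classes.shape[0])`) and compares it against the 49152-byte figure quoted above. The helper names are made up for illustration and are not part of cuML, and the limit is assumed to be the one that applies to this kernel launch.

```python
# Sketch only: helper names are hypothetical, not cuML API.
# 49152 bytes (48 KiB) is the max shared size reported in this thread;
# treating it as the limit for the check_labels launch is an assumption.
MAX_SHARED_BYTES = 49152


def check_labels_smem(itemsize: int, n_classes: int) -> int:
    """Shared memory requested per the traceback: itemsize * number of classes."""
    return itemsize * n_classes


def fits(smem_bytes: int, limit: int = MAX_SHARED_BYTES) -> bool:
    """True if the request stays within the assumed per-block limit."""
    return smem_bytes <= limit


def max_classes(itemsize: int, limit: int = MAX_SHARED_BYTES) -> int:
    """Largest class count whose buffer still fits in the assumed limit."""
    return limit // itemsize


if __name__ == "__main__":
    # Repro from the issue: int64 labels (8 bytes) drawn from 15000 values.
    # This is an upper bound that assumes every one of the 15000 values occurs.
    repro = check_labels_smem(8, 15000)
    print("repro request:", repro, "fits:", fits(repro))

    # Value reported in the comment above.
    print("49524 fits:", fits(49524))

    # Class counts that would still fit for 8-byte and 4-byte label dtypes.
    print("max classes (int64):", max_classes(8))
    print("max classes (int32):", max_classes(4))
```

If that reading is right, the failure depends on the number of distinct classes and the label dtype rather than on the number of samples or chunksize, which would match the reports above.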
I'm encountering the same error (CUDADriverError: CUDA_ERROR_INVALID_VALUE: invalid argument) when working with a large dataset (7.5 million). I also tried running UMAP and HDBSCAN on the entire dataset and then calculating the silhouette score on a 5% sample, but the error persists. Any updates @cjnolet?
@adityak74 this is indeed the issue. I was able to work on a fix for this: https://github.com/rapidsai/cuml/pull/5971