[BUG] silhouette_score / classlabels.py check_labels function throws CUDA_ERROR_INVALID_VALUE with large labels/X sizes.

Open goraj opened this issue 3 years ago • 8 comments

Describe the bug silhouette_score raises a CUDA error when check_labels runs on larger labels/X arrays.

Steps/Code to reproduce bug

import numpy as np
from cuml.metrics.cluster import silhouette_score


n = 35000
labels = np.random.choice(np.arange(15000), replace=True, size=n)
X = np.random.normal(size=(n, 2))
silhouette_score(X, labels)

Error:

---------------------------------------------------------------------------
CUDADriverError                           Traceback (most recent call last)
Cell In [1], line 6
      4 labels = np.random.choice(np.arange(15000), replace=True, size=99768)
      5 X = np.random.normal(size=(99768, 2))
----> 6 silhouette_score(X, labels)

File silhouette_score.pyx:192, in cuml.metrics.cluster.silhouette_score.cython_silhouette_score()

File silhouette_score.pyx:109, in cuml.metrics.cluster.silhouette_score._silhouette_coeff()

File ~/miniconda3/envs/research/lib/python3.9/site-packages/cuml/internals/api_decorators.py:360, in ReturnAnyDecorator.__call__.<locals>.inner(*args, **kwargs)
    357 @wraps(func)
    358 def inner(*args, **kwargs):
    359     with self._recreate_cm(func, args):
--> 360         return func(*args, **kwargs)

File ~/miniconda3/envs/research/lib/python3.9/site-packages/cuml/prims/label/classlabels.py:199, in check_labels(labels, classes)
    197 smem = labels.dtype.itemsize * int(classes.shape[0])
    198 validate = _validate_kernel(labels.dtype)
--> 199 validate((math.ceil(labels.shape[0] / 32),), (32, ),
    200          (labels, labels.shape[0], classes,
    201          classes.shape[0], valid),
    202          shared_mem=smem)
    204 return valid[0] == 1

File cupy/_core/raw.pyx:89, in cupy._core.raw.RawKernel.__call__()

File cupy/cuda/function.pyx:224, in cupy.cuda.function.Function.__call__()

File cupy/cuda/function.pyx:206, in cupy.cuda.function._launch()

File cupy_backends/cuda/api/driver.pyx:263, in cupy_backends.cuda.api.driver.launchKernel()

File cupy_backends/cuda/api/driver.pyx:60, in cupy_backends.cuda.api.driver.check_status()

CUDADriverError: CUDA_ERROR_INVALID_VALUE: invalid argument

Expected behavior The CUDA exception appears once labels / X reach a certain size, which can be varied via n. The behaviour is independent of the chunksize passed.

Environment details (please complete the following information):

  • Environment location: [Ubuntu AMI]
  • Linux Distro/Architecture: [Ubuntu 22.04.1 LTS x86_64]
  • GPU Model/Driver: [Tesla T4/Driver Version: 515.65.01]
  • CUDA: [CUDA Version: 11.7]
  • Method of cuDF & cuML install: [conda]
cuml                      22.10.00a220914 cuda11_py39_g9b3d15088_43    rapidsai-nightly
libcuml                   22.10.00a220914 cuda11_g9b3d15088_43    rapidsai-nightly
libcumlprims              22.10.00a220804 cuda11_g2adfe69_0    rapidsai-nightly

goraj avatar Oct 11 '22 00:10 goraj

I think there's an issue with the overuse of shared memory in the code that checks the labels. cc @cjnolet

viclafargue avatar Oct 21 '22 09:10 viclafargue

Any update on this @viclafargue @cjnolet ?

goraj avatar Nov 03 '22 15:11 goraj

Getting a CUDA_ERROR_ILLEGAL_ADDRESS with large number of samples (~4.7 million). I have tried reducing chunksize down to just 100; still same error. The same script runs fine with 100,000 samples.

adaruna3 avatar Jun 09 '23 00:06 adaruna3

@adaruna3 I ran into a similar issue. Did you get it fixed, or did you just switch to another package like scikit-learn?

djz233 avatar Jan 12 '24 06:01 djz233

@djz233 @adaruna3 I am seeing the same issue with my samples with large cluster labels. I have switched to sklearn for now. Any ideas or updates on potential fixes?

adityak74 avatar Jan 15 '24 05:01 adityak74

Not sure if this is related, but I can see smem at 49524 bytes, while the max shared size is 49152 bytes. @viclafargue

adityak74 avatar Jan 15 '24 06:01 adityak74
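
The numbers in the comment above can be checked on the host without a GPU. The following is a minimal sketch, not cuML code: it mirrors the `smem` expression from the traceback (`labels.dtype.itemsize * classes.shape[0]`) and compares it with the 49152-byte limit quoted in this thread. It assumes `classes` holds the distinct label values; the helper names `requested_shared_bytes` and `fits_in_shared_memory` are hypothetical, and the limit may differ on other GPUs.

```python
import numpy as np

# Assumption: 49152 bytes (48 KiB) is the shared-memory limit reported in
# this thread; other GPUs or launch configurations may allow more or less.
MAX_SHARED_BYTES = 49152


def requested_shared_bytes(labels):
    """Mirror the `smem` expression from the traceback.

    Assumes `classes` is the array of distinct label values.
    """
    classes = np.unique(labels)
    return labels.dtype.itemsize * int(classes.shape[0])


def fits_in_shared_memory(labels, limit=MAX_SHARED_BYTES):
    """Return True if the requested shared memory stays within `limit`."""
    return requested_shared_bytes(labels) <= limit


# int64 labels: 49152 / 8 = 6144 distinct classes is the largest count that fits.
small = np.arange(6144, dtype=np.int64)
large = np.arange(6145, dtype=np.int64)
print(fits_in_shared_memory(small), fits_in_shared_memory(large))

# int32 labels with 12381 distinct classes reproduce the 49524 bytes seen above.
print(requested_shared_bytes(np.arange(12381, dtype=np.int32)))
```

Under these assumptions, the reproducer at the top of the issue (int64 labels drawn from 15000 possible values) would request far more than 49152 bytes once more than 6144 distinct labels are present, which would be consistent with the error appearing only above a certain `n`.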

I'm encountering the same error, CUDADriverError: CUDA_ERROR_INVALID_VALUE: invalid argument, when working with a large dataset (7.5 million). I also tried running UMAP and HDBSCAN on the entire dataset and then calculating the silhouette score on a 5% sample, but the error persists. Any updates @cjnolet?

raina678 avatar Jun 06 '24 18:06 raina678

@adityak74 this is indeed the issue. I was able to work on a fix for this: https://github.com/rapidsai/cuml/pull/5971.

viclafargue avatar Jul 23 '24 12:07 viclafargue
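
For readers who need to validate labels before a fix is available, the membership test that `check_labels(labels, classes)` appears to perform can be approximated on the host. This is only a sketch under that assumption and is not the approach taken in the linked pull request; `check_labels_host` is a hypothetical name, and the check runs in NumPy on the CPU, so it needs no shared memory.

```python
import numpy as np


def check_labels_host(labels, classes):
    """Return True if every entry of `labels` appears in `classes`.

    Hypothetical host-side stand-in for the GPU validation kernel;
    it uses no GPU shared memory, so the class count is not limited by it.
    """
    labels = np.asarray(labels)
    classes = np.asarray(classes)
    return bool(np.isin(labels, classes).all())


rng = np.random.default_rng(0)
labels = rng.choice(np.arange(15000), size=35000, replace=True)
classes = np.unique(labels)

print(check_labels_host(labels, classes))      # all labels are known classes
print(check_labels_host(labels, classes[1:]))  # one class is missing
```

The trade-off is a device-to-host copy if the labels already live on the GPU, so this suits a one-off sanity check more than a hot path.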