cp.unique runs forever
Description
While trying to measure the performance of cp.unique (following #8307), I noticed something very unpleasant: it doesn't return for large arrays.
I expected timings comparable to JAX's:
import numpy as np
import jax.numpy as jnp
N, M = 1_000_000, 10
arr = np.random.randint(0, 2, (N, M), dtype=np.uint8)
gpu_array = jnp.asarray(arr)
>>> %timeit jnp.unique(gpu_array, axis=0).block_until_ready()
28.9 ms ± 598 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
To Reproduce
First, a small array:
import cupy as cp
from cupyx.profiler import benchmark
N, M = 32, 10
arr = cp.random.randint(0, 2, (N, M), dtype=cp.uint8)
>>> benchmark(cp.unique, (arr,), {'axis': 0}, n_repeat=100)
unique : CPU: 9660.600 us +/- 70.012 (min: 9548.863 / max: 9925.441) us GPU-0: 9665.085 us +/- 70.181 (min: 9553.280 / max: 9930.688) us
Next, a bigger array, first benchmarking another function (e.g. cp.sum) to check that it returns:
N, M = 1_000_000, 10
arr = cp.random.randint(0, 2, (N, M), dtype=cp.uint8)
>>> benchmark(cp.sum, (arr,), {'axis': 0}, n_repeat=100)
sum : CPU: 17.986 us +/- 16.365 (min: 11.146 / max: 112.660) us GPU-0: 19225.186 us +/- 29.842 (min: 19187.712 / max: 19329.023) us
A single run of cp.unique at this size does not finish (it was still running after an hour):
>>> benchmark(cp.unique, (arr,), {'axis': 0}, n_repeat=1)
...
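As a possible stopgap for this specific case (every entry is 0 or 1), each row can be packed into a single integer key so that only the 1-D cp.unique path is needed. The sketch below is an assumption-laden illustration, not CuPy's implementation: the helper name unique_binary_rows and its xp parameter are hypothetical, and it has not been timed on a GPU. It defaults to NumPy so it can be checked on the CPU; passing xp=cp with a CuPy array should run the same calls on the device.

```python
import numpy as np


def unique_binary_rows(arr, xp=np):
    """Unique rows of a 2-D array whose entries are all 0 or 1.

    `xp` is the array module: numpy by default, or cupy for GPU arrays.
    Hypothetical helper for illustration only.
    """
    n_cols = arr.shape[1]
    if n_cols > 62:
        raise ValueError("too many columns to pack into one int64 key")

    # Bit position of each column; the first column is the most significant
    # bit, so sorting the keys matches lexicographic row order.
    shifts = xp.arange(n_cols - 1, -1, -1, dtype=xp.int64)
    weights = xp.ones(n_cols, dtype=xp.int64) << shifts

    # One integer key per row (assumes entries are only 0 or 1).
    keys = (arr.astype(xp.int64) * weights).sum(axis=1)

    # 1-D unique, which does not go through the axis code path.
    unique_keys = xp.unique(keys)

    # Unpack each key back into a row of bits.
    rows = (unique_keys[:, None] >> shifts) & 1
    return rows.astype(arr.dtype)
```

With the arrays above, the call would be unique_binary_rows(arr, xp=cp); for binary input the result should match what cp.unique(arr, axis=0) is expected to return.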
Installation
Conda-Forge (conda install ...)
Environment
OS : Linux-6.5.0-1023-oem-x86_64-with-glibc2.35
Python Version : 3.10.14
CuPy Version : 13.1.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.26.4
SciPy Version : None
Cython Build Version : 0.29.37
Cython Runtime Version : None
CUDA Root : /usr/local/cuda
nvcc PATH : /usr/local/cuda/bin/nvcc
CUDA Build Version : 12040
CUDA Driver Version : 12040
CUDA Runtime Version : 12040 (linked to CuPy) / 12040 (locally installed)
cuBLAS Version : (available)
cuFFT Version : 11201
cuRAND Version : 10305
cuSOLVER Version : (11, 6, 1)
cuSPARSE Version : (available)
NVRTC Version : (12, 4)
Thrust Version : 200302
CUB Build Version : 200200
Jitify Build Version : <unknown>
cuDNN Build Version : 8907
cuDNN Version : 8907
NCCL Build Version : 22105
NCCL Runtime Version : 22105
cuTENSOR Version : 20001
cuSPARSELt Build Version : None
Device 0 Name : NVIDIA RTX A500 Laptop GPU
Device 0 Compute Capability : 86
Device 0 PCI Bus ID : 0000:03:00.0
Thanks for the feedback @essoca, confirmed on my side as well. Support for the axis argument in cupy.unique is relatively new (#6886 cc/ @andfoy), and it looks like there is room for improvement, especially when the specified axis of the ndarray is long.
Thanks for the report! As @kmaehashi mentioned, this operation has a ton of room for improvement. I'll take a look at potential optimizations.
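One general direction, sketched here only as an illustration and not as what CuPy does or will do, is to sort the rows lexicographically once and then keep each row that differs from its predecessor. The helper name unique_rows_lexsort and its xp parameter are hypothetical; the code defaults to NumPy so it can be verified on the CPU, and with xp=cp it relies on cupy.lexsort, which takes a 2-D array of keys. No GPU timings are claimed.

```python
import numpy as np


def unique_rows_lexsort(arr, xp=np):
    """Sorted unique rows of a 2-D array via one lexicographic sort.

    `xp` is the array module: numpy by default, or cupy for GPU arrays.
    Hypothetical helper for illustration only.
    """
    # lexsort treats the LAST key as the primary one, so pass the columns
    # in reverse order to make column 0 the primary sort key.
    keys = xp.ascontiguousarray(arr.T[::-1])
    order = xp.lexsort(keys)
    sorted_rows = arr[order]

    # A row starts a new group when it differs from the previous row.
    is_new = xp.ones(arr.shape[0], dtype=bool)
    is_new[1:] = (sorted_rows[1:] != sorted_rows[:-1]).any(axis=1)
    return sorted_rows[is_new]
```

The cost is dominated by a single sort over N rows plus one elementwise comparison, independent of how many distinct rows there are.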