cupy cp.unique runs forever

Description

While trying to measure the performance of cp.unique (following #8307), I noticed something very unpleasant: with axis=0 it does not return for large arrays.

I expected timings comparable to what JAX gives:

import numpy as np
import jax.numpy as jnp

N, M = 1_000_000, 10
arr = np.random.randint(0, 2, (N, M), dtype=np.uint8)
gpu_array = jnp.asarray(arr)

>>> %timeit jnp.unique(gpu_array, axis=0).block_until_ready()
28.9 ms ± 598 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

To Reproduce

First, with a small array:

import cupy as cp
from cupyx.profiler import benchmark

N, M = 32, 10
arr = cp.random.randint(0, 2, (N, M), dtype=cp.uint8)

>>> benchmark(cp.unique, (arr,), {'axis': 0}, n_repeat=100)
unique              :    CPU:  9660.600 us   +/- 70.012 (min:  9548.863 / max:  9925.441) us     GPU-0:  9665.085 us   +/- 70.181 (min:  9553.280 / max:  9930.688) us

Next, a bigger array, benchmarking a different function (here cp.sum) to check that it returns at this size:

N, M = 1_000_000, 10
arr = cp.random.randint(0, 2, (N, M), dtype=cp.uint8)

>>> benchmark(cp.sum, (arr,), {'axis': 0}, n_repeat=100)
sum                 :    CPU:    17.986 us   +/- 16.365 (min:    11.146 / max:   112.660) us     GPU-0: 19225.186 us   +/- 29.842 (min: 19187.712 / max: 19329.023) us

A single run of cp.unique on an array of this size does not finish (it was still running after an hour):

>>> benchmark(cp.unique, (arr,), {'axis': 0}, n_repeat=1)
...
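For comparison, a possible stopgap is to deduplicate rows with a lexicographic sort followed by an adjacent-row comparison. The block below is only a minimal sketch of that technique, not how cp.unique is implemented. The helper name `unique_rows` and its `xp` parameter are hypothetical; it runs on NumPy by default and assumes that passing `xp=cp` works because cp.lexsort accepts a (k, N) key array the way np.lexsort does. It has not been benchmarked against the sizes above.

```python
import numpy as np


def unique_rows(arr, xp=np):
    """Return the sorted unique rows of a 2-D array (hypothetical helper).

    xp is the array module to use: numpy by default, or cupy (assumption:
    cupy.lexsort takes a (k, N) key array like numpy.lexsort).
    """
    # lexsort treats the LAST key as the primary one, so reverse the
    # columns to make column 0 the primary sort key.
    keys = xp.ascontiguousarray(arr.T[::-1])
    order = xp.lexsort(keys)
    sorted_rows = arr[order]

    # After sorting, duplicates are adjacent: keep a row only if it
    # differs from the previous row in at least one column.
    keep = xp.ones(len(sorted_rows), dtype=bool)
    keep[1:] = (sorted_rows[1:] != sorted_rows[:-1]).any(axis=1)
    return sorted_rows[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.integers(0, 2, size=(1000, 10), dtype=np.uint8)
    # On NumPy the result should match np.unique(..., axis=0).
    print(np.array_equal(unique_rows(demo), np.unique(demo, axis=0)))
```

With CuPy installed, the same function would be called as `unique_rows(arr, xp=cp)` on a cupy.ndarray, under the assumption stated above.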

Installation

Conda-Forge (conda install ...)

Environment

OS                           : Linux-6.5.0-1023-oem-x86_64-with-glibc2.35
Python Version               : 3.10.14
CuPy Version                 : 13.1.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.26.4
SciPy Version                : None
Cython Build Version         : 0.29.37
Cython Runtime Version       : None
CUDA Root                    : /usr/local/cuda
nvcc PATH                    : /usr/local/cuda/bin/nvcc
CUDA Build Version           : 12040
CUDA Driver Version          : 12040
CUDA Runtime Version         : 12040 (linked to CuPy) / 12040 (locally installed)
cuBLAS Version               : (available)
cuFFT Version                : 11201
cuRAND Version               : 10305
cuSOLVER Version             : (11, 6, 1)
cuSPARSE Version             : (available)
NVRTC Version                : (12, 4)
Thrust Version               : 200302
CUB Build Version            : 200200
Jitify Build Version         : <unknown>
cuDNN Build Version          : 8907
cuDNN Version                : 8907
NCCL Build Version           : 22105
NCCL Runtime Version         : 22105
cuTENSOR Version             : 20001
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA RTX A500 Laptop GPU
Device 0 Compute Capability  : 86
Device 0 PCI Bus ID          : 0000:03:00.0

Additional Information

No response

May 16 '24 19:05 essoca

Thanks for the feedback @essoca, confirmed on my side as well. Support for the axis argument in cupy.unique is relatively new (#6886 cc/ @andfoy), and it looks like there is room for improvement, especially when the ndarray is long along the specified axis.

May 17 '24 07:05 kmaehashi

Thanks for the report! As @kmaehashi mentioned, this operation has a ton of room for improvement. I'll take a look at potential optimizations.

May 17 '24 12:05 andfoy