[BUG] Dask + UMAP does not work with numpy array.
Describe the bug
When using Dask + UMAP on multiple GPUs, if the input array is a NumPy array rather than a CuPy array, Dask raises the following error:

```
ValueError: could not broadcast input array from shape (7,1) into shape (7,)
```
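For context, the error message itself is a plain NumPy broadcasting failure. A minimal illustration, with shapes chosen to match the traceback (this sketch is unrelated to cuML internals and only shows why the message reads the way it does):

```python
import numpy as np

# Assigning a column vector of shape (7, 1) into a flat buffer of
# shape (7,) is not a valid broadcast, so NumPy raises ValueError.
out = np.zeros(7)          # destination, shape (7,)
col = np.zeros((7, 1))     # source, shape (7, 1), e.g. an n_components=1 chunk

try:
    out[:] = col           # ValueError: could not broadcast input array ...
except ValueError as e:
    print(e)
```

The fix on the assigning side would be to flatten the source first, e.g. `out[:] = col.ravel()`.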
Casting the input array to a CuPy array avoids the error. Reproduction code below:
```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask.array as da
from cuml.manifold import UMAP
from cuml.dask.manifold import UMAP as MNMG_UMAP
import numpy as np
import cupy

if __name__ == "__main__":
    cluster = LocalCUDACluster(n_workers=2)
    client = Client(cluster)

    X = np.zeros((100, 10, 49), dtype=np.float32).reshape(100, -1)
    # X = cupy.asarray(X)
    print(X.shape, type(X))

    local_model = UMAP(random_state=10, n_components=1)
    val = local_model.fit_transform(X)

    distributed_model = MNMG_UMAP(model=local_model)
    distributed_X = da.from_array(X, chunks=(7, -1))
    embedding = distributed_model.transform(distributed_X)
    result = embedding.compute()

    client.close()
    cluster.close()
```
If I uncomment the `X = cupy.asarray(X)` line, the script runs without error.
- Environment location: Docker
- Linux Distro/Architecture: Ubuntu 20.04 amd64, kernel version=5.4.0-171-generic
- GPU Model/Driver: 4 * RTX 3090, 550.76
- CUDA: 12.2
- Method of cuDF & cuML install: conda, with `conda create -n rapids-24.04 -c rapidsai -c conda-forge -c nvidia rapids=24.04 python=3.11 cuda-version=12.2 h5py matplotlib`
Thanks for the issue @nahaharo, this is useful feedback. It's in the backlog and we plan to add support in the future, but there is no ETA currently.