
[BUG] cuml.dask.cluster.KMeans.labels_ gives only single partition results

Open • VibhuJawa opened this issue 1 year ago • 1 comment

Describe the bug

Currently cuml.dask.cluster.KMeans.labels_ returns the labels for only a single partition instead of the labels for all partitions.

Steps/Code to reproduce bug

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask
from cuml.dask.cluster import KMeans
import dask.array as da


cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="1,2,3,4")    
client = Client(cluster)

cupy_darr = da.random.random((10000, 100), chunks=(2500, 100)).to_backend("cupy")
cupy_darr.compute_chunk_sizes() 
kmeans = KMeans(n_clusters=100, init_max_iter=1000, oversampling_factor=10)
dist_to_cents = kmeans.fit_transform(cupy_darr)
labels_found = kmeans.labels_
expected_labels = kmeans.predict(cupy_darr).compute()
print(len(kmeans.labels_), len(kmeans.predict(cupy_darr).compute()))
assert len(labels_found)==len(expected_labels)

2500 10000
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[7], line 8
      6 expected_labels = kmeans.predict(cupy_darr).compute()
      7 print(len(kmeans.labels_), len(kmeans.predict(cupy_darr).compute()))
----> 8 assert len(labels_found)==len(expected_labels)

AssertionError: 

Expected behavior

I would expect the labels_ results to line up with the output of predict, with one label per input row across all partitions.

Environment details (please complete the following information):

  • Environment location: [Bare-metal]
  • Linux Distro/Architecture: [Ubuntu 16.04 amd64]
  • GPU Model/Driver: [V100 and driver 396.44]
  • CUDA: [12.0]

Additional context

CC: @dantegd

VibhuJawa, Jun 05 '24 07:06

The issue is pretty simple: we currently return the attributes of the local worker models. That works great for most attributes of most estimators, but for labels_ it means that only a subset of the labels is returned (from the first worker, if I'm not mistaken). It should be pretty easy to build a dask array from the labels of all the local models.
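
A rough sketch of that idea, for illustration only: it is not cuML's implementation, and it runs on the CPU with NumPy chunks standing in for the per-worker cupy arrays. The helper names `assign_labels` and `gather_labels` are hypothetical, and the array sizes are much smaller than in the reproducer to keep the example light. It computes labels for each partition separately, wraps each result as a dask array of known length, and concatenates them in partition order.

```python
import dask
import dask.array as da
import numpy as np


def assign_labels(chunk, centroids):
    """Return the index of the nearest centroid for each row of `chunk`."""
    # Squared Euclidean distance from every row to every centroid.
    dists = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1).astype(np.int32)


def gather_labels(X, centroids):
    """Build one lazy dask array of labels spanning every partition of `X`.

    Assumes `X` is chunked along rows only (a single chunk along columns).
    """
    row_chunks = X.chunks[0]        # number of rows in each partition
    parts = X.to_delayed().ravel()  # one delayed block per partition
    label_parts = [
        da.from_delayed(
            dask.delayed(assign_labels)(part, centroids),
            shape=(n_rows,),
            dtype=np.int32,
        )
        for part, n_rows in zip(parts, row_chunks)
    ]
    # Concatenate in partition order so labels line up with the input rows.
    return da.concatenate(label_parts, axis=0)


rng = np.random.default_rng(0)
data = rng.random((1000, 8))        # stand-in for the input data
centroids = rng.random((5, 8))      # stand-in for fitted cluster centers
X = da.from_array(data, chunks=(250, 8))

labels = gather_labels(X, centroids)
print(len(labels), len(data))
```

Here the length of `labels` matches the number of input rows rather than the size of one partition, which is the behavior the report expects from labels_.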

dantegd, Jun 05 '24 17:06