[BUG] cuml.dask.cluster.KMeans.labels_ gives only single partition results
Describe the bug
Currently cuml.dask.cluster.KMeans.labels_ returns the labels of only a single partition instead of the labels across all partitions.
Steps/Code to reproduce bug
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask
from cuml.dask.cluster import KMeans
import dask.array as da
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="1,2,3,4")
client = Client(cluster)
cupy_darr = da.random.random((10000, 100), chunks=(2500, 100)).to_backend("cupy")
cupy_darr.compute_chunk_sizes()
kmeans = KMeans(n_clusters=100, init_max_iter=1000, oversampling_factor=10)
dist_to_cents = kmeans.fit_transform(cupy_darr)
labels_found = kmeans.labels_
expected_labels = kmeans.predict(cupy_darr).compute()
print(len(kmeans.labels_), len(kmeans.predict(cupy_darr).compute()))
assert len(labels_found)==len(expected_labels)
2500 10000
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[7], line 8
6 expected_labels = kmeans.predict(cupy_darr).compute()
7 print(len(kmeans.labels_), len(kmeans.predict(cupy_darr).compute()))
----> 8 assert len(labels_found)==len(expected_labels)
AssertionError:
Expected behavior
I would expect the labels_ results to line up with the output of predict, i.e. one label per input row across all partitions.
Environment details:
- Environment location: [Bare-metal]
- Linux Distro/Architecture: [Ubuntu 16.04 amd64]
- GPU Model/Driver: [V100 and driver 396.44]
- CUDA: [12.0]
Additional context
CC: @dantegd
The issue is pretty simple: we currently return the attributes of the local worker models, which works great for most attributes of most estimators. For labels_, however, this means only a subset of the labels is returned (from the first worker, if I'm not mistaken). It should be pretty easy to build a dask array from the labels of all the local models.
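
A rough sketch of that idea, not the actual cuML fix: the helper name assemble_labels and the stand-in per-worker label blocks are hypothetical, and NumPy is used in place of CuPy so the snippet runs without a GPU. It assumes each local model can hand back its labels as a delayed object with a known length, in partition order.

```python
import dask
import dask.array as da
import numpy as np


def assemble_labels(per_partition_labels, chunk_sizes, dtype=np.int32):
    """Stitch per-partition label blocks into one lazy dask array.

    Hypothetical helper: `per_partition_labels` is a list of delayed
    objects (one per local model, in partition order) and `chunk_sizes`
    gives the number of rows each one covers.
    """
    blocks = [
        da.from_delayed(part, shape=(n,), dtype=dtype)
        for part, n in zip(per_partition_labels, chunk_sizes)
    ]
    # Concatenating in partition order keeps labels aligned with input rows.
    return da.concatenate(blocks)


# Stand-in for four workers, each holding labels for a 2500-row chunk.
# Every "worker" simply labels its rows with its own index here.
chunk_sizes = [2500, 2500, 2500, 2500]
local_labels = [
    dask.delayed(np.full)(n, worker_id, dtype=np.int32)
    for worker_id, n in enumerate(chunk_sizes)
]

labels = assemble_labels(local_labels, chunk_sizes)
print(labels.shape)
```

With the labels assembled this way, the length matches the 10000 input rows from the reproducer rather than the 2500 rows of a single chunk. Until something like this lands, kmeans.predict(cupy_darr) from the reproducer above already returns labels for every partition.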