[BUG] Multi GPU KMeans memory usage is 2x larger than expected.
Describe the bug
The expected GPU memory usage of cuML's k-means algorithm is `(n_rows * n_cols + n_clusters * n_cols) * sizeof(MathT)`, where `(n_rows, n_cols)` is the shape of the input matrix. In practice `n_clusters` is much smaller than `n_rows`, so the memory usage is expected to be only slightly larger than the input data size.
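For the dataset used in the snippet below, a back-of-the-envelope calculation (illustrative only, using the shapes and `n_clusters` from the example) gives:

```python
# Expected memory for a 4,000,000 x 250 float32 dataset with 10,000 clusters.
n_rows, n_cols, n_clusters, itemsize = 4_000_000, 250, 10_000, 4  # float32
data_bytes = n_rows * n_cols * itemsize          # 4.0e9 bytes -> ~4.0 GB
centroid_bytes = n_clusters * n_cols * itemsize  # 1.0e7 bytes -> ~0.01 GB
print((data_bytes + centroid_bytes) / 1e9)       # ~4.01 GB expected peak
```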
Here is a code snippet to demonstrate the memory usage of single-GPU k-means:
```python
import rmm

# Wrap the default device resource in a StatisticsResourceAdaptor to track peak usage
upstream = rmm.mr.get_current_device_resource()
mr = rmm.mr.StatisticsResourceAdaptor(upstream)
rmm.mr.set_current_device_resource(mr)

# Route CuPy allocations through RMM so they are counted as well
from rmm.allocators.cupy import rmm_cupy_allocator
import cupy as cp
cp.cuda.set_allocator(rmm_cupy_allocator)

from cuml.cluster import KMeans

dataset = cp.random.uniform(size=(4000000, 250), dtype=cp.float32)
print("Input dataset {}x{} {:6.1f} GB".format(
    dataset.shape[0], dataset.shape[1],
    dataset.size * dataset.dtype.itemsize / 1e9))
print("Peak memory usage after allocating input data:", mr.allocation_counts["peak_bytes"] / 1e9, "GB")

kmeans = KMeans(n_clusters=10000, max_iter=4, init='random')
kmeans.fit(dataset)
print("Peak memory after KMeans fit:", mr.allocation_counts["peak_bytes"] / 1e9, "GB")
```
Expected output:

```
Input dataset 4000000x250 4.0 GB
Peak memory usage after allocating input data: 4.0 GB
Peak memory after KMeans fit: 4.142093072 GB
```
In contrast, when using `cuml.dask.cluster.KMeans`, the memory usage is twice as large.
Steps/Code to reproduce bug
```python
# dask_experiment.py
import dask
import dask.array as da
from dask import config as cfg
cfg.set({'distributed.scheduler.worker-ttl': None})

import cupy as cp
import numpy as np
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from raft_dask.common import Comms
from cuml.dask.cluster import KMeans

if __name__ == "__main__":
    n_gpus = 1
    cluster = LocalCUDACluster(n_workers=n_gpus, memory_limit=0)
    client = Client(cluster)
    comms = Comms(client=client)
    comms.init()

    n_rows = 4000000 * n_gpus
    n_cols = 250
    # One chunk per GPU
    dataset = da.random.random((n_rows, n_cols), chunks=(n_rows // n_gpus, n_cols)).astype(np.float32)

    #@dask.delayed
    def to_gpu(x):
        return cp.asarray(x)

    # Move each chunk to the GPU and materialize it
    dataset_gpu = da.map_blocks(to_gpu, dataset, dtype=cp.float32)
    dataset_gpu = dataset_gpu.persist()
    wait(dataset_gpu)

    print("Input dataset {}x{} {:6.1f} GB".format(
        dataset_gpu.shape[0], dataset_gpu.shape[1],
        dataset_gpu.size * dataset_gpu.dtype.itemsize / 1e9))

    print("starting clustering")
    kmeans = KMeans(n_clusters=1000, max_iter=4, init='random')
    kmeans.fit(dataset_gpu)

    comms.destroy()
    client.close()
    cluster.close()
```
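As an additional cross-check (not part of the original reproduction), per-worker peak allocations could also be tracked through RMM, mirroring the single-GPU example above. A minimal sketch, assuming the same `StatisticsResourceAdaptor` API; the helper names are hypothetical:

```python
# Sketch: track per-worker peak RMM allocations (hypothetical helpers, meant to
# be run on each worker via client.run before/after the experiment above).
import rmm
import cupy as cp
from rmm.allocators.cupy import rmm_cupy_allocator

def enable_tracking():
    # Wrap the worker's current device resource so allocations are counted,
    # and route CuPy through RMM as in the single-GPU snippet.
    upstream = rmm.mr.get_current_device_resource()
    rmm.mr.set_current_device_resource(rmm.mr.StatisticsResourceAdaptor(upstream))
    cp.cuda.set_allocator(rmm_cupy_allocator)

def peak_gb():
    mr = rmm.mr.get_current_device_resource()
    return mr.allocation_counts["peak_bytes"] / 1e9

# client.run(enable_tracking)   # before creating/persisting dataset_gpu
# ...
# print(client.run(peak_gb))    # per-worker peak usage in GB after fit
```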
We will use `nvidia-smi` to monitor memory usage:

```shell
nvidia-smi -i 0 --query-gpu=index,timestamp,memory.used,memory.total --format=csv -l 1 & bkg_pid=$!; python dask_experiment.py; kill $bkg_pid
```
Output:

```
[2] 450782
index, timestamp, memory.used [MiB], memory.total [MiB]
0, 2024/06/14 15:20:23.433, 0 MiB, 16384 MiB
...
0, 2024/06/14 15:20:41.438, 985 MiB, 16384 MiB
0, 2024/06/14 15:20:42.439, 4801 MiB, 16384 MiB
0, 2024/06/14 15:20:43.439, 4801 MiB, 16384 MiB
Input dataset 4000000x250 4.0 GB
starting clustering
0, 2024/06/14 15:20:44.439, 8883 MiB, 16384 MiB
1000000000
0, 2024/06/14 15:20:45.439, 8981 MiB, 16384 MiB
```
We can see from the output that after allocating the 4.0 GB input array, memory usage was 4801 MiB, and it went up to 8883 MiB once clustering started. This is not expected.
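To make the gap concrete (illustrative arithmetic only, using the numbers reported above):

```python
# Rough bookkeeping of the observed numbers, in MiB (illustration only).
before_fit = 4801                      # after persisting the ~4 GB input chunk
peak_during_fit = 8883                 # peak observed while KMeans.fit runs
extra = peak_during_fit - before_fit   # ~4082 MiB
print(extra)                           # on the order of another copy of the input chunk
```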
Expected behavior
The multi-GPU k-means implementation is expected to have memory usage similar to the single-GPU one: it should only need slightly more space than the local chunk of the input data.
Environment details (please complete the following information):
Checked with RAPIDS 24.04 and 24.06 on various GPUs (V100, A100, A30).