
[BUG] Multi GPU KMeans memory usage is 2x larger than expected.

Open tfeher opened this issue 8 months ago • 19 comments

Describe the bug

The expected GPU memory usage of cuML's kmeans algorithm is (n_rows * n_cols + n_clusters * n_cols) * sizeof(MathT), where (n_rows, n_cols) is the shape of the input matrix. In practice n_clusters is much smaller than n_rows, so the memory usage is expected to be only slightly larger than the input data itself.
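
For the dataset and cluster count used below (4,000,000 x 250 float32, 10,000 clusters), a quick back-of-the-envelope check of that formula gives about 4.01 GB, i.e. the input data plus a small centroid buffer:

n_rows, n_cols, n_clusters = 4_000_000, 250, 10_000
itemsize = 4  # sizeof(float32)
expected_bytes = (n_rows * n_cols + n_clusters * n_cols) * itemsize
print(expected_bytes / 1e9)  # ~4.01 GB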

Here is code to demonstrate the memory usage of single-GPU k-means:

import rmm

# Wrap the default device resource so peak allocation statistics can be queried
upstream = rmm.mr.get_current_device_resource()
mr = rmm.mr.StatisticsResourceAdaptor(upstream)
rmm.mr.set_current_device_resource(mr)

# Route CuPy allocations through RMM so they are tracked by the adaptor
from rmm.allocators.cupy import rmm_cupy_allocator
import cupy as cp
cp.cuda.set_allocator(rmm_cupy_allocator)

from cuml.cluster import KMeans

dataset = cp.random.uniform(size=(4000000, 250), dtype=cp.float32)
print("Input dataset {}x{} {:6.1f} GB".format(dataset.shape[0], dataset.shape[1], dataset.size * dataset.dtype.itemsize / 1e9))
print("Peak memory usage after allocating input data:", mr.allocation_counts["peak_bytes"] / 1e9, "GB")

kmeans = KMeans(n_clusters=10000, max_iter=4, init='random')
kmeans.fit(dataset)
print("Peak memory after KMeans fit:", mr.allocation_counts["peak_bytes"] / 1e9, "GB")

Expected output

Input dataset 4000000x250    4.0 GB
Peak memory usage after allocating input data: 4.0 GB
Peak memory after KMeans fit: 4.142093072 GB

In contrast, when using cuml.dask.cluster.KMeans the memory usage is twice as large.

Steps/Code to reproduce bug

# dask_experiment.py

import dask.array as da
from dask import config as cfg
cfg.set({'distributed.scheduler.worker-ttl': None})

import cupy as cp
import numpy as np
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from raft_dask.common import Comms
from cuml.dask.cluster import KMeans


if __name__ == "__main__":
    n_gpus = 1
    cluster = LocalCUDACluster(n_workers=n_gpus, memory_limit=0)
    client = Client(cluster)
    comms = Comms(client=client)
    comms.init()

    n_rows = 4000000 * n_gpus
    n_cols = 250

    # One chunk per GPU
    dataset = da.random.random((n_rows, n_cols), chunks=(n_rows // n_gpus, n_cols)).astype(np.float32)

    # Move each chunk to the GPU
    def to_gpu(x):
        return cp.asarray(x)

    dataset_gpu = da.map_blocks(to_gpu, dataset, dtype=cp.float32)
    dataset_gpu = dataset_gpu.persist()
    wait(dataset_gpu)

    print("Input dataset {}x{} {:6.1f} GB".format(dataset_gpu.shape[0], dataset_gpu.shape[1], dataset_gpu.size * dataset_gpu.dtype.itemsize / 1e9))

    print("starting clustering")
    kmeans = KMeans(n_clusters=1000, max_iter=4, init='random')
    kmeans.fit(dataset_gpu)

    comms.destroy()
    client.close()
    cluster.close()

We will use nvidia-smi to monitor memory usage while the script runs:

nvidia-smi -i 0 --query-gpu=index,timestamp,memory.used,memory.total --format=csv -l 1 & bkg_pid=$!; python dask_experiment.py; kill $bkg_pid

Output

[2] 450782
index, timestamp, memory.used [MiB], memory.total [MiB]
0, 2024/06/14 15:20:23.433, 0 MiB, 16384 MiB
...
0, 2024/06/14 15:20:41.438, 985 MiB, 16384 MiB
0, 2024/06/14 15:20:42.439, 4801 MiB, 16384 MiB
0, 2024/06/14 15:20:43.439, 4801 MiB, 16384 MiB
Input dataset 4000000x250    4.0 GB
starting clustering
0, 2024/06/14 15:20:44.439, 8883 MiB, 16384 MiB
1000000000
0, 2024/06/14 15:20:45.439, 8981 MiB, 16384 MiB

We can see from the output that after allocating the 4.0 GB input array, memory usage was 4801 MiB, and it jumped to 8883 MiB once clustering started. This is not expected.
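
A quick check of those readings (simple arithmetic, not part of the original output) shows that the jump during fit is roughly the size of one additional copy of the local input chunk:

baseline_mib = 4801                        # after persisting the input chunk
peak_mib = 8883                            # shortly after kmeans.fit() starts
chunk_gb = 4000000 * 250 * 4 / 1e9         # 4.0 GB float32 input chunk
extra_gb = (peak_mib - baseline_mib) * 2**20 / 1e9
print(chunk_gb, extra_gb)                  # 4.0 GB chunk vs. ~4.3 GB extra during fit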

Expected behavior It is expected that the multi-GPU kmeans implementation has similar memory usage to the single-GPU one: it should only need slightly more space than the local chunk of the input data.
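
For a more precise per-worker measurement than sampling nvidia-smi once per second, the RMM statistics adaptor from the single-GPU snippet could in principle be installed on each Dask worker with client.run. This is only a sketch, not part of the original report, and it assumes the workers' CuPy and cuML allocations are routed through RMM:

import rmm
import cupy as cp
from rmm.allocators.cupy import rmm_cupy_allocator

def enable_rmm_stats():
    # Wrap the worker's current RMM resource so peak usage can be queried later
    upstream = rmm.mr.get_current_device_resource()
    rmm.mr.set_current_device_resource(rmm.mr.StatisticsResourceAdaptor(upstream))
    cp.cuda.set_allocator(rmm_cupy_allocator)

def peak_gb():
    return rmm.mr.get_current_device_resource().allocation_counts["peak_bytes"] / 1e9

# client.run(enable_rmm_stats)   # call right after creating the Client, before persisting data
# ...
# print(client.run(peak_gb))     # query the per-worker peaks after kmeans.fit()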

Environment details (please complete the following information): Checked with RAPIDS 24.04 and 24.06 on various GPUs (V100, A100, A30).

tfeher · Jun 14 '24 22:06