[FEA] Reduce memory pressure in HDBSCAN all_points_membership_vectors (or provide/link to best practices)
HDBSCAN now supports soft clustering with all_points_membership_vectors. This functionality is fast enough to be practical on large datasets. However, memory pressure can quickly become an issue with some combinations of HDBSCAN parameters (including the defaults) on large datasets in which HDBSCAN identifies many clusters.
To the extent that we can reduce this memory pressure, we should explore it. GPU performance is so far ahead of the CPU implementation that it may be worth trading some speed for a better user experience. We could also document this behavior and best practices, or link to existing guides if they exist.
To illustrate with a real-world example, I took the Million News Headlines dataset from Kaggle and encoded each document into an embedding space using the all-MiniLM-L6-v2 transformer model. I then selected the first 200,000 documents and used cuML's UMAP to reduce the dimensionality to 5. This derived dataset is available here (NVIDIA).
In this example, memory spikes to about 40 GB on my 48GB RTX 8000 GPU. Scaling to larger datasets becomes infeasible.
import cuml
import numpy as np
import pandas as pd
X = np.load("million_news_articles_embeddings_reduced_200000_5.npy")
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    prediction_data=True
)
clusterer.fit(X)
soft_cluster = cuml.cluster.hdbscan.all_points_membership_vectors(clusterer)
pd.Series(clusterer.labels_).nunique()
5095
I can reduce the pressure to 5 GB by changing the min_samples parameter to 20 (from the default of 5):
import cuml
import numpy as np
import pandas as pd
X = np.load("million_news_articles_embeddings_reduced_200000_5.npy")
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_samples=20,
    prediction_data=True,
)
clusterer.fit(X)
soft_cluster = cuml.cluster.hdbscan.all_points_membership_vectors(clusterer)
pd.Series(clusterer.labels_).nunique()
1548
I observe this behavior in the current 22.10 nightly as well as when I build from source and include https://github.com/rapidsai/cuml/pull/4872.
Reproducing this behavior with synthetic data may require intentionally creating many low standard deviation clusters. With the default configuration, I observe memory spikes of up to about 20GB.
from sklearn.datasets import make_blobs
import cuml
import numpy as np
import pandas as pd
X, y = make_blobs(
    n_samples=200000,
    centers=2000,
    n_features=5,
    cluster_std=0.1,
    random_state=0
)
clusterer = cuml.cluster.hdbscan.HDBSCAN(prediction_data=True)
clusterer.fit(X)
soft_cluster = cuml.cluster.hdbscan.all_points_membership_vectors(clusterer)
pd.Series(clusterer.labels_).nunique()
1999
As a note, the memory pressure is not observed during fit, regardless of whether the PredictionData is being generated.
import cuml
import numpy as np
import pandas as pd
X = np.load("million_news_articles_embeddings_reduced_200000_5.npy")
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    prediction_data=True
)
clusterer.fit(X)
I think there might be two culprits here:
- Allocating the membership vectors themselves: for topic modeling tasks, we have observed a large number of clusters, and allocating n_samples * n_clusters on the device can therefore be memory intensive. One possible future fix is to add a batch_size hyperparameter to all_points_membership_vectors and compute the membership vectors in batches.
- A lot of clusters also means a lot of exemplars, and the pairwise distance matrix of shape n_samples * n_exemplars, which is used for computing distance memberships, can also be very large. Again, using a batch_size might fix the issue (a rough sketch of this batching follows below).
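A minimal sketch of that batching idea, purely for illustration: it uses CuPy rather than cuML's internal CUDA code, and the exemplars and exemplar_cluster arrays are hypothetical stand-ins for the exemplar points and their cluster labels stored in the prediction data.
import cupy as cp

def batched_min_exemplar_distance(X, exemplars, exemplar_cluster, n_clusters, batch_size=4096):
    # For each point, compute its minimum distance to the exemplars of every cluster,
    # processing rows in batches so that only a (batch_size, n_exemplars) distance
    # block is resident on the GPU instead of the full (n_samples, n_exemplars) matrix.
    n_samples = X.shape[0]
    out = cp.empty((n_samples, n_clusters), dtype=cp.float32)
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        batch = X[start:stop]
        # squared Euclidean distances via the ||a||^2 + ||b||^2 - 2ab expansion
        d2 = (cp.sum(batch ** 2, axis=1)[:, None]
              + cp.sum(exemplars ** 2, axis=1)[None, :]
              - 2.0 * batch @ exemplars.T)
        dist = cp.sqrt(cp.maximum(d2, 0.0))
        for c in range(n_clusters):
            out[start:stop, c] = dist[:, exemplar_cluster == c].min(axis=1)
    return out
The n_samples x n_clusters output still has to fit somewhere, but the much larger n_samples x n_exemplars intermediate never exists all at once.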
Thanks for implementing this feature!
Has anyone had issues using all_points_membership_vectors with a large dataset? (for me, original embeddings are 235002 x 384). It causes my Python kernel to fail. Apologies if this isn't enough detail.
Originally posted by @hatemr in https://github.com/rapidsai/cuml/issues/4724#issuecomment-1370284884
@hatemr can you provide the error you are getting when the Python kernel fails and describe your environment (type of GPU, etc...)?
Originally posted by @cjnolet in https://github.com/rapidsai/cuml/issues/4724#issuecomment-1370288634
g4dn.8xlarge
Originally posted by @hatemr in https://github.com/rapidsai/cuml/issues/4724#issuecomment-1370295491
We've observed that there can be challenges with memory pressure at scale, usually if there are a large number of clusters. @hatemr , could you also share the HDBSCAN configuration you're using?
If you're able to use larger values for min_samples, the workload will require less memory.
HDBSCAN(min_cluster_size=20,
metric='euclidean',
cluster_selection_method='eom',
prediction_data=True)
Thanks, I'll try adjusting min_samples
Increasing min_samples worked! I had to increase min_samples itself - increasing min_cluster_size while keeping min_samples=None didn't work.
In summary, it seems that as you increase your training set size, to avoid crashes you should increase min_samples (and maybe min_cluster_size) too.
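A minimal sketch of the adjusted configuration that summary suggests; the specific values are illustrative rather than tuned recommendations:
import cuml

# Illustrative values only: raising min_samples (and optionally min_cluster_size)
# tends to yield fewer clusters and therefore a smaller soft-clustering footprint.
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_samples=50,
    min_cluster_size=100,
    prediction_data=True,
)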
Another user ran into this during topic modeling (cross-posting for convenience in addition to the cross-link).
I see you're using calculate_probabilities=True. The difference is likely that in BERTopic v0.13 the HDBSCAN soft probabilities are being calculated with cuML's HDBSCAN, whereas in BERTopic v0.12 they were being calculated with the CPU HDBSCAN (or skipped if you passed cuML's HDBSCAN). This is a good thing in general, as calculating the soft probabilities on 2 million documents with CPU HDBSCAN would likely be very slow (see this blog for some loose benchmarks in which it took 15+ hours on a CPU to process 400K documents vs. 1-2 seconds on a GPU, though results will vary across CPU and GPU types).
Computing the soft probabilities with cuML's HDBSCAN can have significant memory requirements if the number of clusters is very large. We have an open issue to address this, and it includes some recommendations for working around the memory pressure in the meantime by changing the HDBSCAN configuration. If you prefer to use a very small minimum cluster size, you may need to use the CPU version for now (but you can still use cuML's UMAP).
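For reference, a minimal sketch of that mixed setup, assuming the CPU hdbscan package is installed and using random data as a stand-in for real document embeddings:
import numpy as np
import cuml
import hdbscan  # CPU package

embeddings = np.random.rand(10_000, 384).astype("float32")  # stand-in for real embeddings

reduced = cuml.manifold.UMAP(n_components=5).fit_transform(embeddings)  # GPU UMAP

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)  # CPU HDBSCAN
clusterer.fit(reduced)
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)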
I'm going to cross-link https://github.com/MaartenGr/BERTopic/issues/922 and recommend we take further discussion to that issue. Since this issue is now a duplicate of #4879 from a cuML perspective, I'm going to close it. But please feel free to continue the discussion in the BERTopic issue or in #4879
Originally posted by @beckernick in https://github.com/rapidsai/cuml/issues/5127#issuecomment-1380440159
I've switched to 23.04 and tried using HDBSCAN with batch_size set to various values, but I still quickly run out of memory. Before I investigate this further, I was wondering whether you have any benchmarks on the memory savings achieved through batching, and maybe a general (very coarse) rule of thumb for estimating how much memory might be needed given the input size and parameters?
Could you share the code you're using (and the number of unique clusters HDBSCAN finds in your data)?
The batching is done for computing the pairwise distance matrix between the data and the exemplar points for each cluster. This is the most memory-intensive step, but other steps may still peak at higher memory than available on a given GPU depending on the data. In sample tests, we've seen fairly significant memory reductions, though ultimately it will depend on your data size and the number of clusters.
@tarang-jain , perhaps we could add some documentation about memory best practices?
# %load_ext gpu_memory_profiler
import cuml
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(
    n_samples=200000,
    centers=500,
    n_features=5,
    random_state=12
)
X = X.astype("float32")
clusterer = cuml.cluster.HDBSCAN(prediction_data=True, min_samples=5).fit(X)
print(len(np.unique(clusterer.labels_)))
%gpu_memit cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=len(X)) # simulating v23.02 behavior
%gpu_memit cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=4096) # new default
339
Peak GPU memory: 8101.00 MiB
Peak GPU memory: 2169.00 MiB
Regardless of the batch_size, the GPU memory must be large enough to store the full set of membership vectors (n_rows * n_clusters * sizeof(float)), in addition to some cached data that we use to compute the soft cluster memberships. We do not have a rule of thumb as of now, but in our test runs a batch_size of 4096 reduced memory usage while maintaining high volatile GPU utilization. I would suggest starting with a small batch size of 32-64 and gradually increasing it to get the best performance.
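As a rough illustration of that floor, using the examples above (this only accounts for the output matrix, not the cached prediction data or temporary buffers):
def membership_matrix_gib(n_rows, n_clusters, bytes_per_value=4):
    # lower bound: the n_rows x n_clusters float32 output alone
    return n_rows * n_clusters * bytes_per_value / 2**30

print(membership_matrix_gib(200_000, 339))    # synthetic example above: ~0.25 GiB
print(membership_matrix_gib(200_000, 5_095))  # news headlines example: ~3.8 GiB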
Hi, I'm looking into this now, but I can't seem to find that gpu_memory_profiler extension you're using for measuring gpu memory usage - can someone point me to it? @beckernick
Hi @kuchenrolle , would you be able to share some details about your use case and data? We've improved the memory quite a bit using the default batch size of 4096. Do you run into issues at every batch size?
The extension isn't available as a package at the moment, but we can evaluate the feasibility of releasing it. It's essentially just measuring the output of nvidia-smi, so you should be able to visually evaluate the peak memory during your workflow with something like: watch -n 0.25 "nvidia-smi -i 0" (for GPU 0).
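For a programmatic reading instead of watching nvidia-smi, a minimal sketch using the pynvml bindings (assuming pynvml is installed); polling this in a loop or background thread approximates the peak:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20
print(f"GPU memory in use: {used_mib:.0f} MiB")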
I was getting some weird results when using the Python NVML bindings and when running things in a separate process (to avoid the kernel crashing when I run out of memory), but I've worked around that now. The problem is that my dataset is a good bit larger (a bit over 2 million datapoints with 25-100 features), so even with fairly small batches I run out of memory quickly. This is the code I'm using:
import cuml
import multiprocessing as mp
import os
import subprocess
import threading
import time
import torch
from sklearn.datasets import make_blobs
class GPUMemoryProfiler:
    def __init__(self, gpu=0):
        self.gpu = gpu
        self.running = False

    def monitor_memory(self):
        self.peak_memory_usage = gpu_memory_used(self.gpu)
        while self.running:
            current_memory = gpu_memory_used(self.gpu)
            self.peak_memory_usage = max(self.peak_memory_usage, current_memory)
            time.sleep(0.2)

    def __enter__(self):
        self.running = True
        self.initial_memory_usage = gpu_memory_used(self.gpu)
        self.memory_thread = threading.Thread(target=self.monitor_memory)
        self.memory_thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.running = False
        self.memory_thread.join()

def gpu_memory_used(gpu: int = 0):
    command = ['nvidia-smi', f'--id={gpu}', '--query-gpu=memory.used', '--format=csv']
    memory_used_info = subprocess.check_output(command).decode('ascii').split('\n')[-2]
    return int(memory_used_info.split()[0])

def hdbscan_run(n_samples=1_000_000, centers=500, batch_size=4_096):
    X, y = make_blobs(
        n_samples=n_samples,
        centers=centers,
        n_features=50,
        random_state=2311
    )
    X = X.astype("float32")
    clusterer = cuml.cluster.HDBSCAN(prediction_data=True, min_samples=5).fit(X)
    with GPUMemoryProfiler() as batching:
        cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=batch_size)
    print(f"Peak memory for {n_samples} samples from {centers} centers with batch size {batch_size} was: {batching.peak_memory_usage}")

def hdbscan_profile():
    mp.set_start_method("spawn")
    for n_samples in (200_000, 500_000, 1_000_000, 2_000_000):
        for centers in (50, 100, 500, 1_000, 2_000, 5_000):
            for batch_size in (10, 4_096, n_samples):
                p = run_in_subprocess_on_gpu(hdbscan_run,
                                             gpu=0,
                                             n_samples=n_samples,
                                             centers=centers,
                                             batch_size=batch_size)
                p.join()
                torch.cuda.empty_cache()

def run_in_subprocess_on_gpu(fun, gpu, *args, **kwargs):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    process = mp.Process(target=fun, args=args, kwargs=kwargs)
    process.start()
    return process
And these are the results I get:
| n_samples | centers | batch_size | memory (MB) |
|---|---|---|---|
| 200000 | 50 | 10 | 1104 |
| 200000 | 50 | 4096 | 910 |
| 200000 | 50 | 200000 | 910 |
| 200000 | 100 | 10 | 1066 |
| 200000 | 100 | 4096 | 1144 |
| 200000 | 100 | 200000 | 988 |
| 200000 | 500 | 10 | 1674 |
| 200000 | 500 | 4096 | 1722 |
| 200000 | 500 | 200000 | 2056 |
| 200000 | 1000 | 10 | 2438 |
| 200000 | 1000 | 4096 | 2438 |
| 200000 | 1000 | 200000 | 3316 |
| 200000 | 2000 | 10 | 3966 |
| 200000 | 2000 | 4096 | 4250 |
| 200000 | 2000 | 200000 | 12196 |
| 200000 | 5000 | 10 | 8548 |
| 200000 | 5000 | 4096 | 8952 |
| 200000 | 5000 | 200000 | 28214 |
| 500000 | 50 | 10 | 1172 |
| 500000 | 50 | 4096 | 1184 |
| 500000 | 50 | 500000 | 1188 |
| 500000 | 100 | 10 | 1364 |
| 500000 | 100 | 4096 | 1384 |
| 500000 | 100 | 500000 | 1170 |
| 500000 | 500 | 10 | 2888 |
| 500000 | 500 | 4096 | 3062 |
| 500000 | 500 | 500000 | 10056 |
| 500000 | 1000 | 10 | 4796 |
| 500000 | 1000 | 4096 | 4896 |
| 500000 | 1000 | 500000 | 17068 |
| 500000 | 2000 | 10 | 8616 |
| 500000 | 2000 | 4096 | 8796 |
| 500000 | 2000 | 500000 | 30630 |
| 500000 | 5000 | 10 | nan |
| 500000 | 5000 | 4096 | nan |
| 500000 | 5000 | 500000 | nan |
| 1000000 | 50 | 10 | 1490 |
| 1000000 | 50 | 4096 | 1490 |
| 1000000 | 50 | 1000000 | 1608 |
| 1000000 | 100 | 10 | 1870 |
| 1000000 | 100 | 4096 | 1898 |
| 1000000 | 100 | 1000000 | 8322 |
| 1000000 | 500 | 10 | 5036 |
| 1000000 | 500 | 4096 | 5000 |
| 1000000 | 500 | 1000000 | 21960 |
| 1000000 | 1000 | 10 | 8852 |
| 1000000 | 1000 | 4096 | 8854 |
| 1000000 | 1000 | 1000000 | nan |
| 1000000 | 2000 | 10 | 16370 |
| 1000000 | 2000 | 4096 | 16572 |
| 1000000 | 2000 | 1000000 | nan |
| 1000000 | 5000 | 10 | nan |
| 1000000 | 5000 | 4096 | nan |
| 1000000 | 5000 | 1000000 | nan |
| 2000000 | 50 | 10 | 2120 |
| 2000000 | 50 | 4096 | 2144 |
| 2000000 | 50 | 2000000 | 13528 |
| 2000000 | 100 | 10 | 2884 |
| 2000000 | 100 | 4096 | 2920 |
| 2000000 | 100 | 2000000 | 19662 |
| 2000000 | 500 | 10 | 9102 |
| 2000000 | 500 | 4096 | 9080 |
| 2000000 | 500 | 2000000 | nan |
| 2000000 | 1000 | 10 | 16732 |
| 2000000 | 1000 | 4096 | 16756 |
| 2000000 | 1000 | 2000000 | nan |
| 2000000 | 2000 | 10 | nan |
| 2000000 | 2000 | 4096 | nan |
| 2000000 | 2000 | 2000000 | nan |
| 2000000 | 5000 | 10 | nan |
| 2000000 | 5000 | 4096 | nan |
| 2000000 | 5000 | 2000000 | nan |
nan is where CUDA ran out of memory on the 40 GB H100. It seems that going below the default batch size doesn't make much of a difference past a certain point; even at a batch size of 2, I run out of space with my actual data. Does the full n_samples * num_clusters * sizeof(float) need to be on the GPU at all times for the computations? At 4 bytes per float and over 2 million samples, 5000 clusters quickly exhaust the 40 GB. Couldn't it stay in host RAM, with only the current batch living on the GPU? It's so fast that I'd happily trade some speed for being able to run these larger models.
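To illustrate what keeping the result in host RAM could look like, a minimal sketch; compute_batch is a hypothetical per-batch entry point that cuML's current all_points_membership_vectors does not expose:
import numpy as np
import cupy as cp

def memberships_to_host(compute_batch, n_samples, n_clusters, batch_size=4096):
    # Accumulate the n_samples x n_clusters result in host memory, keeping only one
    # batch_size x n_clusters block on the GPU at a time. compute_batch(start, stop)
    # is assumed to return the membership rows for points start:stop as a CuPy array.
    out = np.empty((n_samples, n_clusters), dtype=np.float32)  # lives in host RAM
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        out[start:stop] = cp.asnumpy(compute_batch(start, stop))
    return out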
Any ideas? (:
