[FEA] Reduce memory pressure in HDBSCAN all_points_membership_vectors (or provide/link to best practices)
HDBSCAN now supports soft clustering with all_points_membership_vectors. This functionality is fast enough to be practical on large datasets. However, memory pressure can quickly become an issue with some combinations of HDBSCAN parameters (including the defaults) on large datasets in which HDBSCAN identifies many clusters.
To the extent that we can reduce this memory pressure, we should explore it. GPU performance is so far ahead of the CPU implementation that it may be worth trading some speed for a better user experience. We could also document this behavior and best practices, or link to existing guides if they exist.
To illustrate with a real-world example, I took the Million News Headlines dataset from Kaggle and encoded each document into an embedding space using the all-MiniLM-L6-v2 transformer model. I then selected the first 200,000 documents and used cuML's UMAP to reduce the dimensionality to 5. This derived dataset is available here (NVIDIA).
In this example, memory spikes to about 40 GB on my 48GB RTX 8000 GPU. Scaling to larger datasets becomes infeasible.
import cuml
import numpy as np
import pandas as pd
X = np.load("million_news_articles_embeddings_reduced_200000_5.npy")
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    prediction_data=True
)
clusterer.fit(X)
soft_cluster = cuml.cluster.hdbscan.all_points_membership_vectors(clusterer)
pd.Series(clusterer.labels_).nunique()
5095
I can reduce the pressure to 5 GB by changing the min_samples parameter to 20 (from the default of 5):
import cuml
import numpy as np
import pandas as pd
X = np.load("million_news_articles_embeddings_reduced_200000_5.npy")
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_samples=20,
    prediction_data=True,
)
clusterer.fit(X)
soft_cluster = cuml.cluster.hdbscan.all_points_membership_vectors(clusterer)
pd.Series(clusterer.labels_).nunique()
1548
I observe this behavior in the current 22.10 nightly as well as when I build from source and include https://github.com/rapidsai/cuml/pull/4872.
Reproducing this behavior with synthetic data may require intentionally creating many low standard deviation clusters. With the default configuration, I observe memory spikes of up to about 20GB.
from sklearn.datasets import make_blobs
import cuml
import numpy as np
import pandas as pd
X, y = make_blobs(
    n_samples=200000,
    centers=2000,
    n_features=5,
    cluster_std=0.1,
    random_state=0
)
clusterer = cuml.cluster.hdbscan.HDBSCAN(prediction_data=True)
clusterer.fit(X)
soft_cluster = cuml.cluster.hdbscan.all_points_membership_vectors(clusterer)
pd.Series(clusterer.labels_).nunique()
1999
As a note, the memory pressure is not observed during fit, regardless of whether the PredictionData is being generated.
import cuml
import numpy as np
import pandas as pd
X = np.load("million_news_articles_embeddings_reduced_200000_5.npy")
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    prediction_data=True
)
clusterer.fit(X)
I think there might be two culprits here:
- Allocating the membership vectors themselves: for topic modeling tasks, we have observed a large number of clusters, and allocating n_samples * n_clusters on the device can therefore be memory intensive. One possible future fix is to add a batch_size hyperparameter to all_points_membership_vectors and compute the membership vectors in batches.
- A lot of clusters also means a lot of exemplars, and the pairwise distance matrix of shape n_samples * n_exemplars, which is used for computing distance memberships, can also be very large. Again, using a batch_size might fix the issue (a rough sketch of this batching follows below).
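A minimal sketch of that batching idea, purely for illustration: it uses CuPy rather than cuML's internal CUDA code, and the exemplars and exemplar_cluster arrays are hypothetical stand-ins for the exemplar points and their cluster labels stored in the prediction data.
import cupy as cp

def batched_min_exemplar_distance(X, exemplars, exemplar_cluster, n_clusters, batch_size=4096):
    # For each point, compute its minimum distance to the exemplars of every cluster,
    # processing rows in batches so that only a (batch_size, n_exemplars) distance
    # block is resident on the GPU instead of the full (n_samples, n_exemplars) matrix.
    n_samples = X.shape[0]
    out = cp.empty((n_samples, n_clusters), dtype=cp.float32)
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        batch = X[start:stop]
        # squared Euclidean distances via the ||a||^2 + ||b||^2 - 2ab expansion
        d2 = (cp.sum(batch ** 2, axis=1)[:, None]
              + cp.sum(exemplars ** 2, axis=1)[None, :]
              - 2.0 * batch @ exemplars.T)
        dist = cp.sqrt(cp.maximum(d2, 0.0))
        for c in range(n_clusters):
            out[start:stop, c] = dist[:, exemplar_cluster == c].min(axis=1)
    return out
The n_samples x n_clusters output still has to fit somewhere, but the much larger n_samples x n_exemplars intermediate never exists all at once.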
Thanks for implementing this feature!
Has anyone had issues using all_points_membership_vectors with a large dataset? (for me, original embeddings are 235002 x 384). It causes my Python kernel to fail. Apologies if this isn't enough detail.
Originally posted by @hatemr in https://github.com/rapidsai/cuml/issues/4724#issuecomment-1370284884
@hatemr can you provide the error you are getting when the Python kernel fails and describe your environment (type of GPU, etc...)?
Originally posted by @cjnolet in https://github.com/rapidsai/cuml/issues/4724#issuecomment-1370288634
g4dn.8xlarge
Originally posted by @hatemr in https://github.com/rapidsai/cuml/issues/4724#issuecomment-1370295491
We've observed that there can be challenges with memory pressure at scale, usually if there are a large number of clusters. @hatemr , could you also share the HDBSCAN configuration you're using?
If you're able to use larger values for min_samples, the workload will require less memory.
HDBSCAN(min_cluster_size=20,
metric='euclidean',
cluster_selection_method='eom',
prediction_data=True)
Thanks, I'll try adjusting min_samples
Increasing min_samples worked! I had to increase min_samples itself - increasing min_cluster_size while keeping min_samples=None didn't work.
In summary, it seems that as you increase your training set size, to avoid crashes you should increase min_samples (and maybe min_cluster_size) too.
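A minimal sketch of the adjusted configuration that summary suggests; the specific values are illustrative rather than tuned recommendations:
import cuml

# Illustrative values only: raising min_samples (and optionally min_cluster_size)
# tends to yield fewer clusters and therefore a smaller soft-clustering footprint.
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_samples=50,
    min_cluster_size=100,
    prediction_data=True,
)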
Another user ran into this during topic modeling (cross-posting for convenience in addition to the cross-link).
I see you're using calculate_probabilities=True. The difference is likely that in BERTopic v0.13 the HDBSCAN soft probabilities are being calculated with cuML's HDBSCAN, whereas in BERTopic v0.12 they were being calculated with the CPU HDBSCAN (or skipped if you passed cuML's HDBSCAN). This is a good thing in general, as calculating the soft probabilities on 2 million documents with CPU HDBSCAN would likely be very slow (see this blog for some loose benchmarks in which it took 15+ hours on a CPU to process 400K documents vs. 1-2 seconds on a GPU, though results will vary across CPU and GPU types).
Computing the soft probabilities with cuML's HDBSCAN can have significant memory requirements if the number of clusters is very large. We have an open issue to address this, and it includes some recommendations for working around the memory pressure in the meantime by changing the HDBSCAN configuration. If you prefer to use a very small minimum cluster size, you may need to use the CPU version for now (but you can still use cuML's UMAP).
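For reference, a minimal sketch of that mixed setup, assuming the CPU hdbscan package is installed and using random data as a stand-in for real document embeddings:
import numpy as np
import cuml
import hdbscan  # CPU package

embeddings = np.random.rand(10_000, 384).astype("float32")  # stand-in for real embeddings

reduced = cuml.manifold.UMAP(n_components=5).fit_transform(embeddings)  # GPU UMAP

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)  # CPU HDBSCAN
clusterer.fit(reduced)
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)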
I'm going to cross-link https://github.com/MaartenGr/BERTopic/issues/922 and recommend we take further discussion to that issue. Since this issue is now a duplicate of #4879 from a cuML perspective, I'm going to close it. But please feel free to continue the discussion in the BERTopic issue or in #4879
Originally posted by @beckernick in https://github.com/rapidsai/cuml/issues/5127#issuecomment-1380440159
I've switched to 23.04 and tried using HDBSCAN with batch_size set to various values, but I still quickly run out of memory. Before I investigate this further, I was wondering whether you have any benchmarks on the memory savings achieved through batching, and maybe a general (very coarse) rule of thumb for estimating how much memory might be needed given the input size and parameters?
Could you share the code you're using (and the number of unique clusters HDBSCAN finds in your data)?
The batching is done for computing the pairwise distance matrix between the data and the exemplar points for each cluster. This is the most memory-intensive step, but other steps may still peak at higher memory than available on a given GPU depending on the data. In sample tests, we've seen fairly significant memory reductions, though ultimately it will depend on your data size and the number of clusters.
@tarang-jain , perhaps we could add some documentation about memory best practices?
# %load_ext gpu_memory_profiler
import cuml
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(
    n_samples=200000,
    centers=500,
    n_features=5,
    random_state=12
)
X = X.astype("float32")
clusterer = cuml.cluster.HDBSCAN(prediction_data=True, min_samples=5).fit(X)
print(len(np.unique(clusterer.labels_)))
%gpu_memit cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=len(X)) # simulating v23.02 behavior
%gpu_memit cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=4096) # new default
339
Peak GPU memory: 8101.00 MiB
Peak GPU memory: 2169.00 MiB
Regardless of the batch_size, the GPU memory must be large enough to store the full set of membership vectors (n_rows * n_clusters * sizeof(float)), in addition to some cached data that we use to compute the soft cluster memberships. We do not have a rule of thumb as of now, but in our test runs a batch_size of 4096 reduced memory usage while maintaining high volatile GPU utilization. I would suggest starting with a small batch size of 32-64 and gradually increasing it to get the best performance.
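As a rough illustration of that floor, using the examples above (this only accounts for the output matrix, not the cached prediction data or temporary buffers):
def membership_matrix_gib(n_rows, n_clusters, bytes_per_value=4):
    # lower bound: the n_rows x n_clusters float32 output alone
    return n_rows * n_clusters * bytes_per_value / 2**30

print(membership_matrix_gib(200_000, 339))    # synthetic example above: ~0.25 GiB
print(membership_matrix_gib(200_000, 5_095))  # news headlines example: ~3.8 GiB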
Hi, I'm looking into this now, but I can't seem to find that gpu_memory_profiler extension you're using for measuring gpu memory usage - can someone point me to it? @beckernick
Hi @kuchenrolle , would you be able to share some details about your use case and data? We've improved the memory quite a bit using the default batch size of 4096. Do you run into issues at every batch size?
The extension isn't available as a package at the moment, but we can evaluate the feasibility of releasing it. It's essentially just measuring the output of nvidia-smi, so you should be able to visually evaluate the peak memory during your workflow with something like: watch -n 0.25 "nvidia-smi -i 0" (for GPU 0).
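For a programmatic reading instead of watching nvidia-smi, a minimal sketch using the pynvml bindings (assuming pynvml is installed); polling this in a loop or background thread approximates the peak:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20
print(f"GPU memory in use: {used_mib:.0f} MiB")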
I was getting some weird results when using the Python NVML bindings and when running things in a separate process (to avoid the kernel crashing when I run out of memory), but I've worked around that now. The problem is that my dataset is a good bit larger (a bit over 2 million datapoints with 25-100 features), so even with fairly small batches I run out of memory quickly. This is the code I'm using:
import cuml
import multiprocessing as mp
import os
import subprocess
import threading
import time
import torch
from sklearn.datasets import make_blobs
class GPUMemoryProfiler:
    def __init__(self, gpu=0):
        self.gpu = gpu
        self.running = False

    def monitor_memory(self):
        self.peak_memory_usage = gpu_memory_used(self.gpu)
        while self.running:
            current_memory = gpu_memory_used(self.gpu)
            self.peak_memory_usage = max(self.peak_memory_usage, current_memory)
            time.sleep(0.2)

    def __enter__(self):
        self.running = True
        self.initial_memory_usage = gpu_memory_used(self.gpu)
        self.memory_thread = threading.Thread(target=self.monitor_memory)
        self.memory_thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.running = False
        self.memory_thread.join()

def gpu_memory_used(gpu: int = 0):
    command = ['nvidia-smi', f'--id={gpu}', '--query-gpu=memory.used', '--format=csv']
    memory_used_info = subprocess.check_output(command).decode('ascii').split('\n')[-2]
    return int(memory_used_info.split()[0])

def hdbscan_run(n_samples=1_000_000, centers=500, batch_size=4_096):
    X, y = make_blobs(
        n_samples=n_samples,
        centers=centers,
        n_features=50,
        random_state=2311
    )
    X = X.astype("float32")
    clusterer = cuml.cluster.HDBSCAN(prediction_data=True, min_samples=5).fit(X)
    with GPUMemoryProfiler() as batching:
        cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=batch_size)
    print(f"Peak memory for {n_samples} samples from {centers} centers with batch size {batch_size} was: {batching.peak_memory_usage}")

def hdbscan_profile():
    mp.set_start_method("spawn")
    for n_samples in (200_000, 500_000, 1_000_000, 2_000_000):
        for centers in (50, 100, 500, 1_000, 2_000, 5_000):
            for batch_size in (10, 4_096, n_samples):
                p = run_in_subprocess_on_gpu(hdbscan_run,
                                             gpu=0,
                                             n_samples=n_samples,
                                             centers=centers,
                                             batch_size=batch_size)
                p.join()
                torch.cuda.empty_cache()

def run_in_subprocess_on_gpu(fun, gpu, *args, **kwargs):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    process = mp.Process(target=fun, args=args, kwargs=kwargs)
    process.start()
    return process
And these are the results I get:
| n_samples | centers | batch_size | memory (MB) |
|---|---|---|---|
| 200000 | 50 | 10 | 1104 |
| 200000 | 50 | 4096 | 910 |
| 200000 | 50 | 200000 | 910 |
| 200000 | 100 | 10 | 1066 |
| 200000 | 100 | 4096 | 1144 |
| 200000 | 100 | 200000 | 988 |
| 200000 | 500 | 10 | 1674 |
| 200000 | 500 | 4096 | 1722 |
| 200000 | 500 | 200000 | 2056 |
| 200000 | 1000 | 10 | 2438 |
| 200000 | 1000 | 4096 | 2438 |
| 200000 | 1000 | 200000 | 3316 |
| 200000 | 2000 | 10 | 3966 |
| 200000 | 2000 | 4096 | 4250 |
| 200000 | 2000 | 200000 | 12196 |
| 200000 | 5000 | 10 | 8548 |
| 200000 | 5000 | 4096 | 8952 |
| 200000 | 5000 | 200000 | 28214 |
| 500000 | 50 | 10 | 1172 |
| 500000 | 50 | 4096 | 1184 |
| 500000 | 50 | 500000 | 1188 |
| 500000 | 100 | 10 | 1364 |
| 500000 | 100 | 4096 | 1384 |
| 500000 | 100 | 500000 | 1170 |
| 500000 | 500 | 10 | 2888 |
| 500000 | 500 | 4096 | 3062 |
| 500000 | 500 | 500000 | 10056 |
| 500000 | 1000 | 10 | 4796 |
| 500000 | 1000 | 4096 | 4896 |
| 500000 | 1000 | 500000 | 17068 |
| 500000 | 2000 | 10 | 8616 |
| 500000 | 2000 | 4096 | 8796 |
| 500000 | 2000 | 500000 | 30630 |
| 500000 | 5000 | 10 | nan |
| 500000 | 5000 | 4096 | nan |
| 500000 | 5000 | 500000 | nan |
| 1000000 | 50 | 10 | 1490 |
| 1000000 | 50 | 4096 | 1490 |
| 1000000 | 50 | 1000000 | 1608 |
| 1000000 | 100 | 10 | 1870 |
| 1000000 | 100 | 4096 | 1898 |
| 1000000 | 100 | 1000000 | 8322 |
| 1000000 | 500 | 10 | 5036 |
| 1000000 | 500 | 4096 | 5000 |
| 1000000 | 500 | 1000000 | 21960 |
| 1000000 | 1000 | 10 | 8852 |
| 1000000 | 1000 | 4096 | 8854 |
| 1000000 | 1000 | 1000000 | nan |
| 1000000 | 2000 | 10 | 16370 |
| 1000000 | 2000 | 4096 | 16572 |
| 1000000 | 2000 | 1000000 | nan |
| 1000000 | 5000 | 10 | nan |
| 1000000 | 5000 | 4096 | nan |
| 1000000 | 5000 | 1000000 | nan |
| 2000000 | 50 | 10 | 2120 |
| 2000000 | 50 | 4096 | 2144 |
| 2000000 | 50 | 2000000 | 13528 |
| 2000000 | 100 | 10 | 2884 |
| 2000000 | 100 | 4096 | 2920 |
| 2000000 | 100 | 2000000 | 19662 |
| 2000000 | 500 | 10 | 9102 |
| 2000000 | 500 | 4096 | 9080 |
| 2000000 | 500 | 2000000 | nan |
| 2000000 | 1000 | 10 | 16732 |
| 2000000 | 1000 | 4096 | 16756 |
| 2000000 | 1000 | 2000000 | nan |
| 2000000 | 2000 | 10 | nan |
| 2000000 | 2000 | 4096 | nan |
| 2000000 | 2000 | 2000000 | nan |
| 2000000 | 5000 | 10 | nan |
| 2000000 | 5000 | 4096 | nan |
| 2000000 | 5000 | 2000000 | nan |
nan is where CUDA ran out of memory on the 40 GB H100. It seems that going below the default batch size doesn't make much of a difference past a certain point; even at a batch size of 2, I run out of space with my actual data. Does the full n_samples * num_clusters * sizeof(float) need to be on the GPU at all times for the computations? At 4 bytes per float and over 2 million samples, 5000 clusters quickly exhaust the 40 GB. Couldn't it stay in host RAM, with only the current batch living on the GPU? It's so fast that I'd happily trade some speed for being able to run these larger models.
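To illustrate what keeping the result in host RAM could look like, a minimal sketch; compute_batch is a hypothetical per-batch entry point that cuML's current all_points_membership_vectors does not expose:
import numpy as np
import cupy as cp

def memberships_to_host(compute_batch, n_samples, n_clusters, batch_size=4096):
    # Accumulate the n_samples x n_clusters result in host memory, keeping only one
    # batch_size x n_clusters block on the GPU at a time. compute_batch(start, stop)
    # is assumed to return the membership rows for points start:stop as a CuPy array.
    out = np.empty((n_samples, n_clusters), dtype=np.float32)  # lives in host RAM
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        out[start:stop] = cp.asnumpy(compute_batch(start, stop))
    return out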
Any ideas? (:
