
[FEA] Reduce memory pressure in HDBSCAN all_points_membership_vectors (or provide/link to best practices)

Status: Open · beckernick opened this issue 3 years ago

HDBSCAN now supports soft clustering with all_points_membership_vectors. This functionality is fast enough that it's practical to use on large datasets. However, memory pressure can quickly become an issue with some combinations of HDBSCAN parameters (including the defaults) on large datasets for which HDBSCAN identifies many clusters.

To the extent that we can reduce this memory pressure, we should explore it. The GPU implementation is so much faster than CPU-based alternatives that it may be worth trading some performance for a better user experience. We could also document this behavior and recommended best practices, or link to existing guides if they exist.

To illustrate with a real-world example, I took the Million News Headlines dataset from Kaggle and encoded each document into an embedding space using the all-MiniLM-L6-v2 transformer model. I then selected the first 200,000 documents and used cuML's UMAP to reduce the dimensionality to 5. This derived dataset is available here (NVIDIA).

In this example, memory spikes to about 40 GB on my 48GB RTX 8000 GPU. Scaling to larger datasets becomes infeasible.

import cuml
import numpy as np
import pandas as pd

X = np.load("million_news_articles_embeddings_reduced_200000_5.npy")

clusterer = cuml.cluster.hdbscan.HDBSCAN(
    prediction_data=True
)
clusterer.fit(X)
soft_cluster = cuml.cluster.hdbscan.all_points_membership_vectors(clusterer)
pd.Series(clusterer.labels_).nunique()  # 5095

I can reduce the peak memory to about 5 GB by changing the min_samples parameter to 20 (from the default of 5):

import cuml
import numpy as np
import pandas as pd

X = np.load("million_news_articles_embeddings_reduced_200000_5.npy")

clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_samples=20,
    prediction_data=True,
)
clusterer.fit(X)
soft_cluster = cuml.cluster.hdbscan.all_points_membership_vectors(clusterer)
pd.Series(clusterer.labels_).nunique()  # 1548

I observe this behavior with the current 22.10 nightly as well as when I build from source including https://github.com/rapidsai/cuml/pull/4872

Reproducing this behavior with synthetic data may require intentionally creating many clusters with low standard deviation. With the default configuration, I observe memory spikes of up to about 20 GB.

from sklearn.datasets import make_blobs
import cuml
import numpy as np
import pandas as pd

X, y = make_blobs(
    n_samples=200000,
    centers=2000,
    n_features=5,
    cluster_std=0.1,
    random_state=0
)

clusterer = cuml.cluster.hdbscan.HDBSCAN(prediction_data=True)
clusterer.fit(X)
soft_cluster = cuml.cluster.hdbscan.all_points_membership_vectors(clusterer)
pd.Series(clusterer.labels_).nunique()  # 1999

beckernick avatar Aug 31 '22 13:08 beckernick

As a note, the memory pressure is not observed during fit, regardless of whether the PredictionData is being generated.

import cuml
import numpy as np
import pandas as pd

X = np.load("million_news_articles_embeddings_reduced_200000_5.npy")

clusterer = cuml.cluster.hdbscan.HDBSCAN(
    prediction_data=True
)
clusterer.fit(X)

beckernick avatar Aug 31 '22 14:08 beckernick

I think there are two likely culprits here:

  1. Allocating the membership vectors themselves: topic modeling workloads often produce a large number of clusters, so allocating an n_samples * n_clusters array on the device can be memory intensive. One possible fix is to add a batch_size hyperparameter to all_points_membership_vectors and compute the membership vectors in batches.
  2. Many clusters also means many exemplars, so the pairwise distance matrix of shape n_samples * n_exemplars used for computing distance memberships can also be very large. Batching would help here as well (see the sketch after this list).
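
To make the batching idea concrete, here is a minimal NumPy sketch (illustrative only, not cuML's implementation or API; the exemplar-distance step stands in for the real membership math) showing how only a batch_size x n_exemplars block of distances needs to be materialized at a time:

import numpy as np

def batched_exemplar_distances(X, exemplars, batch_size=4096):
    # Only a (batch_size, n_exemplars) distance block is allocated per
    # iteration, instead of the full (n_samples, n_exemplars) matrix at once.
    # Note: the output itself is still n_samples x n_exemplars, so it has to
    # fit somewhere regardless of batching.
    n_samples = X.shape[0]
    out = np.empty((n_samples, exemplars.shape[0]), dtype=np.float32)
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        diff = X[start:stop, None, :] - exemplars[None, :, :]
        out[start:stop] = np.sqrt((diff * diff).sum(axis=-1))
    return out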

tarang-jain avatar Aug 31 '22 14:08 tarang-jain


Thanks for implementing this feature!

Has anyone had issues using all_points_membership_vectors with a large dataset? (for me, original embeddings are 235002 x 384). It causes my Python kernel to fail. Apologies if this isn't enough detail.

Originally posted by @hatemr in https://github.com/rapidsai/cuml/issues/4724#issuecomment-1370284884

@hatemr can you provide the error you are getting when the Python kernel fails and describe your environment (type of GPU, etc...)?

Originally posted by @cjnolet in https://github.com/rapidsai/cuml/issues/4724#issuecomment-1370288634

g4dn.8xlarge

[screenshots attached in the original comment]

Originally posted by @hatemr in https://github.com/rapidsai/cuml/issues/4724#issuecomment-1370295491

We've observed that there can be challenges with memory pressure at scale, usually if there are a large number of clusters. @hatemr , could you also share the HDBSCAN configuration you're using?

If you're able to use larger values for min_samples, the workload will require less memory.

beckernick avatar Jan 03 '23 23:01 beckernick

HDBSCAN(min_cluster_size=20,
        metric='euclidean',
        cluster_selection_method='eom',
        prediction_data=True)

Thanks, I'll try adjusting min_samples

hatemr avatar Jan 03 '23 23:01 hatemr

Increasing min_samples worked! I had to increase min_samples itself - increasing min_cluster_size while keeping min_samples=None didn't work.

In summary, it seems that as you increase your training set size, you should also increase min_samples (and maybe min_cluster_size) to avoid crashes.
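
For reference, a hedged example of the configuration change being described here (the values are illustrative, not recommendations):

import cuml

# Set min_samples explicitly; as reported above, raising min_cluster_size
# alone while leaving min_samples=None did not avoid the crash.
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_cluster_size=20,
    min_samples=20,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True,
)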

hatemr avatar Jan 04 '23 15:01 hatemr

Another user ran into this during topic modeling (cross-posting for convenience in addition to the cross-link).

I see you're using calculate_probabilities=True. The difference is likely that in BERTopic v0.13 the HDBSCAN soft probabilities are being calculated with cuML's HDBSCAN, whereas in BERTopic v0.12 they were calculated with the CPU HDBSCAN (or skipped if you passed cuML's HDBSCAN). This is a good thing in general, as calculating the soft probabilities on 2 million documents with CPU HDBSCAN would likely be very slow (see this blog for some loose benchmarks, in which 400K documents took 15+ hours on a CPU vs. 1-2 seconds on a GPU, though results will vary across CPU and GPU types).

Computing the soft probabilities with cuML's HDBSCAN can have significant memory requirements if the number of clusters is very large. We have an open issue to address this, and it includes some recommendations for working around the memory issues in the meantime by changing the HDBSCAN configuration. If you prefer to use a very small minimum cluster size, you may need to use the CPU version for now (but you can still use cuML's UMAP).
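
As a rough sketch of that fallback (cuML UMAP on the GPU plus the CPU hdbscan package for soft clustering, wired in via BERTopic's umap_model / hdbscan_model constructor arguments; parameter values are illustrative only):

import cuml
import hdbscan
from bertopic import BERTopic

# cuML UMAP for dimensionality reduction; CPU hdbscan keeps a small
# minimum cluster size feasible (illustrative values).
umap_model = cuml.manifold.UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    calculate_probabilities=True,  # soft probabilities computed by CPU HDBSCAN
)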

I'm going to cross-link https://github.com/MaartenGr/BERTopic/issues/922 and recommend we take further discussion to that issue. Since this issue is now a duplicate of #4879 from a cuML perspective, I'm going to close it. But please feel free to continue the discussion in the BERTopic issue or in #4879

Originally posted by @beckernick in https://github.com/rapidsai/cuml/issues/5127#issuecomment-1380440159

beckernick avatar Jan 12 '23 14:01 beckernick

I've switched to 23.04 and tried using HDBSCAN with batch_size set to various values, but I still quickly run out of memory. Before I investigate further, I was wondering whether you have any benchmarks on the memory savings achieved through batching, and maybe a general (very coarse) rule of thumb for estimating how much memory is needed given the input size and parameters?

kuchenrolle avatar Apr 17 '23 13:04 kuchenrolle

Could you share the code you're using (and the number of unique clusters HDBSCAN finds in your data)?

The batching is done for computing the pairwise distance matrix between the data and the exemplar points for each cluster. This is the most memory-intensive step, but other steps may still peak higher than the memory available on a given GPU, depending on the data. In sample tests we've seen fairly significant memory reductions, though ultimately it will depend on your data size and the number of clusters.

@tarang-jain , perhaps we could add some documentation about memory best practices?

# %load_ext gpu_memory_profiler 

import cuml
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=200000,
    centers=500,
    n_features=5,
    random_state=12
)
X = X.astype("float32")

clusterer = cuml.cluster.HDBSCAN(prediction_data=True, min_samples=5).fit(X)
print(len(np.unique(clusterer.labels_)))
%gpu_memit cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=len(X)) # simulating v23.02 behavior
%gpu_memit cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=4096) # new default
# Output:
# 339
# Peak GPU memory: 8101.00 MiB
# Peak GPU memory: 2169.00 MiB

beckernick avatar Apr 17 '23 14:04 beckernick

Regardless of the batch_size, the GPU memory must be large enough to store the membership vector (n_rows * n_clusters * sizeof(float)), in addition to some cached data that we use to compute the soft cluster memberships. We do not have a rule of thumb as of now, but in our test runs a batch_size of 4096 reduced memory usage while maintaining high volatile GPU utilization. I would suggest starting with a small batch size of 32-64 and gradually increasing it to get the best performance.
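
As a quick back-of-the-envelope check on that formula, using the 200,000-sample, ~5,095-cluster example from the top of this issue:

# Lower bound from n_rows * n_clusters * sizeof(float32); the actual peak is
# higher because of the cached data and intermediate buffers mentioned above.
n_rows, n_clusters = 200_000, 5_095
membership_bytes = n_rows * n_clusters * 4
print(f"{membership_bytes / 1024**3:.2f} GiB")  # ~3.80 GiB for the output alone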

tarang-jain avatar Apr 17 '23 14:04 tarang-jain

Hi, I'm looking into this now, but I can't seem to find that gpu_memory_profiler extension you're using for measuring gpu memory usage - can someone point me to it? @beckernick

kuchenrolle avatar Nov 06 '23 13:11 kuchenrolle

Hi @kuchenrolle , would you be able to share some details about your use case and data? We've improved the memory quite a bit using the default batch size of 4096. Do you run into issues at every batch size?

The extension isn't available as a package at the moment, but we can evaluate the feasibility of releasing it. It's essentially just measuring the output of nvidia-smi, so you should be able to visually evaluate the peak memory during your workflow with something like: watch -n 0.25 "nvidia-smi -i 0" (for GPU 0).

beckernick avatar Nov 06 '23 21:11 beckernick

I was getting some weird results when using the Python NVML bindings and when running things in a separate process (to avoid the kernel crashing when I run out of memory), but I've worked around that now. The problem is that my dataset is a good bit larger (a bit over 2 million data points with 25-100 features), so even with fairly small batches I run out of memory quickly. This is the code I'm using:

import cuml
import multiprocessing as mp
import os
import subprocess
import threading
import time
import torch

from sklearn.datasets import make_blobs


class GPUMemoryProfiler:
    def __init__(self, gpu=0):
        self.gpu = gpu
        self.running = False

    def monitor_memory(self):
        self.peak_memory_usage = gpu_memory_used(self.gpu)
        while self.running:
            current_memory = gpu_memory_used(self.gpu)
            self.peak_memory_usage = max(self.peak_memory_usage, current_memory)
            time.sleep(0.2)

    def __enter__(self):
        self.running = True
        self.initial_memory_usage = gpu_memory_used(self.gpu)
        self.memory_thread = threading.Thread(target=self.monitor_memory)
        self.memory_thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.running = False
        self.memory_thread.join()


def gpu_memory_used(gpu: int = 0):
    command = ['nvidia-smi', f'--id={gpu}', '--query-gpu=memory.used', '--format=csv']
    memory_used_info = subprocess.check_output(command).decode('ascii').split('\n')[-2]
    return int(memory_used_info.split()[0])


def hdbscan_run(n_samples=1_000_000, centers=500, batch_size=4_096):
    X, y = make_blobs(
        n_samples=n_samples,
        centers=centers,
        n_features=50,
        random_state=2311
    )
    X = X.astype("float32")

    clusterer = cuml.cluster.HDBSCAN(prediction_data=True, min_samples=5).fit(X)

    with GPUMemoryProfiler() as batching:
        cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=batch_size)

    print(f"Peak memory for {n_samples} samples from {centers} centers with batch size {batch_size} was: {batching.peak_memory_usage}")

def hdbscan_profile():
    mp.set_start_method("spawn")

    for n_samples in (200_000, 500_000, 1_000_000, 2_000_000):
        for centers in (50, 100, 500, 1_000, 2_000, 5_000):
            for batch_size in (10, 4_096, n_samples):
                p = run_in_subprocess_on_gpu(hdbscan_run,
                                             gpu=0,
                                             n_samples=n_samples,
                                             centers=centers,
                                             batch_size=batch_size)
                p.join()
                torch.cuda.empty_cache()

def run_in_subprocess_on_gpu(fun, gpu, *args, **kwargs):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)

    process = mp.Process(target=fun, args=args, kwargs=kwargs)
    process.start()

    return process

And these are the results I get:

| n_samples | centers | batch_size | memory (MB) |
|-----------|---------|------------|-------------|
| 200000 | 50 | 10 | 1104 |
| 200000 | 50 | 4096 | 910 |
| 200000 | 50 | 200000 | 910 |
| 200000 | 100 | 10 | 1066 |
| 200000 | 100 | 4096 | 1144 |
| 200000 | 100 | 200000 | 988 |
| 200000 | 500 | 10 | 1674 |
| 200000 | 500 | 4096 | 1722 |
| 200000 | 500 | 200000 | 2056 |
| 200000 | 1000 | 10 | 2438 |
| 200000 | 1000 | 4096 | 2438 |
| 200000 | 1000 | 200000 | 3316 |
| 200000 | 2000 | 10 | 3966 |
| 200000 | 2000 | 4096 | 4250 |
| 200000 | 2000 | 200000 | 12196 |
| 200000 | 5000 | 10 | 8548 |
| 200000 | 5000 | 4096 | 8952 |
| 200000 | 5000 | 200000 | 28214 |
| 500000 | 50 | 10 | 1172 |
| 500000 | 50 | 4096 | 1184 |
| 500000 | 50 | 500000 | 1188 |
| 500000 | 100 | 10 | 1364 |
| 500000 | 100 | 4096 | 1384 |
| 500000 | 100 | 500000 | 1170 |
| 500000 | 500 | 10 | 2888 |
| 500000 | 500 | 4096 | 3062 |
| 500000 | 500 | 500000 | 10056 |
| 500000 | 1000 | 10 | 4796 |
| 500000 | 1000 | 4096 | 4896 |
| 500000 | 1000 | 500000 | 17068 |
| 500000 | 2000 | 10 | 8616 |
| 500000 | 2000 | 4096 | 8796 |
| 500000 | 2000 | 500000 | 30630 |
| 500000 | 5000 | 10 | nan |
| 500000 | 5000 | 4096 | nan |
| 500000 | 5000 | 500000 | nan |
| 1000000 | 50 | 10 | 1490 |
| 1000000 | 50 | 4096 | 1490 |
| 1000000 | 50 | 1000000 | 1608 |
| 1000000 | 100 | 10 | 1870 |
| 1000000 | 100 | 4096 | 1898 |
| 1000000 | 100 | 1000000 | 8322 |
| 1000000 | 500 | 10 | 5036 |
| 1000000 | 500 | 4096 | 5000 |
| 1000000 | 500 | 1000000 | 21960 |
| 1000000 | 1000 | 10 | 8852 |
| 1000000 | 1000 | 4096 | 8854 |
| 1000000 | 1000 | 1000000 | nan |
| 1000000 | 2000 | 10 | 16370 |
| 1000000 | 2000 | 4096 | 16572 |
| 1000000 | 2000 | 1000000 | nan |
| 1000000 | 5000 | 10 | nan |
| 1000000 | 5000 | 4096 | nan |
| 1000000 | 5000 | 1000000 | nan |
| 2000000 | 50 | 10 | 2120 |
| 2000000 | 50 | 4096 | 2144 |
| 2000000 | 50 | 2000000 | 13528 |
| 2000000 | 100 | 10 | 2884 |
| 2000000 | 100 | 4096 | 2920 |
| 2000000 | 100 | 2000000 | 19662 |
| 2000000 | 500 | 10 | 9102 |
| 2000000 | 500 | 4096 | 9080 |
| 2000000 | 500 | 2000000 | nan |
| 2000000 | 1000 | 10 | 16732 |
| 2000000 | 1000 | 4096 | 16756 |
| 2000000 | 1000 | 2000000 | nan |
| 2000000 | 2000 | 10 | nan |
| 2000000 | 2000 | 4096 | nan |
| 2000000 | 2000 | 2000000 | nan |
| 2000000 | 5000 | 10 | nan |
| 2000000 | 5000 | 4096 | nan |
| 2000000 | 5000 | 2000000 | nan |

nan is where CUDA ran out of memory on the 40 GB H100. It seems that going below the default batch size doesn't make much of a difference past a certain point; even at a batch size of 2 I run out of space with my actual data. Does the full n_samples * num_clusters * sizeof(float) array need to be on the GPU at all times for the computations? At 4 bytes per float and over 2 million samples, 5000 clusters quickly exhaust the 40 GB. Couldn't it stay in RAM, with only the current batch living on the GPU? It's so fast, I'd happily trade some speed for being able to run these larger models.
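
A rough CuPy sketch of the memory layout being proposed (full result in host RAM, only the exemplars and one batch of rows on the device); this is illustrative only, not cuML's API, and it uses a toy score in place of HDBSCAN's actual membership computation:

import cupy as cp
import numpy as np

def host_resident_batched_scores(X_host, exemplars_host, batch_size=4096):
    # Exemplars are small, so they stay resident on the device.
    exemplars = cp.asarray(exemplars_host, dtype=cp.float32)
    n_samples = X_host.shape[0]
    # The full result lives in host RAM rather than GPU memory.
    out = np.empty((n_samples, exemplars.shape[0]), dtype=np.float32)
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        batch = cp.asarray(X_host[start:stop], dtype=cp.float32)
        # The (batch_size, n_exemplars) distance block is the only large device buffer.
        d = cp.linalg.norm(batch[:, None, :] - exemplars[None, :, :], axis=-1)
        out[start:stop] = cp.asnumpy(1.0 / (1.0 + d))  # toy score, copied back to host
    return out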

kuchenrolle avatar Dec 01 '23 00:12 kuchenrolle

Any ideas? (:

kuchenrolle avatar Dec 19 '23 21:12 kuchenrolle