BERTopic GPU accelerated UMAP and HDBSCAN issues: memory and predict

Hello everyone,

First issue: memory

cuml.manifold.UMAP crashes with the following error whenever I call fit_transform on more than 1,500,000 documents.

2022-07-27 15:40:09,788 - BERTopic - Transformed documents to Embeddings
Traceback (most recent call last):
  File "/home/natethegreat/bertopic/bertopic_model_cuml.py", line 17, in <module>
    topic_model.fit(docs)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 237, in fit
    self.fit_transform(documents, embeddings, y)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 313, in fit_transform
    umap_embeddings = self._reduce_dimensionality(embeddings, y)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 2070, in _reduce_dimensionality
    umap_embeddings = self.umap_model.transform(embeddings)
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/manifold/umap.pyx", line 730, in cuml.manifold.umap.UMAP.transform
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /home/natethegreat/miniconda3/envs/torchrapids/include/rmm/mr/device/cuda_memory_resource.hpp
Segmentation fault

The same error does not occur if I use the CPU versions of UMAP and HDBSCAN. My understanding is that it happens because the amount of dedicated GPU memory is much smaller than regular RAM (8 GB vs 128 GB in my case).

Are there any options to circumvent this issue? Like maybe we could split the process into smaller batches or use shared gpu memory?
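To make the batching idea concrete: the traceback shows the failure inside the reducer's transform call, so one option would be to feed rows to an already fitted reducer in fixed-size chunks. The sketch below is only an illustration under that assumption. `transform_in_batches` and `fake_transform` are hypothetical names, not part of BERTopic or cuML, and the sketch does not reduce the memory needed by the fit step itself. Whether BERTopic offers a hook for this is not established in this thread.

```python
def transform_in_batches(transform, rows, batch_size):
    """Apply `transform` to `rows` in chunks of at most `batch_size` rows.

    `transform` is any callable that maps a batch of rows to a sequence
    of output rows of the same length (for example, the bound `transform`
    method of an already fitted reducer).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Only one batch is handed to the transform at a time.
        results.extend(transform(batch))
    return results


# Hypothetical stand-in for a fitted reducer: maps each row to its sum.
def fake_transform(batch):
    return [[sum(row)] for row in batch]


rows = [[i, i + 1] for i in range(10)]
reduced = transform_in_batches(fake_transform, rows, batch_size=4)
print(len(reduced))
```

With a GPU reducer, each batch result would presumably also need to be copied back to host memory before the next batch runs. That step is left out here.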

Also, a model trained on fewer than 1,500,000 documents shows two odd behaviors:

- The resulting topics often contain duplicated words, e.g. "14_gold_gold money_silver_money gold".
- Regardless of whether training succeeds, the message "Segmentation fault" is printed at the end.

Second issue: model.predict

With a model trained with the GPU-accelerated versions of UMAP and HDBSCAN, running model.transform([sentence]) causes the following error:

Traceback (most recent call last):
  File "", line 1, in <module>
  File "/home/natethegreat/miniconda3/envs/torchrapids/lib/python3.9/site-packages/bertopic/_bertopic.py", line 404, in transform
    predictions = self.hdbscan_model.predict(umap_embeddings)
  File "cuml/common/base.pyx", line 269, in cuml.common.base.Base.__getattr__
AttributeError: predict

Thank you regardless of whether something comes out of this.

code:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
from sklearn.feature_extraction.text import CountVectorizer
import pickle

# GPU-accelerated dimensionality reduction and clustering (cuML)
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=100)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=True,
    calculate_probabilities=False,
    low_memory=True,
)

# docs_0.35.pkl: 220 MB, approx. 1.6M samples
with open("docs_0.35.pkl", "rb") as f:
    docs = pickle.load(f)

topic_model.fit(docs)

with open("bert_model_2.0.pkl", "wb") as f:
    pickle.dump(topic_model, f)

Jul 27 '22 14:07 NateTheGreat001

Hello, I get the same problem as in the second issue (model.predict).

Did you manage to solve it?

Jul 29 '22 08:07 pepi99

Sadly not. Still waiting for someone to bail us out.

Jul 29 '22 14:07 ghost

The first issue is likely an out-of-memory (OOM) error. UMAP should not spike memory too significantly, but its requirements plus other data already on your GPU may be enough to cause an OOM. UMAP creates a sparse KNN graph on the order of n_samples * n_neighbors entries, so reducing the number of neighbors will reduce the memory needed for that data structure.
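As a rough back-of-the-envelope illustration of that scaling, the sketch below estimates the size of the KNN graph and of the input embeddings. It is not a model of cuML's actual allocations. The byte sizes (8-byte indices, 4-byte float distances and embedding values) and the 384-dimensional embeddings are assumptions chosen for illustration; substitute the values for your own setup.

```python
def knn_graph_bytes(n_samples, n_neighbors, index_bytes=8, dist_bytes=4):
    """Approximate size of a KNN graph: one index and one distance per edge.

    index_bytes and dist_bytes are assumed sizes, not values taken from cuML.
    """
    return n_samples * n_neighbors * (index_bytes + dist_bytes)


def embeddings_bytes(n_samples, n_dims, value_bytes=4):
    """Approximate size of a dense float matrix of document embeddings."""
    return n_samples * n_dims * value_bytes


def to_gib(n_bytes):
    return n_bytes / 2**30


n_docs = 1_600_000   # approximate corpus size from the original post
n_dims = 384         # assumed embedding dimensionality

print(f"embeddings: {to_gib(embeddings_bytes(n_docs, n_dims)):.2f} GiB")
for k in (15, 10, 5):
    print(f"KNN graph, n_neighbors={k}: {to_gib(knn_graph_bytes(n_docs, k)):.2f} GiB")
```

Under these assumptions the graph grows linearly with n_neighbors (roughly 0.27 GiB at 15 neighbors), while the embeddings alone take roughly 2.3 GiB. Intermediate buffers and anything else resident on an 8 GB card are not counted.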

The predict error is due to cuML's HDBSCAN not yet supporting approximate_predict. Please see https://github.com/MaartenGr/BERTopic/issues/647#issuecomment-1199863402 for more context.

Jul 29 '22 20:07 beckernick

Yes, as beckernick said, just remove the hdbscan_model=... argument when you initialize your BERTopic model. No problems after that! :)

Jul 31 '22 21:07 pepi99

Thank you guys!

Aug 01 '22 13:08 ghost

approximate_predict function for HDBSCAN: https://github.com/rapidsai/cuml/pull/4872
Nightly Release: https://github.com/rapidsai/cuml/releases/tag/v22.10.00a

Installed the nightly release but I'm still getting the error mentioned above... Any ideas?

Sep 26 '22 02:09 sebastien-mcrae

@sebastien-mcrae Support for cuML's HDBSCAN as a one-to-one replacement for the CPU HDBSCAN is not yet implemented in BERTopic, and it will take some time before it is fully implemented.

In your case, when you use cuML's HDBSCAN, BERTopic recognizes it as a generic cluster model, not necessarily an HDBSCAN-like model. As such, it falls back to what it expects from cluster models, namely that they provide .fit and .predict functions.
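The requirement described above can be illustrated with a small duck-typing check. This is a sketch of the idea only, not BERTopic's actual dispatch code. `missing_cluster_methods`, `FitOnlyClusterer`, and `FitPredictClusterer` are hypothetical names used for illustration.

```python
def missing_cluster_methods(model, required=("fit", "predict")):
    """Return the names in `required` that `model` does not expose as callables."""
    return [
        name for name in required
        if not callable(getattr(model, name, None))
    ]


class FitOnlyClusterer:
    """Hypothetical stand-in for a clusterer that can fit but not predict."""

    def fit(self, X):
        self.labels_ = [0 for _ in X]
        return self


class FitPredictClusterer(FitOnlyClusterer):
    """Hypothetical stand-in for a clusterer with the full generic interface."""

    def predict(self, X):
        return [0 for _ in X]


print(missing_cluster_methods(FitOnlyClusterer()))     # ['predict']
print(missing_cluster_methods(FitPredictClusterer()))  # []
```

A model in the first category trains without complaint and only fails later, when transform tries to call .predict on it, which matches the AttributeError in the original post.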

Sep 26 '22 08:09 MaartenGr

Due to inactivity and with the v0.13 release of BERTopic supporting more native functions of cuML, I'll be closing this for now. If you have any questions or want to continue the discussion, I'll make sure to re-open the issue!

Jan 09 '23 12:01 MaartenGr