BERTopic

Memory error with ~1m documents (no GPU available, low_memory=True)

Open Rorickt opened this issue 2 years ago • 3 comments

Hi!

First, thank you for the library, I'm really enjoying working with it!

I am working with documents that are multiple sentences. I split them up and work with each sentence. Afterward I (plan to) merge them back to end up with multiple topics per document.

However, after splitting my data I have around a million sentences, and this seems to crash the kernel when using fit_transform(). I get an error about not being able to allocate enough memory during the dimensionality reduction step with UMAP. When I set low_memory=True, I get the same error. It uses up all 32 GB of RAM that I have available. I have calculate_probabilities=False.

Should I just accept the limitations of my system and work with a smaller (randomized) subset of my full data to reduce the load? Or are there some tricks I can still apply?

Rorickt, Jul 20 '22 11:07

Thank you for your kind words!

Scalability can definitely be an issue when handling a million documents. Specifically for that reason, I created an FAQ page that has a bunch of tricks that can help you out with that! Hopefully, these should suffice in making it possible to train your model.

There are a few other tricks that you can do that might be a bit more advanced:

  • Fit on a smaller portion of the data and transform on the rest
  • Use another dimensionality reduction algorithm like PCA or another clustering algorithm like k-Means
  • Use GPU-accelerated UMAP and HDBSCAN (see this page)
  • Speed up UMAP with PCA-initialization (see this page)
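To make the first trick concrete, here is a minimal sketch of fitting on a random sample and then assigning topics to the remaining documents in batches. The helper names (split_for_fitting, fit_on_sample_transform_rest) and the sample_size, seed, and batch_size parameters are hypothetical and not part of BERTopic. The only assumption about the model is that fit_transform() and transform() each return a (topics, probabilities) pair, as BERTopic's do.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_for_fitting(n_docs: int, sample_size: int, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return (fit_indices, rest_indices): a random sample and everything else."""
    if not 0 < sample_size <= n_docs:
        raise ValueError("sample_size must be between 1 and n_docs")
    rng = random.Random(seed)
    fit_idx = sorted(rng.sample(range(n_docs), sample_size))
    chosen = set(fit_idx)
    rest_idx = [i for i in range(n_docs) if i not in chosen]
    return fit_idx, rest_idx


def fit_on_sample_transform_rest(
    topic_model,
    docs: Sequence[str],
    sample_size: int,
    seed: int = 42,
    batch_size: int = 50_000,
) -> List[Optional[int]]:
    """Fit on a random sample, then predict topics for the rest in batches.

    Hypothetical helper: topic_model is assumed to expose fit_transform()
    and transform(), each returning (topics, probabilities).
    """
    fit_idx, rest_idx = split_for_fitting(len(docs), sample_size, seed)
    topics: List[Optional[int]] = [None] * len(docs)

    # The expensive step (dimensionality reduction + clustering) only sees the sample.
    sample_topics, _ = topic_model.fit_transform([docs[i] for i in fit_idx])
    for i, topic in zip(fit_idx, sample_topics):
        topics[i] = int(topic)

    # Assign the remaining documents batch by batch to keep peak memory bounded.
    for start in range(0, len(rest_idx), batch_size):
        batch = rest_idx[start:start + batch_size]
        batch_topics, _ = topic_model.transform([docs[i] for i in batch])
        for i, topic in zip(batch, batch_topics):
            topics[i] = int(topic)

    return topics
```

With BERTopic installed, something like fit_on_sample_transform_rest(BERTopic(low_memory=True, calculate_probabilities=False), sentences, sample_size=200_000) should work, where sentences is your list of split sentences and 200_000 is only an illustrative number; what fits in 32 GB depends on the embeddings and UMAP settings. The returned list is in the original document order, which should make merging sentences back into their documents easier.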

MaartenGr, Jul 20 '22 12:07

Thank you for your quick response!

I went through that page and it was indeed helpful! I have adjusted my parameters to follow those tips, but to no avail. I do not have access to a GPU, so unfortunately that is out. I was hoping not to have to fit on just a part of the data and transform on the rest, as it would be a bit of a shame ;)

I missed the tip about PCA-initialization, so I'll try that too!

Also, I just now saw there is another comment posted under issues dealing with exactly this! I'm sorry for repeating the question! There are good discussions in these sections and I should learn from those too!

Thanks again!

Rorickt, Jul 20 '22 12:07

No problem! Please feel free to post any questions or concerns you have, even if they might already be mentioned somewhere else. Your use case might be different, and it would be a shame if a simple fix were overlooked because of that 😄

MaartenGr, Jul 22 '22 15:07