
Memory Issues

Open amoschoomy opened this issue 2 years ago • 2 comments

First up, thank you for your work; the BERTopic topic modelling results with this vectorizer are what I expected. However, I am running into out-of-memory issues in both Google Colab and Kaggle on my custom dataset of about 7,500 documents. I do not have a paid subscription to Google Cloud Platform or Colab Pro, so running my code on those is not an option. Are there any tricks or tips to optimise this vectorizer on large datasets? Thank you

amoschoomy avatar Mar 27 '22 11:03 amoschoomy
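For context, the setup being described is roughly the following (a minimal sketch; `KeyphraseCountVectorizer` is the vectorizer class from this package, BERTopic's `vectorizer_model` argument is the standard way to plug it in, and `load_my_documents` is a hypothetical placeholder for the ~7,500-document dataset):

```python
# Minimal sketch of the setup described in this issue.
from bertopic import BERTopic
from keyphrase_vectorizers import KeyphraseCountVectorizer

docs = load_my_documents()  # hypothetical loader for the ~7,500 custom documents

# BERTopic applies the vectorizer to the concatenated documents of each topic,
# which is where the memory pressure reported here shows up.
topic_model = BERTopic(vectorizer_model=KeyphraseCountVectorizer())
topics, probs = topic_model.fit_transform(docs)
```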

I encountered the same issue with BERTopic on a large dataset. The way BERTopic uses the vectorizer somehow results in huge memory consumption. I suspect the reason is that BERTopic passes the concatenated documents of a topic to the vectorizer as one huge string, and spaCy cannot handle such a large string. Unfortunately, I do not currently have a satisfactory solution for this problem. As soon as I have some time to spare, I will take a closer look at it.

TimSchopf avatar Apr 09 '22 11:04 TimSchopf
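To illustrate the limit being pointed at above (an editorial sketch, not code from either library): spaCy raises an error for texts longer than `nlp.max_length`, which defaults to 1,000,000 characters, and a topic's concatenated documents can easily exceed that. Raising the limit removes the error but increases memory use roughly in proportion to the text length, which fits the out-of-memory behaviour reported here. `docs_of_one_topic` is a hypothetical placeholder.

```python
# Sketch of the spaCy length limit described above.
import spacy

nlp = spacy.load("en_core_web_sm")

# Roughly what BERTopic hands to the vectorizer: one huge string per topic.
concatenated = " ".join(docs_of_one_topic)  # hypothetical: all documents of one topic

if len(concatenated) > nlp.max_length:
    # Avoids spaCy's "[E088] Text of length ... exceeds maximum" error, but
    # parsing such a long text is what drives memory consumption up.
    nlp.max_length = len(concatenated) + 1

doc = nlp(concatenated)
```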


I am also facing the same memory problem with a large dataset, and it also becomes quite slow on large datasets. I have combined KeyphraseVectorizers with KeyBERT. Any help is very much appreciated!

rubypnchl avatar Mar 09 '23 03:03 rubypnchl
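One way to keep memory bounded in the KeyBERT combination mentioned above is to extract keyphrases in smaller batches, so the vectorizer never processes the whole corpus at once. A rough sketch, assuming KeyBERT's `vectorizer` argument as shown in the KeyphraseVectorizers README; `extract_in_batches` and the batch size are hypothetical. Note that a count-based vectorizer fitted per batch only sees that batch's term statistics, so results can differ slightly from a single fit over all documents.

```python
# Rough sketch: batch the corpus so keyphrase extraction never holds
# the full dataset in the vectorizer at once.
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

kw_model = KeyBERT()

def extract_in_batches(docs, batch_size=500):  # hypothetical helper and batch size
    keywords = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        keywords.extend(
            kw_model.extract_keywords(docs=batch, vectorizer=KeyphraseCountVectorizer())
        )
    return keywords
```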

Solved with the v0.0.12 release.

TimSchopf avatar Apr 29 '24 13:04 TimSchopf