
Combination of KeyBERT + BERTopic returns an error

Open mpoiaganova opened this issue 1 year ago • 7 comments

Hello,

Not sure whether this is an issue with KeyBERT or more with BERTopic, but I am trying to run KeyBERT + BERTopic as explained in the documentation and getting a ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I am running the exact same two cells as in the documentation, so the problem does not come from my own data or input. Attaching the error log screenshots. Thanks in advance!

[error log screenshots attached]

mpoiaganova avatar Mar 08 '23 10:03 mpoiaganova

With respect to your code, it is difficult to say without seeing the full picture. What is in vocabulary? How many words are in there? Also, how many documents are you passing to BERTopic? More specifically, it might be that you do not have enough words in the vocabulary for each cluster to actually contain at least one word.
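To illustrate the point above, a quick way to sanity-check the vocabulary before passing it to BERTopic is to flatten the KeyBERT output and count the unique words. The snippet below is a minimal sketch with toy data (the `keywords_per_doc` structure mirrors KeyBERT's list-of-(keyword, score)-tuples-per-document output, but the values are made up):

```python
# Toy KeyBERT-style output: one list of (keyword, score) tuples per document.
keywords_per_doc = [
    [("topic modeling", 0.71), ("clustering", 0.55)],
    [("embeddings", 0.68)],
    [],  # a document for which no keywords were extracted
]

# Flatten into the deduplicated vocabulary that would be handed to
# CountVectorizer(vocabulary=...) for BERTopic.
vocabulary = sorted({kw for doc_kws in keywords_per_doc for kw, _ in doc_kws})
print(len(vocabulary))  # 3

# A tiny or empty vocabulary means some clusters may end up with no
# countable words at all, which can surface downstream as NaN errors.
assert vocabulary, "Vocabulary is empty -- BERTopic will have nothing to count"
```

If the unique-word count is small relative to the number of clusters, that is a plausible cause of the error above.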

MaartenGr avatar Mar 09 '23 12:03 MaartenGr

Also, if you are interested in using a KeyBERT-like algorithm in BERTopic, I would advise applying BERTopic's KeyBERTInspired representation model.

MaartenGr avatar Mar 09 '23 12:03 MaartenGr

> Also, if you are interested in using a KeyBERT-like algorithm in BERTopic, I would advise applying BERTopic's KeyBERTInspired representation model.

Hi Maarten, I am facing a related (though not identical) problem when using KeyBERT + KeyphraseVectorizer to generate a vocabulary for BERTopic. It gives two kinds of issues: 1) memory issues, and 2) kernel crashes, even for 20k abstracts (when running on WSL). It works on Windows for up to 100k abstracts. I want to know: first, can we speed up the combination of KeyBERT + KeyphraseVectorizer (for 100k abstracts, vocabulary generation took 13 hours)? Second, how can I resolve this repeated kernel-dying problem? Below is the code I am using.

```python
import logging
import os
import pickle

import torch
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
from sentence_transformers import SentenceTransformer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
logging.info("Starting KeyBERT...")
sentence_model = SentenceTransformer("paraphrase-MiniLM-L12-v2", device=device)
sentence_model = sentence_model.to(device)
logging.info(f"Using device: {sentence_model.device}")
kw_model = KeyBERT(sentence_model)

vectorizer_model = KeyphraseCountVectorizer()

# Check if a cached keyword file exists
# (YEAR_MONTH and abstracts are defined earlier in the notebook)
keyword_file = f"{YEAR_MONTH}/keywords.dump"
if os.path.exists(keyword_file):
    with open(keyword_file, "rb") as fp:
        keywords = pickle.load(fp)
else:
    keywords = kw_model.extract_keywords(
        abstracts,
        vectorizer=vectorizer_model,
        use_mmr=True,
        keyphrase_ngram_range=(1, 5),
    )
    with open(keyword_file, "wb") as fp:
        pickle.dump(keywords, fp)
logging.info(f"Extracted {len(keywords)} keywords.")
```

For the above code, the kernel crashes even for 20k abstracts on WSL.

Thanks in advance!

rubypnchl avatar Mar 09 '23 13:03 rubypnchl

I believe this has to do with how the KeyphraseCountVectorizer creates the candidate keywords to be checked, which can be computationally quite expensive. Perhaps looking at the KeyphraseCountVectorizer hyperparameters might help, but I am not quite sure. I would advise sharing your use case at that repo.
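On the memory side, one common mitigation (not specific to KeyBERT, and not an official API) is to process the abstracts in fixed-size batches and persist intermediate results, so a crash mid-run does not lose everything and peak memory stays bounded. A generic sketch, with a toy function standing in for the expensive `kw_model.extract_keywords(...)` call:

```python
def process_in_batches(items, batch_size, fn):
    """Apply fn to items in fixed-size chunks and concatenate the results.

    fn takes a list of items and returns a list of results; each chunk could
    also be pickled to disk here as a checkpoint before moving on.
    """
    results = []
    for i in range(0, len(items), batch_size):
        results.extend(fn(items[i:i + batch_size]))
    return results

# Toy stand-in for keyword extraction: double each "abstract".
extracted = process_in_batches(list(range(10)), batch_size=4, fn=lambda b: [x * 2 for x in b])
print(extracted)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

In the snippet above, `fn` would be replaced by a call such as `lambda batch: kw_model.extract_keywords(batch, ...)`; whether batching actually helps here depends on where KeyphraseCountVectorizer allocates its memory.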

MaartenGr avatar Mar 10 '23 05:03 MaartenGr

> With respect to your code, it is difficult to say without seeing the full picture. What is in vocabulary? How many words are in there? Also, how many documents are you passing to BERTopic? More specifically, it might be that you do not have enough words in the vocabulary for each cluster to actually contain at least one word.

Thanks for the answer, and sorry I was not clear enough. I was running the exact same code as in the documentation (image attached), so the vocabulary was initialized as in that example. I thought reproducing that example should not result in such an error, or could it?

I also tried to test with my own documents and vocabulary, and made sure that the vocabulary contained enough words to cluster the documents, but that failed with the same error as well.

[screenshot attached: 2023-03-10 at 20:29:38]

mpoiaganova avatar Mar 10 '23 19:03 mpoiaganova

Hmmm, I am not entirely sure what is happening. I'll have to take a look. Either way, I would advise using KeyBERTInspired instead as it is much more optimized for this task and has similar performance. Moreover, I might just remove that piece of the documentation here as KeyBERTInspired was created for just this.

MaartenGr avatar Mar 12 '23 05:03 MaartenGr

Ok, I'll use KeyBERTInspired then.

Thanks for the advice, and for the effort you put into creating KeyBERT and BERTopic: great, helpful tools!

mpoiaganova avatar Mar 12 '23 15:03 mpoiaganova