BERTopic
ctfidf breaks down when specifying a vocabulary in CountVectorizer
In some cases, the stop_words parameter of the CountVectorizer is not enough to prevent certain unwanted words from coming through. For example, one may want to filter out non-verbs such as abbreviations before creating topic representations.
This can be done by specifying a vocabulary in the CountVectorizer object (see the sklearn docs).
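For context, a minimal sketch of what passing a fixed vocabulary looks like (the terms and documents here are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# only these terms are counted; every other token is ignored
allowed_terms = ["topic", "model", "cluster"]
vectorizer = CountVectorizer(vocabulary=allowed_terms)
X = vectorizer.fit_transform(["a topic model", "cluster the documents"])
print(vectorizer.get_feature_names_out())  # ['topic' 'model' 'cluster']
print(X.toarray())                         # [[1 1 0] [0 0 1]]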
However, a problem that then occurs is that ctfidf breaks down due to a division by zero in line 82 of _ctfidf.py:
idf = np.log((avg_nr_samples / df)+1)
because some words in the specified vocabulary may never actually occur, which leaves their document frequency df at zero.
I would therefore propose to change the line above to
idf = np.log((avg_nr_samples / np.maximum(df, 1))+1)
This solution does not change the behaviour in normal cases and adds the option to specify a vocabulary when creating topic representations.
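As a standalone illustration of the failure mode and the proposed guard (plain NumPy, not BERTopic's actual code path):

import numpy as np

avg_nr_samples = 100.0
# df holds the per-word frequencies; a word from the fixed vocabulary that
# never occurs in any document ends up with df = 0
df = np.array([5.0, 0.0, 12.0])

idf_current = np.log((avg_nr_samples / df) + 1)                 # inf where df == 0, with a RuntimeWarning
idf_guarded = np.log((avg_nr_samples / np.maximum(df, 1)) + 1)  # finite everywhere, identical where df >= 1

print(idf_current)  # [3.04...    inf 2.23...]
print(idf_guarded)  # [3.04... 4.61... 2.23...]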
Thanks for the issue and PR. Before I check it out, do you perhaps have a reproducible example? That way, I can verify the issue. Also, what would be the impact of your change on the wall time and output? Does your change influence a regular run?
Hi @MaartenGr, I've been seeing this warning a lot too. I think it's relevant to the way I ended up working after the discussion in #1665 so this example should be relevant.
from sklearn.datasets import fetch_20newsgroups
from keybert import KeyBERT
import numpy as np
import re
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
# Prepare documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
def preprocess_text(documents: np.ndarray):
""" Basic preprocessing of text
Steps:
* Replace \n and \t with whitespace
* Only keep alpha-numerical characters
"""
cleaned_documents = [doc.replace("\n", " ") for doc in documents]
cleaned_documents = [doc.replace("\t", " ") for doc in cleaned_documents]
cleaned_documents = [re.sub(r'[^A-Za-z0-9 ]+', '', doc) for doc in cleaned_documents]
cleaned_documents = [doc if doc != "" else "emptydoc" for doc in cleaned_documents]
return cleaned_documents
docs = preprocess_text(docs)
pre_vectorizer_model = CountVectorizer(min_df=10, ngram_range=(1,3), stop_words="english")
pre_vectorizer_model.fit(docs)
vocabulary = list(set(pre_vectorizer_model.vocabulary_.keys()))
vectorizer_model = CountVectorizer(vocabulary=vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model, verbose=True)
topics, probs = topic_model.fit_transform(docs)
2024-01-23 11:11:37,358 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100% 589/589 [00:17<00:00, 102.73it/s]
2024-01-23 11:11:55,788 - BERTopic - Embedding - Completed ✓
2024-01-23 11:11:55,789 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-01-23 11:12:06,768 - BERTopic - Dimensionality - Completed ✓
2024-01-23 11:12:06,770 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-01-23 11:12:10,754 - BERTopic - Cluster - Completed ✓
2024-01-23 11:12:10,760 - BERTopic - Representation - Extracting topics from clusters using representation models.
c:\path\lib\site-packages\bertopic\vectorizers\_ctfidf.py:82: RuntimeWarning: divide by zero encountered in divide
idf = np.log((avg_nr_samples / df)+1)
2024-01-23 11:12:14,768 - BERTopic - Representation - Completed ✓
I haven't run it through the PR yet though.
Note that setting the ngram_range in the pre_vectorizer_model seems to be required to produce the warning.
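If it helps with verifying, here is a rough check built on the example above (my own guess at the cause, not something I've confirmed against the PR): vectorizer_model keeps the default ngram_range=(1, 1), so the bi- and trigram entries inherited from pre_vectorizer_model can end up with all-zero columns, i.e. a document frequency of 0.

import numpy as np

# reuses vectorizer_model, docs and vocabulary from the snippet above
X = vectorizer_model.fit_transform(docs)
df = np.asarray(X.sum(axis=0)).ravel()
print((df == 0).sum(), "of", len(vocabulary), "vocabulary terms never occur in the documents")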