BERTopic
empty topics
Hello,
I'm getting empty topics such as ('', 1e-05). Could this be due to a preprocessing step? Do you have any idea what could be causing it? Thank you!
It might be that the documents in that specific topic are (near-)empty, which would also result in an empty topic. I would suggest exploring the documents in that topic to see if anything stands out. For that, you could use .get_representative_docs(), or simply check all the documents belonging to that topic if there are not that many. With topic modeling, it is important to check the output by digging into the documents themselves in order to see if they make sense.
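For example, a quick sketch of what that inspection could look like (my_empty_tuples_topic is just a placeholder for the id of one of the topics showing ('', 1e-05) tuples):

# Inspect the representative documents of a suspicious topic;
# `my_empty_tuples_topic` is a placeholder for the topic id in question.
for doc in topic_model.get_representative_docs(topic=my_empty_tuples_topic):
    print(repr(doc))  # repr() makes empty or whitespace-only strings easy to spot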
Does that relate to the default number of topics parameter, i.e., forcing the number of topics to a certain value, or is that not the reason? I'm trying to calculate coherence, but that is not possible if topics are empty.
It is not necessarily related to the default number of topics parameter. Most likely it is related to the documents that make up the topic. If you have a significant number of documents that are empty, then it actually makes sense that they are put into a separate topic, since they are quite similar to one another. The solution is straightforward: simply remove documents that are (near-)empty by doing something like selected_docs = [doc for doc in docs if len(doc) >= 5].
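As a rough sketch of filtering before fitting (the 5-character threshold is arbitrary and docs stands for your raw documents):

from bertopic import BERTopic

# Drop (near-)empty documents before fitting; the 5-character cut-off is arbitrary.
selected_docs = [doc for doc in docs if len(doc.strip()) >= 5]
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(selected_docs)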
I see, thank you for your suggestion. I checked the content of the docs and they seem OK, but I still have empty tuples in the topics, and when I try to get coherence scores I get the following error:
ValueError: unable to interpret topic as either a list of tokens or a list of ids
> I see, thank you for your suggestion. I checked the content of the docs and they seem OK, but I still have empty tuples in the topics
To be on the same page: the documents in the topics with empty tuples are not empty? Do you get non-empty documents when you run topic_model.get_representative_docs(topic=my_empty_tuples_topic)?
If those documents are not empty and not extremely small, then it might be helpful if I can get a bit more information about your use case:
- Could you share your entire code for training BERTopic?
- Which version of BERTopic are you using?
- How many documents are you giving to BERTopic?
- How many topics are created?
- Name: bertopic, Version: 0.10.0
- Regarding the documents, it's weird: if I give a smaller amount of my corpus (10,000 sentences), the coherence calculation works, besides the fact that I still get topics with empty tuples ('', 1e-05), but when I increase the size it breaks (I'm trying to figure out whether something related to the input itself causes the issue).
- 143 topics
from bertopic import BERTopic
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel

# Fit BERTopic (embedding model name redacted)
topic_model = BERTopic(verbose=True, embedding_model="xxx")
topics, _ = topic_model.fit_transform(mylist)

# Preprocess and tokenize the documents with BERTopic's own vectorizer
cleaned_docs = topic_model._preprocess_text(mylist)
vectorizer = topic_model.vectorizer_model
tokenizer = vectorizer.build_analyzer()
words = vectorizer.get_feature_names()
tokens = [tokenizer(doc) for doc in cleaned_docs]

# Build the gensim dictionary and corpus
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]

# Collect the topic words per topic, dropping empty strings
topic_words = [[word for word, _ in topic_model.get_topic(topic) if word != '']
               for topic in range(len(set(topics)) - 1)]
topic_words = list(filter(lambda t: '' not in t, topic_words))

# Compute c_v coherence
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()
print(coherence)
Could you share what kind of documents you get when you run topic_model.get_representative_docs(topic=my_empty_tuples_topic)? Also, could you give the output of topic_model.get_topic(topic=my_empty_tuples_topic)? That way, we can be concrete about the output.
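For example, something like this (where my_empty_tuples_topic is a placeholder for the id of one of the topics showing ('', 1e-05) tuples):

# Capture the exact output for one problematic topic;
# `my_empty_tuples_topic` is a placeholder for its id.
print(topic_model.get_topic(topic=my_empty_tuples_topic))
print(topic_model.get_representative_docs(topic=my_empty_tuples_topic))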
Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!