empty topics

Open vistamou opened this issue 2 years ago • 7 comments

Hello,

I'm getting empty topics such as ('', 1e-05). Could this be due to a preprocessing step? Do you have any idea what could be causing it? Thank you!

vistamou avatar Jul 07 '22 08:07 vistamou

It might be that the documents in that specific topic are (near-)empty, which would also result in an empty topic. I would suggest exploring the documents in that topic to see if anything stands out. For that, you could use .get_representative_docs() or, if there are not that many, simply check all the documents belonging to that topic. With topic modeling, it is important to check the output by digging into the documents themselves to see if they make sense.
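
A minimal sketch of the "check all the documents belonging to that topic" route, independent of BERTopic's own helpers. It assumes `docs` is the list passed to fit_transform and `topics` is the list of topic ids it returned, in the same order. The function name `docs_by_topic` and the sample data are hypothetical.

```python
from collections import defaultdict


def docs_by_topic(docs, topics):
    """Group documents by their assigned topic id."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)


# Hypothetical sample data standing in for real documents and assignments.
docs = ["cats purr", "", "dogs bark", " ", "cats meow"]
topics = [0, 1, 2, 1, 0]

grouped = docs_by_topic(docs, topics)
for topic_id, members in sorted(grouped.items()):
    # Print the stripped length of each document to spot (near-)empty ones.
    lengths = [len(d.strip()) for d in members]
    print(topic_id, len(members), lengths)
```

A topic whose members all have a stripped length of zero or close to it is a likely candidate for an empty representation.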

MaartenGr avatar Jul 07 '22 08:07 MaartenGr

Does that relate to the parameter for the default number of topics, i.e. forcing the number of topics to a certain value, or is that not the reason? I'm trying to calculate coherence, but that's not possible if topics are empty.

vistamou avatar Jul 07 '22 16:07 vistamou

It is not necessarily related to the parameter of the default number of topics given. Most likely it is related to the documents that make up the topic. If you have a significant number of documents that are empty, then it actually makes sense that they are put into a separate topic, since they are quite similar to one another. The solution is straightforward: simply remove documents that are (near-)empty by doing something like selected_docs = [doc for doc in docs if len(doc) >= 5].
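
A slightly more defensive variant of that one-liner, as a sketch. The threshold of 5 characters comes from the suggestion above; stripping whitespace, skipping non-string entries, and returning the kept indices are additions made for illustration, and the name `drop_near_empty` is hypothetical.

```python
def drop_near_empty(docs, min_chars=5):
    """Return (kept_docs, kept_indices) for documents that are strings
    with at least `min_chars` characters after stripping whitespace."""
    kept_docs, kept_indices = [], []
    for i, doc in enumerate(docs):
        if isinstance(doc, str) and len(doc.strip()) >= min_chars:
            kept_docs.append(doc)
            kept_indices.append(i)
    return kept_docs, kept_indices


# Hypothetical sample data: empty, whitespace-only, too short, and missing entries.
docs = ["a full sentence about topic models", "", "   \n", "ok", None, "short but fine"]
selected_docs, kept = drop_near_empty(docs)
print(len(docs) - len(selected_docs), "documents removed")
```

Keeping the indices makes it possible to map topic assignments back to the original corpus after filtering.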

MaartenGr avatar Jul 08 '22 06:07 MaartenGr

I see, thank you for your suggestion, I tried to check content of docs but they seem ok, but still I have empty tuples in topics and when I'm trying to get coherence scores I get the following error:

ValueError: unable to interpret topic as either a list of tokens or a list of ids

vistamou avatar Jul 08 '22 07:07 vistamou

> I see, thank you for your suggestion, I tried to check content of docs but they seem ok, but still I have empty tuples in topics

Just to be on the same page: the documents in the topics with empty tuples are not empty? Do you get non-empty documents when you run topic_model.get_representative_docs(topic=my_empty_tuples_topic)?

If those documents are not empty and not extremely small, then it might be helpful if I can get a bit more information about your use case:

  • Could you share your entire code for training BERTopic?
  • Which version of BERTopic are you using?
  • How many documents are you giving to BERTopic?
  • How many topics are created?

MaartenGr avatar Jul 08 '22 07:07 MaartenGr

  • Name: bertopic, Version: 0.10.0
  • Regarding the documents, it's weird: if I give it a smaller part of my corpus (10000 sentences), the coherence calculation works, although I still get topics with empty tuples ('', 1e-05). When I increase the size, it breaks (I'm trying to figure out whether something in the input itself causes the issue).
  • 143 topics

from bertopic import BERTopic
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel

# mylist is the list of input documents (strings)
topic_model = BERTopic(verbose=True, embedding_model="xxx")
topics, _ = topic_model.fit_transform(mylist)

# Tokenize the documents the same way BERTopic's vectorizer does
cleaned_docs = topic_model._preprocess_text(mylist)
vectorizer = topic_model.vectorizer_model
tokenizer = vectorizer.build_analyzer()
tokens = [tokenizer(doc) for doc in cleaned_docs]

# Build the gensim dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]

# Collect the non-empty words of topics 0..n-1
# (assumes the outlier topic -1 is present in `topics`)
topic_words = [[word for word, _ in topic_model.get_topic(topic) if word != '']
               for topic in range(len(set(topics)) - 1)]

# Drop topics that are left without any words
topic_words = [words for words in topic_words if words]

coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()
print(coherence)
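
One possible cause of the ValueError, offered as an assumption rather than a diagnosis confirmed in this thread: a topic word list that is empty, or that contains no token known to the gensim dictionary, cannot be interpreted by CoherenceModel. The sketch below cleans the topic word lists before they are passed on. It uses a plain Python set as a stand-in for the dictionary vocabulary (in the code above, `dictionary.token2id` could play that role); the name `clean_topic_words` and the sample data are hypothetical.

```python
def clean_topic_words(topic_words, vocabulary):
    """Keep only non-empty words found in `vocabulary`, and drop
    topics that end up with no words at all."""
    cleaned = []
    for words in topic_words:
        known = [w for w in words if w and w in vocabulary]
        if known:
            cleaned.append(known)
    return cleaned


# Hypothetical vocabulary and topic word lists, including an all-empty topic
# and a topic whose only word is out of vocabulary.
vocabulary = {"cat", "dog", "bark", "purr"}
topic_words = [["cat", "purr", ""], ["", ""], ["unseen"], ["dog", "bark"]]

cleaned = clean_topic_words(topic_words, vocabulary)
print(len(topic_words) - len(cleaned), "topics dropped before computing coherence")
```

Dropping topics changes what the coherence score covers, so it is worth reporting how many topics were removed alongside the score.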

vistamou avatar Jul 08 '22 09:07 vistamou

Could you share what kind of documents you get when you run topic_model.get_representative_docs(topic=my_empty_tuples_topic)? Could you also give the output of topic_model.get_topic(topic=my_empty_tuples_topic)? That way, we can be concrete about the output.
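
To find which topic ids to pass as my_empty_tuples_topic, something like the following sketch could help. It assumes a dict that maps each topic id to a list of (word, score) tuples, which is the shape topic_model.get_topics() is expected to return; the function name `find_empty_topics` and the sample data are hypothetical.

```python
def find_empty_topics(topic_representations):
    """Return the sorted ids of topics whose words are all empty strings."""
    empty = []
    for topic_id, word_scores in topic_representations.items():
        # A topic counts as empty when none of its words has visible content.
        if not any(word.strip() for word, _ in word_scores):
            empty.append(topic_id)
    return sorted(empty)


# Hypothetical representations: topic 1 is fully empty, topic 2 only partly.
representations = {
    0: [("cat", 0.12), ("purr", 0.08)],
    1: [("", 1e-05), ("", 1e-05)],
    2: [("dog", 0.10), ("", 1e-05)],
}

empty_ids = find_empty_topics(representations)
print(empty_ids)
```

Each returned id can then be inspected with get_representative_docs and get_topic as asked above.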

MaartenGr avatar Jul 09 '22 12:07 MaartenGr

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!

MaartenGr avatar Sep 27 '22 08:09 MaartenGr