Retrieve topics for a given document, and other questions
Hi, I have a number of questions and I hope it is ok that I ask them together in one post!
So the context is that I have successfully trained a model on my corpus and produced a series of visualisations including
- the heat map (.visualize_heatmap(), choosing 20 topics)
- the topic viz (.visualize_topics(), choosing 20 topics)
My questions are
- Is there a way to produce a topic distribution/probability for one given input document? I notice there is a method .get_representative_docs(), but it only shows 3 documents.
- The topics are indexed by number. Does the number mean anything? E.g., is topic no. 0 more dominant/frequent than topic no. 3?
- I noticed that on the visuals created above, the heatmap shows topics from no.0 to 19. But the topic visualisation shows topics from no.1 to 19. Is this normal? Why is topic 0 discarded from the topic visualisation?
Many thanks!!
I'm not an expert, but here goes:
> Is there a way to produce a topic distribution/probability for one given input document? I notice there is a method .get_representative_docs(), but it only shows 3 documents.
If you set `calculate_probabilities=True`, BERTopic will generate a probability distribution over topics for each document. However, this can be quite resource-intensive; search the existing issues for more details.
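A minimal sketch of what that looks like, using the 20 newsgroups corpus from the BERTopic docs as a stand-in for your own documents:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Illustrative corpus; substitute your own list of documents.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# calculate_probabilities=True makes BERTopic compute a full
# topic-probability distribution per document instead of a single
# assignment (noticeably slower and more memory-hungry).
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# probs has one row per document and one probability per topic
# (the -1 outlier topic is not part of the distribution).
print(probs[0])

# For a single new document, transform() returns its topic and,
# with the flag enabled, its full topic distribution as well.
new_topics, new_probs = topic_model.transform(["a new input document"])
print(new_probs[0])
```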
> The topics are indexed by number. Does the number mean anything? E.g., is topic #0 more dominant/frequent than topic #3?
BERTopic re-maps the labels returned by the clustering algorithm (HDBSCAN by default) to 0 through n in descending order of frequency, so topic 0 is the most frequent topic, topic 1 the next, and so on. There is a known issue when -1 (the outlier group) is not the largest cluster, but since you should almost always discount -1 anyway, it shouldn't matter in practice.
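A quick way to verify that ordering on a fitted model (continuing the `topic_model` from the sketch above) is `get_topic_info()`, which returns a table of topics sorted by size:

```python
# Topic numbers encode frequency: topic 0 is the largest topic,
# topic 1 the next largest, and so on; -1 collects the outliers.
info = topic_model.get_topic_info()
print(info.head())  # columns include Topic, Count, and Name
```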
> I noticed that on the visuals created above, the heatmap shows topics from #0 to 19. But the topic visualisation shows topics from #1 to 19. Is this normal? Why is topic 0 discarded from the topic visualisation?
I believe that is a bug; see #546 for a hotfix.
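Until that lands in a release, one workaround you could try (a sketch; I'm not certain it sidesteps the bug on every version) is to pass the topic numbers explicitly rather than relying on the default selection:

```python
# Explicitly request topics 0-19 in the intertopic distance map
# instead of letting visualize_topics() pick the top topics itself.
topic_model.visualize_topics(topics=list(range(20)))
```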
Hope that helps.
Thanks, this is really helpful!