Retrieve topics for a given document, and other questions

Open ziqizhang opened this issue 3 years ago • 2 comments

Hi, I have a number of questions and I hope it is ok that I ask them together in one post!

So the context is that I have successfully trained a model on my corpus and produced a series of visualisations including

  • the heat map (.visualize_heatmap(), choosing 20 topics)
  • the topic viz (.visualize_topics(), choosing 20 topics)

My questions are

  1. Is there a way to produce a topic distribution/probability for one given input document? I notice there is a method .get_representative_docs(), but it only shows 3 documents.
  2. The topics are indexed by number. Does the number mean anything? E.g., is topic no. 0 more dominant/frequent than topic no. 3?
  3. I noticed that on the visuals created above, the heatmap shows topics from no.0 to 19. But the topic visualisation shows topics from no.1 to 19. Is this normal? Why is topic 0 discarded from the topic visualisation?

Many thanks!!

ziqizhang, Jul 01 '22 15:07

I'm not an expert, but here goes:

Is there a way to produce a topic distribution/probability for one given input document? I notice there is a method .get_representative_docs(), but it only shows 3 documents.

If you set `calculate_probabilities=True` when creating the model, BERTopic will calculate the probability of each topic for every document. However, this can be quite resource-intensive. Search through the issues for more info.
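As a rough sketch of what that looks like: `fit_with_probabilities` below uses the `calculate_probabilities` argument and `fit_transform` from BERTopic, while `top_topics` and `example_row` are made-up names and values for illustration only. I'm assuming the returned `probs` has one row per document and one column per topic, so treat this as a starting point rather than the definitive way to do it.

```python
from typing import List, Sequence, Tuple


def fit_with_probabilities(docs: List[str]):
    """Fit BERTopic so that it also returns per-document topic probabilities.

    Needs the `bertopic` package; it is imported lazily so that the helper
    below can be tried without installing it.
    """
    from bertopic import BERTopic

    topic_model = BERTopic(calculate_probabilities=True)
    # Assumption: probs has one row per document and one column per topic
    # (the outlier topic -1 has no column of its own).
    topics, probs = topic_model.fit_transform(docs)
    return topic_model, topics, probs


def top_topics(prob_row: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
    """Hypothetical helper: the k most probable (topic_id, probability) pairs."""
    ranked = sorted(enumerate(prob_row), key=lambda pair: pair[1], reverse=True)
    return [(topic_id, float(p)) for topic_id, p in ranked[:k]]


# Made-up probability row for a single document over four topics.
example_row = [0.05, 0.60, 0.10, 0.25]
print(top_topics(example_row, k=2))
```

With a fitted model you would pass `probs[i]` for document `i` instead of `example_row`.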

The topics are indexed by number. Does the number mean anything? E.g., is topic no. 0 more dominant/frequent than topic no. 3?

BERTopic re-orders the labels returned from the clustering algorithm (HDBSCAN by default) so that topics 0...n are sorted in descending order of occurrence, i.e. topic 0 is the most frequent. There is an issue if the -1 (outlier) group is not the largest one - but since you should almost always discount -1, it shouldn't be a problem.
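To make the ordering concrete, here is a small standalone sketch of relabelling cluster labels by descending frequency. It is not BERTopic's actual code: `relabel_by_frequency` and the `raw` labels are invented for illustration, and keeping -1 fixed as -1 is my own simplifying assumption.

```python
from collections import Counter
from typing import Dict, List


def relabel_by_frequency(labels: List[int]) -> Dict[int, int]:
    """Map raw cluster labels to 0..n-1, largest cluster first.

    Assumption for this sketch: the outlier label -1 is left untouched.
    """
    counts = Counter(label for label in labels if label != -1)
    mapping = {-1: -1}
    # most_common() yields (label, count) pairs in descending order of count.
    for new_id, (old_id, _count) in enumerate(counts.most_common()):
        mapping[old_id] = new_id
    return mapping


# Made-up labels as a clustering algorithm might return them.
raw = [2, 2, 2, 0, 0, 1, -1, -1, 2, 0]
mapping = relabel_by_frequency(raw)
relabelled = [mapping[label] for label in raw]
print(mapping)
print(relabelled)
```

Here the largest raw cluster (label 2, four documents) ends up as topic 0, which is why a lower topic number means a more frequent topic.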

I noticed that on the visuals created above, the heatmap shows topics from no.0 to 19. But the topic visualisation shows topics from no.1 to 19. Is this normal? Why is topic 0 discarded from the topic visualisation?

I believe that is a bug - see #546 for a hot fix.

Hope that helps.

drob-xx, Jul 01 '22 16:07

Thanks, this is really helpful, thanks!

ziqizhang, Jul 01 '22 17:07