
Extract Topic Assignments from Model

Open drob-xx opened this issue 2 years ago • 2 comments

It seems that if a model is created without specifying nr_topics, then BERTopic.hdbscan_model.labels_ will return the initial assignments. Reading through the code, it looks like when BERTopic.reduce_topics() is called, a mapping between the initial and new assignments can be retrieved by calling BERTopic.topic_mapper.get_mappings(). Is this correct? Is it reliable to use .labels_ as indices into the mapping to get the new assignments?

drob-xx, Jun 18 '22 21:06

It seems that if a model is created without specifying nr_topics, then BERTopic.hdbscan_model.labels_ will return the initial assignments.

When you do not specify nr_topics, the topics in BERTopic.hdbscan_model.labels_ are still mapped according to their frequency, so that the most frequent topic is assigned to topic 0. This means that BERTopic.hdbscan_model.labels_ will not be the same as the topics that result from running .fit_transform.
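As a rough illustration of the frequency-based remapping described above, here is a minimal sketch in plain Python. It does not call BERTopic; the helper name `frequency_mapping`, the tie-breaking rule, and the sample labels are assumptions made for the example, and the library's own implementation may differ in detail.

```python
from collections import Counter


def frequency_mapping(raw_labels):
    """Hypothetical helper: map raw cluster labels to frequency-ranked ids.

    The outlier label -1 is kept as -1; every other label is renumbered so
    that the most frequent cluster becomes 0, the next one 1, and so on.
    """
    counts = Counter(label for label in raw_labels if label != -1)
    # Most frequent first; ties are broken by the smaller raw label
    # (an assumption made here only to keep the result deterministic).
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {-1: -1}
    mapping.update({old: new for new, old in enumerate(ranked)})
    return mapping


# Made-up raw cluster labels standing in for hdbscan_model.labels_.
raw_labels = [2, 2, 2, 0, 0, 1, -1, 2, 0, 1]
mapping = frequency_mapping(raw_labels)
topics = [mapping[label] for label in raw_labels]
```

In this toy data, raw cluster 2 is the largest, so it ends up as topic 0, which is why the raw labels cannot be read directly as final topic ids.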

Reading through the code, it looks like when BERTopic.reduce_topics() is called, a mapping between the initial and new assignments can be retrieved by calling BERTopic.topic_mapper.get_mappings()

Yes. Do note that it depends on which topics you want to map from: the original HDBSCAN topics or the frequency-sorted topics mentioned above. In either case, you can use the original_topics parameter to select the correct mapping.
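To see why the starting point matters, the following sketch chains two label mappings by hand. It is purely illustrative: the dictionaries are made up, `compose` is a hypothetical helper, and nothing here reproduces how BERTopic stores or returns its mappings.

```python
def compose(first, second):
    """Hypothetical helper: apply `first`, then `second`, to each label."""
    return {old: second.get(mid, mid) for old, mid in first.items()}


# Made-up example: raw cluster ids -> frequency-sorted topic ids.
raw_to_sorted = {-1: -1, 2: 0, 0: 1, 1: 2}
# Made-up example: a later reduction step merges sorted topics 0 and 1.
sorted_to_reduced = {-1: -1, 0: 0, 1: 0, 2: 1}

# Mapping to use when starting from the raw cluster ids.
raw_to_reduced = compose(raw_to_sorted, sorted_to_reduced)
```

Starting from raw label 1, the chained mapping gives topic 1, while looking up 1 in `sorted_to_reduced` alone gives topic 0. Applying a mapping to labels from the wrong stage therefore yields wrong assignments without raising any error.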

Is it reliable to use .labels_ as indices into the mapping to get the new assignments?

In practice, to get the correct labels, you would have to perform the following:

topics = topic_model._map_predictions(topic_model.hdbscan_model.labels_)

MaartenGr, Jun 19 '22 06:06

Thanks! I'm +1 for something like .get_document_topic_labels(original_labels=False) in the future if you are so inclined.

drob-xx, Jun 19 '22 18:06

You can now get the topics of the most recent fit with .topics_, so this issue will be closed. If, however, you have any other questions regarding this, please let me know.

MaartenGr, Sep 27 '22 08:09