BERTopic
Extract Topic Assignments from Model
It seems that if a model is created without specifying nr_topics, then BERTopic.hdbscan_model.labels_ will return the initial assignments. Reading through the code, it looks like, once BERTopic.reduce_topics() has been called, a mapping between the initial and new assignments can be retrieved by calling BERTopic.topic_mapper.get_mappings(). Is this correct? Is it reliable to use .labels_ as indices into the mapping to get the new assignments?
It seems that if a model is created without specifying nr_topics, then BERTopic.hdbscan_model.labels_ will return the initial assignments.
When you do not specify nr_topics, the topics in BERTopic.hdbscan_model.labels_ are still remapped according to their frequency, so that the most frequent topic is assigned to topic 0. This means that BERTopic.hdbscan_model.labels_ will not be the same as the topics you get back after running .fit_transform.
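To illustrate why the raw labels and the final topics differ, here is a minimal sketch of frequency-based relabeling. It is not BERTopic's actual implementation; the function name frequency_sorted_mapping and the toy labels are made up for this example, and it assumes the outlier label -1 is left unchanged.

```python
from collections import Counter


def frequency_sorted_mapping(labels):
    """Hypothetical helper: map raw cluster labels to new ids so that the
    most frequent cluster becomes topic 0. The outlier label -1 stays -1."""
    counts = Counter(label for label in labels if label != -1)
    # most_common() orders clusters by descending size
    mapping = {old: new for new, (old, _) in enumerate(counts.most_common())}
    mapping[-1] = -1
    return mapping


# Toy stand-in for hdbscan_model.labels_ (assumed values, not real output)
raw_labels = [2, 2, 2, 0, 0, 1, -1, 2, 0, 1]

mapping = frequency_sorted_mapping(raw_labels)
topics = [mapping[label] for label in raw_labels]

print(mapping)  # raw cluster 2 is the largest, so it becomes topic 0
print(topics)
```

In this toy case raw cluster 2 has the most documents, so it ends up as topic 0, which is why the raw labels cannot be read as final topic ids.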
Reading through the code, it looks like, once BERTopic.reduce_topics() has been called, a mapping between the initial and new assignments can be retrieved by calling BERTopic.topic_mapper.get_mappings()
Yes. Note that it depends on which topics you want to map from: the original hdbscan topics, or the frequency-sorted topics mentioned above. In either case, you can use the original_topics parameter to select the correct mapping.
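The point about the starting labels can be sketched with plain dictionaries. The two mappings below are invented toy values, not output from BERTopic, and the variable names are hypothetical; the sketch only shows that a mapping gives the right result when it is applied to the labels it was built for.

```python
# Toy mapping from raw cluster labels to frequency-sorted topics (assumed values)
raw_to_sorted = {-1: -1, 2: 0, 0: 1, 1: 2}

# Toy mapping from sorted topics to topics after a reduction step (assumed values)
sorted_to_reduced = {-1: -1, 0: 0, 1: 0, 2: 1}

# Compose the two steps to map raw labels directly to the reduced topics
raw_to_reduced = {raw: sorted_to_reduced[s] for raw, s in raw_to_sorted.items()}

raw_labels = [2, 0, 1, -1]

# Correct: raw labels looked up in the raw -> reduced mapping
from_raw = [raw_to_reduced[label] for label in raw_labels]

# Incorrect: raw labels looked up in a mapping that expects sorted topics
mismatched = [sorted_to_reduced[label] for label in raw_labels]

print(from_raw)
print(mismatched)
```

Both lookups succeed without an error, but the second one silently assigns documents to the wrong topics, so it is worth checking which labels a mapping expects before indexing into it.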
Is it reliable to use .labels_ as indices into the mapping to get the new assignments?
In practice, to get the correct labels, you would have to perform the following:
topics = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
Thanks! I'm +1 for something like .get_document_topic_labels(original_labels=False) in the future if you are so inclined.
You can now get the topics of the most recent fit with .topics_, so this issue will be closed. If, however, you have any other questions regarding this, please let me know.