
How to get the respective topics, the name of each topic, the top n words of each topic, and other data for new docs on which `transform()` is used?

yugkha3 opened this issue 1 year ago • 7 comments

Suppose I trained the model first and got the topics, representative docs, etc. of the training docs using .get_document_info():

from bertopic import BERTopic

topic_model = BERTopic(vectorizer_model=vectorizer_model, hdbscan_model=hdbscan_model, embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_document_info(docs))  # get_document_info expects the training docs as an argument

and now I am predicting topics over new_docs:

new_topics, new_probs = topic_model.transform(new_docs)

Now how do I find out which new_doc falls into which new_topic? How can I generate a list/DataFrame, just like .get_document_info(), for the new docs and their new topics?

yugkha3 avatar Feb 22 '24 15:02 yugkha3

now how do I get the information which new_doc falls in which new_topic?

You already have that information, since every topic in new_topics relates to a document in new_docs. They are ordered, so you can simply match them yourself and construct whatever dataframe you need, using BERTopic's internal attributes/functions to get the information you want.
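For illustration, a minimal sketch of how such a dataframe could be built, assuming pandas is available and reusing topic_model, new_docs, new_topics, and new_probs from above (new_doc_info and new_topic_info are hypothetical variable names, not BERTopic API):

import pandas as pd

# one row per topic: id, count, name, representation, ...
new_topic_info = topic_model.get_topic_info().set_index("Topic")

# align each new document with its predicted topic (and probability, if returned)
new_doc_info = pd.DataFrame({
    "Document": new_docs,
    "Topic": new_topics,
    "Probability": new_probs,
})

# look up the topic name and top n words for each assigned topic
new_doc_info["Name"] = new_doc_info["Topic"].map(new_topic_info["Name"])
new_doc_info["Top_n_words"] = new_doc_info["Topic"].map(
    lambda t: [word for word, _ in topic_model.topic_representations_.get(t, [])]
)

print(new_doc_info.head())

Depending on your configuration, new_probs may be a 1-D array of per-document probabilities or a 2-D topic-probability matrix; adjust the "Probability" column accordingly.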

MaartenGr avatar Feb 23 '24 09:02 MaartenGr

I've already got new_topics and new_probs using topic_model.fit_transform(new_docs), but I wonder whether there are new top N words for each topic with regard to new_docs. If yes, how can I access those new top N words?

LeongVan avatar Apr 30 '24 02:04 LeongVan

@LeongVan When you train BERTopic, you should run either .fit or .fit_transform. Each time you use that function, it will train the model from scratch and will overwrite any previous topics you have created. That's generally inherent to any .fit-like function.

So, when you use .fit_transform(new_docs), you are training a completely new topic model. To get the top n words, you can run the same functions as you would generally do. For instance, .get_topic_info or any of the attributes, like .topic_representations_.
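As a rough sketch of what that looks like, assuming new_model is the model you fitted on new_docs:

# after new_model.fit_transform(new_docs)
print(new_model.get_topic_info())        # one row per topic, including its name
print(new_model.get_topic(0))            # top n (word, c-TF-IDF score) pairs for topic 0
print(new_model.topic_representations_)  # dict mapping topic id -> top n words with scores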

MaartenGr avatar Apr 30 '24 08:04 MaartenGr

@MaartenGr Thanks. Now I've trained my topic model on the training dataset and run inference on new_docs (my test dataset) using topic_model.transform(new_docs). In this case, the top N keywords are the same as in the training stage, is that right? I wonder whether the inference stage will generate new top N keywords for each topic.

LeongVan avatar Apr 30 '24 08:04 LeongVan

@LeongVan Yes, the top n keywords will remain the same if you are performing inference with .transform. If you want updated keywords, you would have to look at either training two separate models and merging them using .merge_models, or using online learning instead (I suggest merge_models). Either way, if you purely want to perform inference, then it should never update the model.
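A hedged sketch of the .merge_models route, assuming two document collections docs_a and docs_b (hypothetical names):

from bertopic import BERTopic

# train two models independently on the two collections
topic_model_a = BERTopic().fit(docs_a)
topic_model_b = BERTopic().fit(docs_b)

# merge them: topics from the second model that are sufficiently different
# from those in the first are added as new topics
merged_model = BERTopic.merge_models([topic_model_a, topic_model_b])
print(merged_model.get_topic_info())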

MaartenGr avatar Apr 30 '24 09:04 MaartenGr

@MaartenGr Thanks! My training dataset is too large to train on at once, so I trained my topic model on half of it. If I train two topic models on two parts of the training dataset separately (50% of the dataset each) and then use .merge_models to generate a merged topic model, does this training strategy theoretically work? And does it work the same (almost the same is acceptable) as training on the whole dataset at once?

LeongVan avatar Apr 30 '24 09:04 LeongVan

@LeongVan It is similar to training on the whole dataset, but that depends on the size and contents of your dataset and the split you make. If you are interested in global, large, and abstract topics, then the results are generally the same.

MaartenGr avatar May 04 '24 11:05 MaartenGr