
How do I get the topic of each document in the data

Open kaifeijidezxb opened this issue 2 years ago • 7 comments

Hello developers! Recently, I have been using BERTopic to work with some data on food safety. I particularly want to know how to get the topic corresponding to each document in this data set.

Do you have any idea how to do that? Thank you.

kaifeijidezxb avatar Jul 23 '22 17:07 kaifeijidezxb

It's not exactly clear what you are asking, but when you call BERTopic.fit_transform() it returns two values.

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)

Here, topics is an array of length n, where n is the number of documents passed in with docs, so each doc is assigned a topic. probs is returned if calculate_probabilities is set to True, which will also significantly increase processing time.

drob-xx avatar Jul 23 '22 20:07 drob-xx

Sorry, I phrased my question poorly. You said "so each doc is assigned a topic." I want to know how to find out which topic each doc belongs to.

For example:

"They call it the Space Launch System, or SLS, and it's a colossus. " belongs to topic 1

kaifeijidezxb avatar Jul 24 '22 06:07 kaifeijidezxb


.get_representative_docs() returns the representative docs per topic, but I want to know how to get all the docs for each topic. Please help.

kaifeijidezxb avatar Jul 24 '22 08:07 kaifeijidezxb

@kaifeijidezxb The returned value from fit_transform as I indicated will give you a per-document topic assignment - that sounds like it is what you want.

drob-xx avatar Jul 24 '22 15:07 drob-xx

Sorry, I'm a bit stupid, but could you please show me which code to use to get the returned value? Thank you! @drob-xx

Something like .get_probs?

kaifeijidezxb avatar Jul 24 '22 16:07 kaifeijidezxb

@kaifeijidezxb Definitely not a stupid question and this is something more people struggle with!

The package follows, to a certain extent, sklearn's API in that whenever you use transform on a set of documents, it will return the topics in the same order. Let's say you have the following code:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

Here, docs is a list of documents on which you train the model. Running .fit_transform(docs) returns the variable topics, which contains the topic assigned to each document. The topic in topics[0] corresponds to the document in docs[0], topics[1] to docs[1], etc.
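As a minimal sketch of that index alignment, the snippet below groups documents by topic using plain Python. The docs and topics values here are made-up toy data standing in for a real corpus and for the first value returned by fit_transform, and the name docs_per_topic is only illustrative.

```python
from collections import defaultdict

# Toy stand-ins (assumptions): in practice `docs` is your corpus and
# `topics` is the first value returned by topic_model.fit_transform(docs).
docs = [
    "They call it the Space Launch System, or SLS, and it's a colossus.",
    "The rocket's first launch was delayed again.",
    "The recall covers contaminated lettuce sold last week.",
    "Inspectors found salmonella at the processing plant.",
]
topics = [1, 1, 0, 0]  # made-up assignments, one per document

# topics[i] belongs to docs[i], so zip pairs each document with its topic.
docs_per_topic = defaultdict(list)
for doc, topic in zip(docs, topics):
    docs_per_topic[topic].append(doc)

# Print every document under each topic.
for topic, members in sorted(docs_per_topic.items()):
    print(f"Topic {topic}: {len(members)} document(s)")
    for doc in members:
        print("   ", doc)
```

With real output from the model, the same loop should work unchanged as long as docs and topics keep the order in which they were passed to and returned from fit_transform.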

You can use that structure to extract the documents under a certain topic by using, for example, the following:

import pandas as pd
results = pd.DataFrame({"Doc": docs, "Topic": topics})

The results variable should contain everything you are looking for 😄
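For instance, one way to pull out every document for a single topic from such a DataFrame is sketched below. The docs and topics values are made-up toy data rather than real model output, and the names topic_1_docs and topic_sizes are only illustrative. Note that BERTopic labels outlier documents with topic -1.

```python
import pandas as pd

# Toy stand-ins (assumptions): real values come from your corpus and
# from topic_model.fit_transform(docs).
docs = [
    "doc about rockets",
    "doc about food recalls",
    "another rocket doc",
    "unrelated text",
]
topics = [1, 0, 1, -1]  # made-up; -1 marks an outlier document

results = pd.DataFrame({"Doc": docs, "Topic": topics})

# All documents assigned to topic 1, in their original order.
topic_1_docs = results.loc[results["Topic"] == 1, "Doc"].tolist()

# Number of documents per topic.
topic_sizes = results["Topic"].value_counts().to_dict()

print(topic_1_docs)
print(topic_sizes)
```

Changing the 1 in the filter to any other topic number selects the documents for that topic instead.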

MaartenGr avatar Jul 24 '22 16:07 MaartenGr

@drob-xx @MaartenGr Amazing! I solved the problem! Thank you both so much!!

kaifeijidezxb avatar Jul 24 '22 18:07 kaifeijidezxb