BERTopic
How do I get the topic of each document in the data
Hello developers! I have recently been using BERTopic to work with some food safety data. In particular, I want to know how to get the topic corresponding to each document in this data set.
Do you have any idea how to do that? Thank you.
It's not exactly clear what you are asking, but when you call BERTopic.fit_transform(), it returns two arrays:
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
topics in this case is an array of length n, where n is the number of documents passed in with docs, so each doc is assigned a topic. probs is returned if calculate_probabilities is set to True, which will also significantly increase processing time.
Sorry, I expressed my question poorly. Regarding "so each doc is assigned a topic": I want to know how to find out which topic each doc belongs to, such as:
"They call it the Space Launch System, or SLS, and it's a colossus. " belongs to topic 1
.get_representative_docs() can get the representative docs per topic, but I want to know how to get all the docs per topic. Please help!
@kaifeijidezxb The returned value from fit_transform as I indicated will give you a per-document topic assignment - that sounds like it is what you want.
Sorry, I'm a bit stupid, but could you please show me which code to use to get the returned value? Something like .get_probs? Thank you! @drob-xx
@kaifeijidezxb Definitely not a stupid question and this is something more people struggle with!
The package follows, to a certain extent, sklearn's API in that whenever you use transform on a set of documents, it will return the topics in the same order. Let's say you have the following code:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
Here, docs is a list of documents on which you train the model. Running .fit_transform(docs) will return the variable topics. In topics, you will find the topic that belongs to each document. The topic in topics[0] corresponds to the document in docs[0], topics[1] to docs[1], etc.
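To make the index alignment concrete, here is a minimal sketch that pairs each document with its topic by position. The docs and topics values below are hypothetical stand-ins so the snippet runs without training a model; in practice topics would come from topic_model.fit_transform(docs).

```python
# Toy stand-ins (assumed values, not real model output).
docs = [
    "They call it the Space Launch System, or SLS, and it's a colossus.",
    "The goalie stopped every shot in the third period.",
    "NASA delayed the rocket launch again.",
]
topics = [1, 0, 1]  # one hypothetical topic id per document, in the same order

# Pair each document with its topic by position: docs[i] <-> topics[i].
doc_topic_pairs = list(zip(docs, topics))

for doc, topic in doc_topic_pairs:
    print(f"topic {topic}: {doc}")
```

Because the two lists share the same order, zip is enough; no extra lookup method on the model is needed.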
You can use that structure to extract the documents under a certain topic by using, for example, the following:
import pandas as pd
results = pd.DataFrame({"Doc": docs, "Topic": topics})
The results variable should contain everything you are looking for 😄
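As a rough sketch of how that DataFrame could answer the "all docs per topic" question, the snippet below filters one topic and then groups every topic into a list of documents. The docs and topics values are made-up placeholders (with -1 used for the outlier topic that BERTopic assigns), and the names topic_1_docs and docs_per_topic are only illustrative.

```python
import pandas as pd

# Toy stand-ins (assumed values); normally these come from fit_transform.
docs = [
    "They call it the Space Launch System, or SLS, and it's a colossus.",
    "The goalie stopped every shot in the third period.",
    "NASA delayed the rocket launch again.",
    "asdf qwerty",
]
topics = [1, 0, 1, -1]  # -1 marks a document treated as an outlier

results = pd.DataFrame({"Doc": docs, "Topic": topics})

# All documents assigned to a single topic (here: topic 1).
topic_1_docs = results.loc[results["Topic"] == 1, "Doc"].tolist()

# Every topic mapped to the list of documents assigned to it.
docs_per_topic = results.groupby("Topic")["Doc"].apply(list).to_dict()

print(topic_1_docs)
print(docs_per_topic)
```

The groupby version is convenient when you want every topic at once, while the boolean filter is simpler for inspecting one topic.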
@drob-xx @MaartenGr Amazing! I solved the problem! Thank you both so much!!