BERTopic
Get representative doc per Topic with other columns like rating, date of document
Hey Maarten, I was running BERTopic on user reviews of an app. My goal is to perform sentiment analysis on reviews per topic. I managed to get the topics, but now I need to print the reviews per topic along with their sentiment label (1 or 0). topic_model.get_representative_docs() only prints the reviews with their topic. Is there a way to keep other columns, like sentiment label and star rating, so I can perform sentiment analysis per topic?
The package follows, to a certain extent, sklearn's API in that whenever you use transform on a set of documents, it will return the topics in the same order. Let's say you have the following code:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
Here, docs is the list of documents on which you train the model. Running .fit_transform(docs) returns the variable topics, which holds the topic assigned to each document. The topic in topics[0] corresponds to the document in docs[0], topics[1] to docs[1], etc.
You can use that structure to extract the documents under a certain topic by using, for example, the following:
import pandas as pd
results = pd.DataFrame({"Doc": docs, "Topic": topics})
The results variable can then be extended with whatever metadata you have, like sentiment label and star rating.
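As a rough sketch of that last step, the snippet below adds two metadata columns to the same kind of DataFrame and summarises them per topic. The toy docs and topics lists stand in for the real input and output of fit_transform, and the sentiment and stars lists, like the column names "Sentiment" and "Stars", are hypothetical placeholders for whatever metadata you have. It only works if every list is in the same order as docs.

```python
import pandas as pd

# Toy stand-ins: in practice `docs` is the list passed to fit_transform
# and `topics` is the list it returned.
docs = [
    "Love the new design",
    "Crashes on startup",
    "Crashes after the update",
    "Great look and feel",
]
topics = [0, 1, 1, 0]

# Hypothetical metadata, one entry per document, in the same order as `docs`.
sentiment = [1, 0, 0, 1]
stars = [5, 1, 2, 4]

results = pd.DataFrame(
    {"Doc": docs, "Topic": topics, "Sentiment": sentiment, "Stars": stars}
)

# All reviews assigned to topic 1, together with their metadata.
topic_1 = results[results["Topic"] == 1]

# Per-topic summary: number of reviews, share of positive labels, mean rating.
summary = results.groupby("Topic").agg(
    n_docs=("Doc", "size"),
    share_positive=("Sentiment", "mean"),
    mean_stars=("Stars", "mean"),
)
print(summary)
```

If the reviews already live in a DataFrame whose rows are in the same order as docs, assigning the topics list as a new column on that DataFrame should give the same result without rebuilding it.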
Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!