
Get representative doc per Topic with other columns like rating, date of document

Open amrityap opened this issue 2 years ago • 1 comments

Hey Maarten, I was running BERTopic on user reviews of an app. My goal is to perform sentiment analysis on the reviews per topic. I managed to get the topics, but now I need to print the reviews per topic along with their sentiment label (1 or 0). topic_model.get_representative_docs() only returns the reviews with their topic. Is there a way to keep other columns, like sentiment label and star rating, so I can perform sentiment analysis per topic?

amrityap avatar Aug 01 '22 15:08 amrityap

The package follows, to a certain extent, sklearn's API in that whenever you use transform on a set of documents, it will return the topics in the same order as the input documents. Let's say you have the following code:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

Here, docs is a list of documents on which you train the model. Running .fit_transform(docs) will return the variable topics, which holds the topic assigned to each document. The topic in topics[0] corresponds to the document in docs[0], topics[1] to docs[1], etc.
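To make the index alignment concrete, here is a minimal sketch in plain Python. The lists are toy stand-ins rather than real model output: in practice topics would come from fit_transform, and ratings is a hypothetical metadata list assumed to be in the same order as docs. The helper name rows_for_topic is made up for illustration.

```python
# Toy stand-ins: in practice `topics` comes from topic_model.fit_transform(docs).
docs = ["great app", "crashes on login", "love the design", "login fails"]
topics = [0, 1, 0, 1]
# Hypothetical metadata, assumed to be in the same order as `docs`.
ratings = [5, 1, 4, 2]


def rows_for_topic(topic_id):
    """Return (document, rating) pairs for every document assigned to topic_id."""
    return [
        (doc, rating)
        for doc, topic, rating in zip(docs, topics, ratings)
        if topic == topic_id
    ]


login_rows = rows_for_topic(1)
print(login_rows)
```

Because position i refers to the same document in every list, any per-document metadata can be carried along this way, as long as its order is never changed independently of docs.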

You can use that structure to extract the documents under a certain topic by using, for example, the following:

import pandas as pd
results = pd.DataFrame({"Doc": docs, "Topic": topics})

The results variable can then be extended with whatever metadata you have, like sentiment label and star rating.
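As a rough sketch of what that extension could look like, the example below adds two metadata columns and summarizes them per topic. The data is made up for illustration, and the column names Sentiment and Stars are assumptions; the only requirement is that each metadata list is in the same order as docs.

```python
import pandas as pd

# Toy stand-ins: in practice `topics` comes from topic_model.fit_transform(docs).
docs = ["great app", "crashes on login", "love the design", "login fails", "ok"]
topics = [0, 1, 0, 1, -1]  # BERTopic uses -1 for outlier documents
# Hypothetical metadata, assumed to be in the same order as `docs`.
sentiment = [1, 0, 1, 0, 1]
stars = [5, 1, 4, 2, 3]

results = pd.DataFrame({"Doc": docs, "Topic": topics})
results["Sentiment"] = sentiment
results["Stars"] = stars

# Per-topic summary: review count, share of positive labels, mean star rating.
per_topic = results.groupby("Topic").agg(
    n_reviews=("Doc", "count"),
    share_positive=("Sentiment", "mean"),
    mean_stars=("Stars", "mean"),
)
print(per_topic)

# All reviews assigned to one topic, with their metadata columns intact.
topic_1_reviews = results[results["Topic"] == 1]
print(topic_1_reviews)
```

If the metadata already lives in a DataFrame with one row per review, assigning the topics as a new column on that DataFrame (for example df["Topic"] = topics, where df is your own frame) should work the same way, provided the documents passed to the model were taken from it in row order.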

MaartenGr avatar Aug 02 '22 05:08 MaartenGr

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!

MaartenGr avatar Sep 27 '22 08:09 MaartenGr