Issue with supervised topic modeling approach to predict new documents
Hi Maarten,
First of all, I absolutely love the libraries you have created. We use not only BERTopic but also KeyBERT extensively in our projects, and love all the features, flexibility, and tuning parameters they offer.
I have an issue where, after training a BERTopic model on a supervised dataset, the results it produces on new data contain a lot of noise. This might be slightly similar to #354 and #370, but it is not exactly the same. As you mentioned in #370 and #482, I get the same topics on the same training dataset after re-running the model with the UMAP random_state set to 42, which is good and what we expect.
However, when I train the model on a training dataset (supervised training using a labelled dataset) and use it to predict on a new dataset, almost 60% of the topics come out as -1 (noise). Both datasets (training and new) are from the same domain, i.e. https://www.consumerfinance.gov/data-research/consumer-complaints, where the training and new datasets cover the same product and issue, just over different time ranges.
Below is the code I use for training and prediction.
Training the model (similar to https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html):
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
import pickle
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
training_dataset = pd.read_csv('complaints_2020.csv')
################## Data Preparation ##################
##Remove rows which have the text data as N/A
training_dataset = training_dataset.dropna(subset=['Consumer complaint narrative'])
##Transform training text to list
training_docs = training_dataset['Consumer complaint narrative'].to_list()
##Create categorical labels for topic training
training_dataset['Issue'] = pd.Categorical(training_dataset['Issue'])
##Transform your output to numeric
training_dataset['label'] = training_dataset['Issue'].cat.codes
##Transform category labels to list
categories = training_dataset['label'].to_list()
################## Topic modelling ##################
##Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(training_docs, show_progress_bar=True)
##Set the random state in the UMAP model to prevent stochastic behavior
umap_model = UMAP(min_dist=0.0, metric='cosine', random_state=42)
##Train model
base_topic_model = BERTopic(umap_model=umap_model, verbose=True)
topics, probs = base_topic_model.fit_transform(training_docs, embeddings, y=categories)
##Transform same documents and check if the predicted topics are the same
new_topics, new_probs = base_topic_model.transform(training_docs, embeddings)
assert topics == new_topics
##Save and load model
base_topic_model.save("BERT-Topic_Trained_Model/bert-topic-classification-model")
trained_topic_model = BERTopic.load("BERT-Topic_Trained_Model/bert-topic-classification-model")
new_topics, new_probs = trained_topic_model.transform(training_docs, embeddings)
pickle.dump(topics, open("topics.pickle", "wb"))
assert topics == new_topics
Restart the notebook instance to clear all variables.
Using the trained model for prediction (no labelled data):
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
import pickle
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
predict_dataset = pd.read_csv('complaints_2021.csv')
################## Data Preparation ##################
##Remove rows which have the text data as N/A
predict_dataset = predict_dataset.dropna(subset=['Consumer complaint narrative'])
##Transform prediction text to list
predict_docs = predict_dataset['Consumer complaint narrative'].to_list()
################## Topic modelling ##################
##Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(predict_docs, show_progress_bar=True)
##Load saved instances
trained_topic_model = BERTopic.load("BERT-Topic_Trained_Model/bert-topic-classification-model")
topics = pickle.load(open("topics.pickle", "rb"))
##Extract new topics and compare with the previously saved topics
predicted_topics, predicted_probs = trained_topic_model.transform(predict_docs, embeddings)
This time around, the predictions result in >60% of the documents being assigned -1, and only a few match the trained topics. I tried tuning HDBSCAN and calculate_probabilities as well but could not reduce it by much. When I manually go through the -1 documents from the new data, I can match them to topics identified from the training dataset. Do you have any recommendations for better converging the model to the pre-defined topics?
Update 1 - During prediction, once the model is loaded, I don't know whether the HDBSCAN parameters or calculate_probabilities can still be applied. During training, with the code above, I hardly get any -1.
Per #354, I also tried
predicted_topics, predicted_probs = trained_topic_model.fit_transform(predict_docs, embeddings, y=topics)
but I get a length-mismatch error. That makes sense, since the training dataset size differs from the new, to-be-predicted dataset.
Note: I may not be able to use the guided topic modeling approach here, since the collections of words that contribute to a particular topic are not the unigrams that approach expects. They are more like "never received", "false promise", "not as expected", and so on. To add to the complexity of this bag of words, the phrases may not even appear in the same sequence. They are generated with KeyBERT, though :).
Thank you for your kind words! Glad to hear that the libraries are helpful to you.
This is indeed troublesome. I am not entirely sure, but I do believe some of it has to do with how HDBSCAN uses its approximate_predict to generate predictions. In some cases, it may differ from what it was trained on, and it might even be the case that the amount of data you pass to that function influences individual predictions, which makes individual data points dependent on one another.
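For illustration (not part of the original reply), this is roughly the inference path in question, reusing the variable names from the snippets above and assuming BERTopic's public umap_model and hdbscan_model attributes with the default HDBSCAN clusterer:
import hdbscan
##Rough sketch of what transform does internally with HDBSCAN: reduce the
##new embeddings with the fitted UMAP model, then assign clusters with
##approximate_predict; points falling outside the trained density regions
##are labelled -1
umap_embeddings = trained_topic_model.umap_model.transform(embeddings)
labels, strengths = hdbscan.approximate_predict(trained_topic_model.hdbscan_model, umap_embeddings)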
I think the most straightforward thing you can do is simply replace HDBSCAN with something that works more closely with your use case, like k-Means. It is definitely not ideal, but it may help you make sure that inference more closely matches the training procedure.
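A minimal sketch of that swap, along the lines of BERTopic's clustering documentation (n_clusters=50 is an illustrative assumption, not a recommendation):
from sklearn.cluster import KMeans
from bertopic import BERTopic
##k-Means assigns every document to a cluster, so no -1 (outlier) topic can
##appear at inference time; the number of clusters must be chosen upfront
cluster_model = KMeans(n_clusters=50, random_state=42)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model, verbose=True)
topics, probs = topic_model.fit_transform(training_docs, embeddings, y=categories)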
Thank you for your reply, Maarten. Another approach I tried was to use probabilities during training so that, at prediction time, I can use them to find the closest match instead of -1 (see the sketch after the snippet below).
# Define model
base_topic_model = BERTopic(umap_model=umap_model, verbose=True, calculate_probabilities=True)
# Fit and Train model
topics, probs = base_topic_model.fit_transform(training_docs, embeddings, y=categories)
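A sketch of the prediction side of that idea (the 0.05 threshold is an illustrative assumption to tune for your data):
import numpy as np
##With calculate_probabilities=True, transform returns an
##(n_documents, n_topics) probability matrix alongside the predictions
predicted_topics, predicted_probs = trained_topic_model.transform(predict_docs, embeddings)
##Map -1 predictions to their most probable topic when that probability
##clears a minimum threshold; keep -1 otherwise
threshold = 0.05
closest_topics = [
    int(np.argmax(doc_probs)) if topic == -1 and doc_probs.max() >= threshold else topic
    for topic, doc_probs in zip(predicted_topics, predicted_probs)
]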
Then I am trying to merge similar topics and their corresponding documents using merge_topics, provide custom labels to those grouped topics using set_topic_labels, and then save, load, and predict. However, with this approach I run into another problem, related to #632: the merged topics do not update the actual topics identified by fit_transform. Is there a workaround to update the topics identified by fit_transform by merging them, just the way reduce_topics works?
# Further reduce topics
new_topics, new_probs = topic_model.reduce_topics(docs, topics, nr_topics=30)
Something like the snippet below; then maybe I can rename the topics afterwards and use them further:
new_topics, new_probs = topic_model.merge_topics(docs, topics, mg_topics=topics_to_merge)
@kkadu There is currently an issue with merge_topics in that it does not fully update the topics in the model. A fix is coming, but it might take a while. In the meantime, reduce_topics should work.
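A rough sketch of that interim route, using the reduce_topics signature quoted above (the label names are placeholders):
##Reduce to a fixed number of topics, then attach custom display labels
##to the reduced topics with set_topic_labels before saving
new_topics, new_probs = base_topic_model.reduce_topics(training_docs, topics, nr_topics=30)
base_topic_model.set_topic_labels({0: "Never received", 1: "False promise"})
base_topic_model.save("BERT-Topic_Trained_Model/bert-topic-classification-model")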
Since .merge_topics should be working correctly now, this issue will be closed. However, if you still run into any issues, please let me know!