BERTopic
problems with merge_topics
I created the following dataframe from the model output
topics, probs = model.fit_transform(vic_msg)
topic_df = model.get_topic_info()
Then I created a second dataframe containing my messages and the topic from the model output. When a message is assigned to topic -1, I assign it the topic with the highest probability instead.
# create dataframe with topics
df = pd.DataFrame({'topic': topics, 'message': vic_msg})
df['topic_assigned'] = " "
for i, row in df.iterrows():
    if row.topic == -1:
        # outlier: fall back to the index of the most probable topic
        df.at[i, 'topic_assigned'] = np.where(probs[i] == probs[i].max())[0][0]
    else:
        df.at[i, 'topic_assigned'] = row.topic
df = df.merge(topic_df[['Topic', 'Name']], how='left', left_on='topic_assigned', right_on='Topic')
df.rename(columns = {'Name':'topic_keywords'}, inplace = True)
df = df[['topic','topic_assigned', 'topic_keywords', 'message']]
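For reference, the same fallback can be written without a row-by-row loop. This is only a minimal sketch on made-up data: `assign_topics` is a hypothetical helper name, and it assumes `probs` is a 2D array with one row per message and one column per topic, as the `probs[i].max()` call above implies.

```python
import numpy as np

def assign_topics(topics, probs):
    """Replace outlier labels (-1) with the index of the highest-probability topic."""
    topics = np.asarray(topics)
    probs = np.asarray(probs)
    # Column index of the most probable topic for each row.
    best = probs.argmax(axis=1)
    # Keep the original label unless the message was marked as an outlier.
    return np.where(topics == -1, best, topics)

# Made-up data: three messages, three topics.
toy_topics = [0, -1, 2]
toy_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.6, 0.3],
             [0.2, 0.1, 0.7]]
assigned = assign_topics(toy_topics, toy_probs)
print(assigned.tolist())
```

Like the `np.where(...)[0][0]` expression in the loop, `argmax` picks the first column when several share the maximum.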
The df above worked perfectly until I decided to merge some topics as follows:
topics_to_merge = [[141,142],[143,144]]
model.merge_topics(vic_msg, topics, topics_to_merge)
When I then rebuild df, some messages are still assigned to topics that were removed by the merge, even though topic_df correctly shows the newly merged topics.
Say message[1] was allocated to topic 141. Before the merge, probs[1] or model.visualize_distribution(probs[1]) showed some values, but not after merging. I've reduced 140 topics to 115 topics, so any messages previously assigned to topics > 115 now have no topic to map to.
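To illustrate why stale ids are left behind, here is a minimal sketch of how per-message topic ids could be remapped by hand after a merge. It is not BERTopic's internal logic: `build_merge_mapping` is a hypothetical helper, and pointing each group at its first member is an assumption. It also does not reproduce any renumbering of topic ids that the library may apply after merging.

```python
def build_merge_mapping(topic_ids, topics_to_merge):
    """Map every topic id to itself, then point each merged group at its first member."""
    mapping = {t: t for t in topic_ids}
    for group in topics_to_merge:
        target = group[0]  # assumption: the first id in the group survives
        for t in group:
            mapping[t] = target
    return mapping

# Made-up per-message topics from before the merge.
old_topics = [141, 142, 143, 144, 7, -1]
mapping = build_merge_mapping(set(old_topics), [[141, 142], [143, 144]])
new_topics = [mapping[t] for t in old_topics]
print(new_topics)
```

Any list of topics captured before the merge (such as `topics` above) keeps the old ids until a mapping like this is applied to it.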
When I run len(probs[1]), the size is still about 141 topics, which suggests probs was not updated by the merge. But if I do the following, I get an error:
topics_merge, probs_merge = model.merge_topics(vic_msg, topics, topics_to_merge)
TypeError: cannot unpack non-iterable NoneType object
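The error message indicates that the call returned None, so there was nothing to unpack into two names. A minimal sketch of that behaviour, using a hypothetical `ToyModel` class that only stands in for a model whose merge method works in place (it is not BERTopic):

```python
class ToyModel:
    """Hypothetical stand-in, not BERTopic: merges topics in place and returns None."""

    def __init__(self, topics):
        self.topics = list(topics)

    def merge_topics(self, topics_to_merge):
        for group in topics_to_merge:
            self.topics = [group[0] if t in group else t for t in self.topics]
        # No return statement, so the call evaluates to None.

toy = ToyModel([141, 142, 7])
result = toy.merge_topics([[141, 142]])

error = None
try:
    a, b = result  # unpacking None raises TypeError
except TypeError as exc:
    error = str(exc)

print(result, toy.topics, error)
```

In this toy case the updated state has to be read from the object itself rather than from the return value.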
Do you have any idea what is happening here, @MaartenGr?
Fixed it after running the code below, which I found in another discussion. Thanks!
topics = model._map_predictions(model.hdbscan_model.labels_)
probs = hdbscan.all_points_membership_vectors(model.hdbscan_model)
probs = model._map_probabilities(probs, original_topics=True)
Hi @MaartenGr, it turns out I'm still having issues with this. After executing the above commands, I realized that the representative docs are not assigned correctly to the new topics after merging. I'm still confused about how to assign the merged topics to the documents. Any help is appreciated.
@iamsha5q There is indeed currently a bug in merge_topics. It will be fixed in the next release, but there will be some significant changes to the internal structure, so a quick fix will come with a new full release as a PR will not cover it entirely.
Having said that, I believe you can fix it by running the following:
self._map_representative_docs()
updated_probs = self._map_probabilities(probs)
There is already quite some code for the new release, so I am hoping to get a PR in the coming weeks so that you can already use the fix.
Thanks Maarten, I might just wait for the next release then. Even after running _map_representative_docs(), it's still not mapped properly.
With the new release, this should be fixed! However, if you still run into any issues, please let me know.