BERTopic

problems with merge_topics

Open iamsha5q opened this issue 2 years ago • 4 comments

I created the following dataframe from the model output:

topics, probs = model.fit_transform(vic_msg)
topic_df = model.get_topic_info()

Then I created another dataframe that consists of my messages and the topic from the model output, and that assigns the highest-probability topic whenever a message is assigned to topic -1.

import numpy as np
import pandas as pd

# create dataframe with topics
df = pd.DataFrame({'topic': topics, 'message': vic_msg})
# start from the model's topic, then replace outliers (-1) with the most probable topic
df['topic_assigned'] = df['topic']
for i, row in df.iterrows():
    if row.topic == -1:
        df.at[i, 'topic_assigned'] = int(np.argmax(probs[i]))
# attach the topic name (keywords) of the assigned topic
df = df.merge(topic_df[['Topic', 'Name']], how='left', left_on='topic_assigned', right_on='Topic')
df = df.rename(columns={'Name': 'topic_keywords'})
df = df[['topic', 'topic_assigned', 'topic_keywords', 'message']]

The df above worked perfectly until I decided to merge some topics as follows:

topics_to_merge = [[141,142],[143,144]]
model.merge_topics(vic_msg, topics, topics_to_merge)

When I then rebuild df, some messages are still assigned to topics that were removed by the merge. But when I rebuild topic_df, it correctly shows the newly merged topics.

Say message[1] was allocated to topic 141. Before the merge, probs[1] or model.visualize_distribution(probs[1]) showed some values, but not after merging. I've reduced 140 topics to 115 topics, so any message previously assigned to a topic above 115 now has no topic to map to.
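One way to confirm this symptom is to compare the topic assigned to each message against the topic IDs that still exist after merging. The following is a minimal sketch in plain Python, not part of BERTopic: `find_stale_assignments` is a hypothetical helper, and the made-up lists stand in for `df['topic_assigned']` and `topic_df['Topic']`.

```python
def find_stale_assignments(assigned_topics, current_topic_ids):
    """Return (row index, topic id) pairs whose topic id no longer exists."""
    valid = set(current_topic_ids)
    return [(i, t) for i, t in enumerate(assigned_topics) if t not in valid]


# Made-up data: the topic table now stops at 2,
# but two messages still point at the old topics 3 and 4.
assigned = [0, 3, 1, 4, 2]
current_ids = [-1, 0, 1, 2]

stale = find_stale_assignments(assigned, current_ids)
print(stale)
```

An empty result would mean every message maps to a topic that is still present; any entries are the rows that need to be reassigned.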

When I run len(probs[1]), the size is still about 141, which suggests that probs was not updated by the merge. But if I do the following, I get an error:

topics_merge, probs_merge = model.merge_topics(vic_msg, topics, topics_to_merge)
TypeError: cannot unpack non-iterable NoneType object
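The error message itself says the call returned `None`, which is what a method that updates its object in place typically returns. The sketch below reproduces that pattern with a stand-in class. `ToyModel` and its `topics` attribute are hypothetical and are not BERTopic's API; the relabeling rule (smallest ID in each group) is also just an assumption for illustration.

```python
class ToyModel:
    """Hypothetical stand-in for a model whose merge step works in place."""

    def __init__(self, topics):
        self.topics = list(topics)

    def merge_topics(self, groups):
        # Relabel every topic in a group to the group's smallest id.
        # Nothing is returned, so the call evaluates to None.
        mapping = {t: min(g) for g in groups for t in g}
        self.topics = [mapping.get(t, t) for t in self.topics]


toy = ToyModel([141, 142, 143, 144, 7])

try:
    # The merge itself runs; only the unpacking of None fails.
    merged, probs = toy.merge_topics([[141, 142], [143, 144]])
except TypeError as err:
    error_message = str(err)

# The result lives on the object, not in a return value.
merged = toy.topics
print(error_message)
print(merged)
```

With this pattern, the updated assignments have to be read back from the object after the call instead of being unpacked from it.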

Do you have any idea what is happening here, @MaartenGr?

iamsha5q, Jul 31 '22 11:07

Fixed it after running the code below, which I found in another discussion. Thanks!

import hdbscan

topics = model._map_predictions(model.hdbscan_model.labels_)
probs = hdbscan.all_points_membership_vectors(model.hdbscan_model)
probs = model._map_probabilities(probs, original_topics=True)
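To illustrate what remapping probabilities means here, the sketch below folds one document's old per-topic probabilities into a smaller, merged topic set by summing the columns that map to the same new topic. This is only one plausible rule, written in plain Python; `remap_probabilities`, the mapping dictionary, and the numbers are all made up, and whether the library does exactly this internally is an assumption.

```python
def remap_probabilities(doc_probs, mapping, n_new_topics):
    """Fold one document's old per-topic probabilities into the merged topics."""
    new_probs = [0.0] * n_new_topics
    for old_topic, prob in enumerate(doc_probs):
        new_topic = mapping[old_topic]
        if new_topic != -1:  # skip anything mapped to the outlier topic
            new_probs[new_topic] += prob
    return new_probs


# Made-up example: old topics 2 and 3 were merged into new topic 1,
# and old topic 1 was renumbered to 2.
mapping = {0: 0, 1: 2, 2: 1, 3: 1}
old = [0.5, 0.25, 0.125, 0.125]

new = remap_probabilities(old, mapping, n_new_topics=3)
print(len(old), len(new))
print(new)
```

After a step like this, the length of each probability vector matches the number of topics that remain, which is the mismatch that len(probs[1]) exposed earlier.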

iamsha5q, Jul 31 '22 12:07

Hi @MaartenGr, it turns out that I'm still having issues with this. After executing the commands above, I realized that the representative docs are not assigned correctly to the new topics after merging. I'm still confused about how to assign the merged topics to the documents. Any help is appreciated.

iamsha5q, Aug 09 '22 02:08

@iamsha5q There is indeed currently a bug in merge_topics. It will be fixed in the next release, but that release makes significant changes to the internal structure, so the fix will come with a new full release, as a quick-fix PR would not cover it entirely.

Having said that, I believe you can fix it by running the following:

model._map_representative_docs()
updated_probs = model._map_probabilities(probs)

There is already quite a bit of code for the new release, so I am hoping to open a PR in the coming weeks so that you can start using the fix.

MaartenGr, Aug 09 '22 07:08

Thanks Maarten, I might just wait for the next release then. Even after running _map_representative_docs(), the documents are still not mapped properly.

iamsha5q, Aug 10 '22 23:08

With the new release, this should be fixed! However, if you still run into any issues, please let me know.

MaartenGr, Sep 27 '22 08:09