BERTopic

Map cluster numbers to seed_topic_list clusters for guided clustering

Open • akshaykekuda opened this issue 2 years ago • 1 comment

Hi, I came across the guided topic modeling feature, which fits my use case of clustering sentences around cluster templates. Looking at the code, I see that the function _guided_topic_modeling does exactly this. I was wondering if we could have a feature where the topic assigned to each document corresponds to an item in seed_topic_list. Right now, the cluster numbers in the output appear to be assigned randomly. It would be good to have cluster numbers that correspond to the clusters from the seed topic list. If this feature already exists, how should I make use of it?

akshaykekuda, Jul 10 '22 10:07

That is currently not possible, as guided topic modeling only guides the modeling and does not replicate the seeded topics. In practice, this means that whenever you set a number of seeded topics, the resulting model will try to steer towards those topics. As a result, there is a higher chance that these seeded topics will be created, but there is no guarantee. Moreover, seeded topics also do not guarantee flat topics. For example, if you seed the model with an abstract topic such as health, it might find smaller topics about health-related things, such as cancer or fitness.

In other words, tracking a seeded topic can be difficult, as the seeds may not actually represent what is in the documents but merely steer the model towards those topics.

Having said that, you could use .find_topics to find the topics that most closely match the seeded topics.
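As a rough sketch of that suggestion, the helper below queries find_topics once per seed topic and records the closest learned topic. The helper name match_seeds_to_topics, the min_similarity cutoff, and the choice to join the seed words into a single search string are all assumptions made for illustration; they are not part of BERTopic. The only library call relied on is find_topics(search_term, top_n), which returns a list of topic ids and a list of similarity scores.

```python
from typing import Dict, List, Tuple


def match_seeds_to_topics(
    topic_model,
    seed_topic_list: List[List[str]],
    min_similarity: float = 0.0,
) -> Dict[int, Tuple[int, float]]:
    """Map each seed topic index to its closest learned topic.

    Hypothetical helper: `topic_model` is assumed to be a fitted model
    exposing find_topics(search_term, top_n) -> (topic_ids, similarities).
    """
    matches: Dict[int, Tuple[int, float]] = {}
    for seed_index, seed_words in enumerate(seed_topic_list):
        # Assumption: the seed words joined by spaces make a usable query.
        query = " ".join(seed_words)
        topics, similarities = topic_model.find_topics(query, top_n=1)
        # Skip seeds with no result or a match weaker than the cutoff.
        if topics and similarities[0] >= min_similarity:
            matches[seed_index] = (topics[0], float(similarities[0]))
    return matches


# Possible usage with a fitted model (not executed here):
# topic_model = BERTopic(seed_topic_list=seed_topic_list)
# topics, probs = topic_model.fit_transform(docs)
# seed_to_topic = match_seeds_to_topics(topic_model, seed_topic_list)
```

The mapping is only approximate. Several seeds can land on the same learned topic, a seed may have no good match at all, and the matched id may be the outlier topic (-1), so it is worth checking the similarity scores before relying on the result.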

MaartenGr, Jul 11 '22 07:07