BERTopic

Compare LDA, NMF, LSA with BERTopic (w/ embedding: all-MiniLM-L6-v2 + dim_red: UMAP + cluster: HDBSCAN)

Open abis330 opened this issue 9 months ago • 1 comments

Hi @MaartenGr ,

Given a dataset of texts, we want to extract topics using LDA, NMF, LSA and BERTopic (w/ embedding: all-MiniLM-L6-v2 + dim_red: UMAP + cluster: HDBSCAN).

To select the best algorithm for this dataset, our intuition is to optimize a mathematical combination of an applicable topic coherence measure and an applicable topic diversity measure. In one of the previous issues, #90, I observed that when calculating topic coherence, you treated the concatenation of the texts belonging to a cluster as a single document.
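To make the "combination" concrete, here is a minimal sketch of what I have in mind. It assumes topic diversity is the fraction of unique words among the top-k words of all topics, and it combines the two measures with a plain product. The product, the function names, and the toy topics are only illustrative assumptions, and the product only makes sense for a coherence measure bounded in [0, 1] (such as c_v), not for measures that can be negative.

```python
from typing import List


def topic_diversity(topics: List[List[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of all topics."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


def combined_score(coherence: float, diversity: float) -> float:
    """Hypothetical selection criterion: coherence times diversity.

    Assumes coherence lies in [0, 1]; other combinations (for example a
    weighted sum) are equally plausible.
    """
    return coherence * diversity


# Toy topics, written by hand purely for illustration.
topics_distinct = [["cat", "dog", "pet"], ["stock", "market", "trade"]]
topics_overlapping = [["cat", "dog", "pet"], ["cat", "dog", "vet"]]

diversity_distinct = topic_diversity(topics_distinct, top_k=3)
diversity_overlapping = topic_diversity(topics_overlapping, top_k=3)
```

With the same coherence value, the model with overlapping topics would score lower than the model with distinct topics.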

However, for calculating topic coherence for LDA, LSA and NMF, we simply build the BoW representation of the given texts and calculate topic coherence from that.

To the best of my understanding, shouldn't we ensure that the corpus and dictionary passed to initialize the CoherenceModel object from gensim.models.coherencemodel are the same for BERTopic and LSA/LDA/NMF, so that we can actually compare the topic coherence values achieved by all algorithms and then select the one with the highest topic coherence?
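To illustrate the point about a shared reference corpus, here is a small self-contained sketch. It is not gensim's implementation: it computes a simple document-level NPMI coherence in plain Python, and the function name, the toy documents, and the hand-written topic lists are all assumptions for illustration. What matters is that the topics of every model are scored against the same tokenized documents, so the resulting numbers are comparable.

```python
import math
from itertools import combinations
from typing import List


def npmi_coherence(topics: List[List[str]], tokenized_docs: List[List[str]]) -> float:
    """Mean pairwise NPMI of topic words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words: str) -> float:
        # Share of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic, 2):
            p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
            if p12 == 0:
                pair_scores.append(-1.0)  # the words never co-occur
                continue
            denominator = -math.log(p12)
            if denominator == 0:
                pair_scores.append(0.0)  # both words occur in every document
                continue
            pair_scores.append(math.log(p12 / (p1 * p2)) / denominator)
        if pair_scores:
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores)


# One shared tokenized corpus, used as the reference for every model.
shared_docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "trade"],
]

# Hand-written stand-ins for the top words each model might return.
topics_model_a = [["cat", "dog"], ["stock", "market"]]
topics_model_b = [["cat", "stock"], ["dog", "market"]]

score_a = npmi_coherence(topics_model_a, shared_docs)
score_b = npmi_coherence(topics_model_b, shared_docs)
```

The same idea would apply with gensim: pass the same `texts` and `dictionary` to `CoherenceModel` for every algorithm, and vary only the `topics` argument.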

Apologies for such a long description.

Thanks, Abi

abis330, May 23 '24 22:05