BERTopic

hdbscan metrics

Open The-Ineffable-Alias opened this issue 2 years ago • 3 comments

What metrics are a good choice for high-dimensional topic modeling? (I know cosine is one of them, at least for other use cases.)

The-Ineffable-Alias avatar Jul 06 '22 19:07 The-Ineffable-Alias

That highly depends on your specific use case. By high-dimensional topic modeling, do you mean the clustering task in HDBSCAN or the dimensionality reduction step in UMAP? Generally, cosine is used in UMAP as it is one of the metrics for which SentenceTransformers is optimized.

MaartenGr avatar Jul 07 '22 08:07 MaartenGr

with HDBSCAN*

The-Ineffable-Alias avatar Jul 10 '22 23:07 The-Ineffable-Alias

Although HDBSCAN can handle cosine distance (either through the metric="cosine" parameter or by L2-normalizing your data), not many distance metrics handle very high dimensionality well. Typically, reducing your data to a lower-dimensional space before clustering gives much more accurate results.
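
To illustrate why L2 normalization can stand in for cosine distance, here is a minimal sketch in plain Python (standard library only). The vectors and helper names are made up for illustration and are not part of HDBSCAN or BERTopic. It checks that, for unit-length vectors, the squared Euclidean distance equals twice the cosine distance, so a Euclidean-based clusterer run on normalized data ranks pairs in the same order as cosine distance would.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy "embeddings" with made-up values (assumption: real ones would
# come from an embedding model and have hundreds of dimensions).
a = [1.0, 2.0, 3.0]
b = [2.0, 0.5, 1.0]

cos_dist = cosine_distance(a, b)
euc_dist = euclidean_distance(l2_normalize(a), l2_normalize(b))

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
assert math.isclose(euc_dist ** 2, 2.0 * cos_dist, rel_tol=1e-9)
```

Because the relationship is monotonic, normalizing first and then using the Euclidean metric should give the same neighbor ordering as cosine distance. It does not address the high-dimensionality issue itself, which is why reducing dimensionality first is still the usual recommendation.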

MaartenGr avatar Jul 11 '22 08:07 MaartenGr