BERTopic
BERTopic copied to clipboard
hdbscan metrics
what metrics are probably a good idea for high-dimensional topic modeling? (i know cosine is one of them for other use cases at least)
That highly depends on your specific use case. By high-dimensional topic modeling, do you mean the clustering task in HDBSCAN or the dimensionality reduction step in UMAP? Generally, cosine is used in UMAP as it is one of the metrics for which SentenceTransformers is optimized.
with HDBSCAN*
Although HDBSCAN can handle cosine distance metrics (through either using the metric="cosine"
parameter or l2 normalizing your data), there are not that many distance metrics that can handle very high dimensionality well. Typically, bringing them down to a lower-dimensional space is much more accurate.