Maarten Grootendorst
That highly depends on your specific use case. By high-dimensional topic modeling, do you mean the clustering task in HDBSCAN or the dimensionality reduction step in UMAP? Generally, cosine is...
Although HDBSCAN can work with cosine distance (either through the `metric="cosine"` parameter or by L2-normalizing your data), there are not that many distance metrics that can handle very high...
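To illustrate why L2 normalization is a stand-in for cosine distance, here is a minimal sketch in plain Python. The helper names and the two example vectors are made up for illustration. It checks that the squared Euclidean distance between two unit-length vectors equals twice their cosine distance, so both rank neighbours in the same order.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """1 - cosine similarity, computed on the raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two arbitrary example embeddings (hypothetical values).
a = [1.0, 2.0, 3.0]
b = [-2.0, 0.5, 4.0]

# For unit vectors u and v: ||u - v||^2 = 2 - 2 * (u . v) = 2 * cosine_distance.
lhs = squared_euclidean(l2_normalize(a), l2_normalize(b))
rhs = 2.0 * cosine_distance(a, b)
print(math.isclose(lhs, rhs))
```

Because the relationship is monotonic, a Euclidean-based clusterer run on normalized embeddings sees the same nearest-neighbour ordering as one using cosine distance.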
If you want to fix the topic size, I would advise using an algorithm like k-Means in BERTopic instead of HDBSCAN. k-Means allows you to select the number of clusters...
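As a rough sketch of that suggestion, a scikit-learn `KMeans` instance can be passed to BERTopic in place of HDBSCAN. This assumes a BERTopic version that accepts custom clustering models through the `hdbscan_model` argument; the helper name `build_kmeans_topic_model` and the default of 50 topics are only illustrative.

```python
def build_kmeans_topic_model(n_topics: int = 50):
    """Build a BERTopic model that clusters with k-Means instead of HDBSCAN.

    `n_topics` is a hypothetical default; pick the number of clusters you need.
    """
    # Imported lazily so the heavy dependencies are only loaded when needed.
    from bertopic import BERTopic
    from sklearn.cluster import KMeans

    # k-Means assigns every document to a cluster, so no -1 outlier topic is produced.
    cluster_model = KMeans(n_clusters=n_topics, random_state=42)

    # BERTopic takes the clustering model through its `hdbscan_model` argument.
    return BERTopic(hdbscan_model=cluster_model)
```

You would then call `topic_model = build_kmeans_topic_model(50)` followed by `topic_model.fit_transform(docs)`, where `docs` is your own list of strings.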
BERTopic already picks out outliers: they form the -1 topic that you will find when you run `topic_model.get_topic_info()`, so there is no need to pick out outliers separately. However, if you...
With k-Means there are other tricks you can do, like finding the points in a cluster that are furthest from the center. I would advise increasing the number of topics...
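One way to sketch the "furthest from the center" trick is shown below in plain Python. It uses the mean of each cluster's points as the center, which is an assumption; with a fitted scikit-learn `KMeans` you could use its `cluster_centers_` instead. The function name and the toy 2-D points are hypothetical.

```python
import math
from collections import defaultdict


def furthest_from_center(points, labels, top_n=1):
    """Return, per cluster label, the indices of the `top_n` points
    that lie furthest from the mean of that cluster."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    result = {}
    for label, idxs in members.items():
        dim = len(points[idxs[0]])
        # Cluster center as the coordinate-wise mean of its members.
        center = [sum(points[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Sort members by distance to the center, largest first.
        ranked = sorted(idxs, key=lambda i: math.dist(points[i], center), reverse=True)
        result[label] = ranked[:top_n]
    return result


# Toy data: index 3 sits far away from the rest of cluster 0.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0),
          (10.0, 10.0), (10.1, 10.0), (10.0, 10.2)]
labels = [0, 0, 0, 0, 1, 1, 1]

candidates = furthest_from_center(points, labels, top_n=1)
print(candidates)
```

The returned indices are only candidate outliers; whether a point is far enough away to count as one is still a judgment call, for example a distance threshold per cluster.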
There is currently a GPU-accelerated implementation by rapidsai [here](https://github.com/rapidsai/rapids-examples/tree/main/cuBERT_topic_modelling) that you can try out. I have yet to try it myself, but from what I have...
@p-dre A few days ago, I released BERTopic [v0.10.0](https://maartengr.github.io/BERTopic/changelog.html#version-0100) which allows you to use different models for HDBSCAN and UMAP. This also allows you to use the GPU-accelerated version of...
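A minimal sketch of swapping in custom models could look like the following. It assumes that the GPU-accelerated models in question are cuML's `UMAP` and `HDBSCAN`, and that a RAPIDS installation with a supported GPU is available. The helper name and the parameter values are illustrative, not recommendations.

```python
def build_gpu_topic_model():
    """Build a BERTopic model that uses cuML's UMAP and HDBSCAN.

    Assumes `cuml` and `bertopic` (v0.10.0 or later) are installed.
    """
    # Imported lazily because cuML requires a CUDA-capable environment.
    from bertopic import BERTopic
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP

    # Illustrative settings; tune them for your own data.
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

    # Both sub-models are passed in place of the CPU defaults.
    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```

Calling `build_gpu_topic_model().fit_transform(docs)` on your own list of documents then runs the dimensionality reduction and clustering steps on the GPU, while the rest of the pipeline stays unchanged.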
@kuchenrolle After using the `cuml.cluster.HDBSCAN` model, you can access the probabilities with `topic_model.hdbscan_model.probabilities_`. I am not entirely sure though whether we can use the `membership_vector` in cuml through the original...
@beckernick Interesting! Haven't seen such a pattern before but it definitely seems like it would fit nicely with the use cases described here. Assuming the goal is to have a...
Apologies for the late reply, life has been unexpectedly hectic lately! Indeed, preprocessing is typically not necessary as it can influence the embedding creation process. However, you can use the...