Scikit-learn's HDBSCAN Implementation

Open MaartenGr opened this issue 1 month ago • 0 comments

HDBSCAN was added to scikit-learn in a recent release (v1.3, I believe) with base functionality. Since scikit-learn is already a requirement of BERTopic, it stands to reason to use that implementation instead of the original one, as scikit-learn has more contributors. Moreover, this might alleviate the common installation issues related to HDBSCAN.

There are a couple of issues worth mentioning:

  • Calculation of probabilities is, if I'm not mistaken, not implemented in scikit-learn's HDBSCAN
    • A solution would be to use cosine similarities between document and topic embeddings as the default method of calculating probabilities
  • The feature set is smaller than that of the original implementation
  • Speed needs to be tested to identify whether this is worth it
  • Accuracy, whatever that means in this context, might also need some exploration

For those reading this, I'm interested to hear what you all think about this suggested change!

MaartenGr, Jun 03 '24 10:06