BERTopic
Scikit-learn's HDBSCAN Implementation
HDBSCAN was added to scikit-learn in a recent release (v1.3, I believe) with its base functionality. Since scikit-learn is already a requirement of BERTopic, it makes sense to use that implementation instead of the original hdbscan package, as scikit-learn has more contributors. Moreover, this might alleviate common installation issues related to HDBSCAN.
There are a couple of issues worth mentioning:
- Calculation of the full per-topic probabilities is, if I'm not mistaken, not implemented in scikit-learn's HDBSCAN (it exposes the membership strength of each point for its assigned cluster only)
  - A solution would be to use cosine similarities as the default method of calculating probabilities
- The feature set is smaller than the original implementation
- Speed needs to be benchmarked against the original implementation to determine whether the switch is worth it
- Accuracy, whatever that means in this context, might also need some exploration
For those reading this, I'm interested to hear what you all think about this suggested change!