Allow setting cluster_selection_epsilon in hdbscan()
Malzer & Baum describe how adding a minimum threshold value of eps to HDBSCAN can help with 'micro-clusters' in high density regions. This is implemented in the hdbscan Python package as a cluster_selection_epsilon parameter. Perhaps it could be added to this package too?
@joeroe Do you know how to add this functionality? @peekxc is the original author of the HDBSCAN implementation.
I'm not familiar with the paper mentioned. That said, it may be possible to add this by post-processing the information returned by the internal hdbscan calls. After all, obtaining DBSCAN* clusters amounts to cutting the linkage produced by hdbscan via cutree.
Otherwise, the computeStability function would have to be forked on the cpp-side, which is a bit involved.
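To illustrate the "cutting the linkage" idea, here is a minimal, language-neutral sketch in Python rather than R. It assumes the hierarchy is available as the edges of a minimum spanning tree over mutual reachability distances; the function name `cut_linkage` and the toy edge list are hypothetical and not part of this package. Treating singleton components as noise (label 0) is a simplification of DBSCAN*, not the package's exact behaviour.

```python
def cut_linkage(n_points, mst_edges, eps):
    """Cut a single-linkage hierarchy at height `eps`.

    mst_edges: iterable of (u, v, weight) tuples (assumed to be MST edges
    over mutual reachability distance). Returns one label per point;
    0 means noise (singleton component), clusters are numbered from 1.
    """
    parent = list(range(n_points))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Keep only edges at or below the cut height.
    for u, v, w in mst_edges:
        if w <= eps:
            parent[find(u)] = find(v)

    roots = [find(i) for i in range(n_points)]
    sizes = {}
    for r in roots:
        sizes[r] = sizes.get(r, 0) + 1

    ids = {}
    labels = []
    for r in roots:
        if sizes[r] == 1:
            labels.append(0)  # simplification: singletons are noise
        else:
            labels.append(ids.setdefault(r, len(ids) + 1))
    return labels


# Toy hierarchy: two tight groups joined by one long edge.
edges = [(0, 1, 0.2), (1, 2, 0.3), (2, 3, 2.0), (3, 4, 0.4)]
labels_at_1 = cut_linkage(5, edges, eps=1.0)
labels_at_035 = cut_linkage(5, edges, eps=0.35)
```

Lowering `eps` dissolves the looser group into noise first, which is the behaviour a fixed-height cut gives and the reason a post-processing approach at the R level seems plausible.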
It's a bit beyond me, to be honest. Since it's a modification to the cluster extraction algorithm, does it have to be in the C++ code? Here's the PR that added it to Python's hdbscan: https://github.com/scikit-learn-contrib/hdbscan/pull/329. Perhaps @cmalzer or @lmcinnes could help?
My R is not great, and tracing it into the C++ I lose track a little of exactly what data structures we have, but if I read it right you should be able to insert code here at the R level to do the epsilon checks and pick different clusters accordingly. There are different data structures in play; I think this is a lot easier if you have the condensed (in this package, "simplified") tree data structure available -- it isn't clear to me if that's actually exposed at the R level, or if you have to be in the C++.
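For reference, the epsilon check itself is small once a condensed tree is at hand. The following is a rough Python sketch of the idea, not the code from either package: the tree is represented by two hypothetical dicts (`parent` and `birth_eps`, the distance at which each cluster first appears), and `select_with_epsilon` is an illustrative name. It assumes the usual stability-based selection has already produced `selected`, and that the root is never returned (i.e. no single-cluster result).

```python
def select_with_epsilon(selected, parent, birth_eps, epsilon, root=0):
    """Replace clusters born below `epsilon` with a coarser ancestor.

    selected:  cluster ids chosen by the usual stability-based selection
    parent:    dict mapping cluster id -> parent cluster id
    birth_eps: dict mapping cluster id -> distance at which it split off
    """
    def climb(node):
        # Walk towards the root until an ancestor was born at a distance
        # of at least `epsilon`; stop just below the root.
        p = parent[node]
        if p == root:
            return node
        if birth_eps[p] >= epsilon:
            return p
        return climb(p)

    result = set()
    for c in selected:
        if birth_eps[c] >= epsilon:
            result.add(c)         # already coarse enough, keep as is
        else:
            result.add(climb(c))  # micro-cluster: merge upwards
    return result


# Toy tree: root 0 splits into 1 and 2; 1 splits into 3 and 4 at a very
# small distance (micro-clusters); 2 splits into 5 and 6 at a larger one.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
birth_eps = {1: 5.0, 2: 5.0, 3: 0.3, 4: 0.3, 5: 2.0, 6: 2.0}
merged = select_with_epsilon({3, 4, 5, 6}, parent, birth_eps, epsilon=1.0)
```

With `epsilon=1.0`, the two micro-clusters 3 and 4 collapse into their parent 1, while 5 and 6 are left alone because they separated at a distance above the threshold.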
Hi, I just made a pull request with a simple integration of the cluster_selection_epsilon parameter. I added some explanation to the pull request. Please double-check everything, since I'm not very familiar with R/Rcpp and this repository in general.
This feature is now included.