hdbscan Clustering with constraints

Dear HDBSCAN developpers,

I'm a physicist using the HDBSCAN algorithm to analyze experimental results (I'm therefore not specialized in clustering!). It works quite well on my data, but I have the feeling I could improve the results by adding some cannot-link constraints.

Reading the doc, I don't have the feeling that there is a easy way to introduce constraints in the clustering algorithm (see for example https://doi.org/10.1007/978-3-540-72530-5_25) I therefore tried something: artificially increase the distance in the precomputed distance matrix for the data points that should satisfy a cannot-link constraint. It turns out that even with a very large distance for points that cannot be in the same cluster, some points can end up in the same cluster: apparently this is probably not the right way of doing this. Do you have any ideas how to deal with such constraints using your algorithm? Are you planning to introduce such a feature in the future? Best, Geoffroy

May 15 '19 09:05 gaubry

I don't have any immediate solutions to this. Increasing distances in the distance matrix only gets around direct connection, but potentially if there is a path that connects the two points through a sequence of other close points they will still get connected. Semi-supervised clustering is a challenging problem in general, and the "don't cluster" constraint is also much harder than semi-supervision that suggests that two points should be in the same cluster.

The best I can offer is that you could try using the semi-supervised or supervised version of UMAP and then clustering that with HDBSCAN. The other alternative is to try to find implementations of the C-DBSCAN approach you linked to.

May 15 '19 17:05 lmcinnes

@lmcinnes @gaubry The original paper introducing hdbscan (https://dl.acm.org/doi/10.1145/2733381) actually also offers a semi-supervised solution (section 5.3) for the clusters selection mechanism.

May 25 '20 08:05 yossibiton

@lmcinnes Adding in 'cannot-link' constraints is of significant interest to my research group. Is there any way to incentivize this feature to be implemented.

For what it's worth, we implemented a hacked version of this already that just repeats hdbscan clustering after finding cannot-link violations and setting the distance between all samples in the cluster and the first violation to be disconnected so that the merge that caused the violation won't happen in the next iteration of clustering. This is accomplished by walking up and down the hierarchy repeatedly to find violating samples and it is very slow. In theory, a modification to the algorithm that checks for cannot-link violations before merging would be sufficient. There would be a significant hit to performance for sure, but it would be much faster compared to the existing hacked solution.

Oct 27 '24 17:10 RichieHakim

@RichieHakim I think at this stage the best option is to look at fast_hdbscan which is a newer multicore implementation. A recent pull request added support for semi-supervised clustering which should enable things like what you are looking for.

Oct 28 '24 01:10 lmcinnes