Inconsistent clustering results using raw data and 'precomputed' distance matrix
Hi,
Truly appreciate your amazing work on this project. Thanks!
While playing with the HDBSCAN.fit() API, I found that it can take either a feature array or a distance matrix as input. So I created a set of toy samples on a 2D plane and <1> first fed them as a feature array (of shape (num_samples, num_features=2)) to HDBSCAN.fit() with the default metric='euclidean'; then <2> I computed the Euclidean distance matrix of these samples using sklearn.metrics.pairwise.pairwise_distances() and fed it (of shape (num_samples, num_samples)) to HDBSCAN.fit() again, this time with metric='precomputed'.
However, the returned clustering results are inconsistent with each other (note that the bottom-left 's-curve' is treated as one cluster in the left figure but as two clusters in the right).
I'm a little confused by this result, since the inputs are just different forms of the same points; shouldn't the clustering results be the same? If the results are expected to differ due to different implementations, is there a way to make them exactly the same regardless of which form the input data takes? Since I'm new to this, I'm not sure whether I did something wrong. Please kindly help me with this issue; many thanks for your time. :)
The toy data and code that I used to generate the above figure are given below.
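In case the attachment doesn't render, here is a minimal stand-in version of the code; make_moons substitutes for my actual toy samples.

```python
# Minimal stand-in version of my script -- make_moons substitutes
# for the actual toy samples attached above.
import hdbscan
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import pairwise_distances

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# <1> feature array (num_samples, 2) with the default euclidean metric
labels_raw = hdbscan.HDBSCAN(metric='euclidean').fit_predict(X)

# <2> precomputed euclidean distance matrix (num_samples, num_samples)
D = pairwise_distances(X, metric='euclidean')
labels_pre = hdbscan.HDBSCAN(metric='precomputed').fit_predict(D)

# with my original data, these two labelings disagreed
print(np.array_equal(labels_raw, labels_pre))
```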
Update:
When I set approx_min_span_tree=False, the clustering results from the feature array and the distance matrix become almost the same. However, there is still some difference:
The first 10 probabilities_:

feature array:   [ 1.  1.  0.73569114  0.80287432  0.8356183   1.  1.  1.  0.8499631   0.96531667]
distance matrix: [ 1.  1.  0.73569114  0.80287432  0.8356183   1.  1.  1.  0.91576507  0.96531667]

The cluster_persistence_:

feature array:   [ 0.32875327  0.21143461  0.39062779  0.51435757  0.09232658  0.12809644  0.19578622]
distance matrix: [ 0.32875327  0.21143461  0.39062779  0.51435757  0.09232658  0.11616268  0.19578622]
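For completeness, a sketch of how I printed that comparison (again with stand-in data in place of my attached samples):

```python
# Sketch of how the comparison above was produced (stand-in data).
import hdbscan
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import pairwise_distances

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

c_raw = hdbscan.HDBSCAN(approx_min_span_tree=False).fit(X)
c_pre = hdbscan.HDBSCAN(metric='precomputed',
                        approx_min_span_tree=False).fit(pairwise_distances(X))

print(c_raw.probabilities_[:10])
print(c_pre.probabilities_[:10])
print(np.max(np.abs(c_raw.probabilities_ - c_pre.probabilities_)))
print(c_raw.cluster_persistence_)
print(c_pre.cluster_persistence_)
```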
In addition, I also tried other metrics: with 'cityblock' the clustering results are exactly the same in my experiments; with 'braycurtis' they become inconsistent again, as in the 'euclidean' case.
I was going to suggest that you need to set approx_min_span_tree to False, but I see you've already done that. The fact that there are still some differences even then is a little disconcerting. In practice there are different actual metric computations going on in the background (the feature-vector path will use the distance metrics computed by the kdtree/balltree code; precomputed will obviously use the sklearn metric computation), but I can't really see that producing variances that large. It could be due to the fact that we invert distances, and for very close distances that could exaggerate any minor precision errors. I'll try to look into it when I get time. Thanks for the detailed report!
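To illustrate the kind of discrepancy I mean, here is a quick sketch; this is not hdbscan's internal code path, just the two metric backends side by side:

```python
# Nearest-neighbour distances from a KDTree vs. the dense sklearn
# distance matrix -- the two backends can disagree at float precision.
import numpy as np
from sklearn.neighbors import KDTree
from sklearn.metrics.pairwise import pairwise_distances

X = np.random.RandomState(0).rand(200, 2)

# distances as the tree-based code computes them (5 nearest neighbours)
tree = KDTree(X, metric='euclidean')
d_tree, idx = tree.query(X, k=5)

# the same distances taken from the dense sklearn matrix
D = pairwise_distances(X, metric='euclidean')
d_dense = np.take_along_axis(D, idx, axis=1)

# typically around 1e-16 -- harmless on its own, but inverting very
# small distances can blow such errors up
print(np.max(np.abs(d_tree - d_dense)))
```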
Many thanks for your prompt reply. Your comments really resolved my doubts. I'll just use the feature array as input and leave the other computations to the background. :)
I got the same problem: 'precomputed' and 'euclidean' resulted in totally different clusters. Attached is the feature vector csv.
With metric='euclidean', the cluster labels are array([ 1, 1, 1, -1, 1, 0, 0, 2, 1, 2, -1, 0]), which is the right clustering. The parameters were clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric='euclidean').
With metric='precomputed', the cluster labels are array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]). The parameters were clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2, approx_min_span_tree=False).
On the other hand, if I change the feature vector to the following:

blobs, labels = make_blobs(n_samples=2000, n_features=100)

then the clustering results are the same for 'euclidean' and 'precomputed'.
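Roughly what my sanity check looked like (a sketch; the parameters mirror my earlier runs and may not be exactly what I used):

```python
# make_blobs data, where euclidean and precomputed agreed for me
import hdbscan
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances

blobs, labels = make_blobs(n_samples=2000, n_features=100)

labels_euc = hdbscan.HDBSCAN(min_cluster_size=2,
                             metric='euclidean').fit_predict(blobs)
labels_pre = hdbscan.HDBSCAN(min_cluster_size=2, metric='precomputed',
                             approx_min_span_tree=False).fit_predict(
    pairwise_distances(blobs))

# for this data the two labelings matched
print(np.array_equal(labels_euc, labels_pre))
```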
Here is the csv for the feature vector I was using; its shape is (12, 119).
Thanks, I'll look into this as soon as I get some time. This definitely seems like there may be a corner case bug somewhere. I definitely appreciate the data for reproducing, as that makes chasing down these sorts of problems a lot easier!
Let me know if there is anything I can help out with. Thanks for this amazing project, which is my main clustering tool.
Hello! I am interested in using HDBSCAN and I stumbled upon this. Has this issue been resolved yet?
Unfortunately, no. I don't believe it is hard, but I simply have not had the time to dig into the code and sort out exactly why it isn't working. Pull requests would be greatly appreciated!
Some comments:
The approx_min_span_tree=False setting needs to go on the vector-data version rather than the distance-matrix version; that may remedy some of these issues. The other issue is that there are actually some problems that may occur for very small distances, which need to be corrected. I have code in a branch (neg_exp) which should fix this and some other issues I had concerns about. You may wish to try playing with that branch. A warning: it does a few other things differently, so the clustering will almost certainly not be the same as what the master branch produces, but it should be more internally consistent within the branch.
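Concretely, something like this (a sketch with random stand-in data; if I'm reading the code paths right, the flag is ignored for precomputed input):

```python
# Stand-in for the attached (12, 119) feature array.
import hdbscan
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

X = np.random.RandomState(0).rand(12, 119)
D = pairwise_distances(X, metric='euclidean')

# the flag belongs on the vector-data call:
clusterer_raw = hdbscan.HDBSCAN(min_cluster_size=2, metric='euclidean',
                                approx_min_span_tree=False).fit(X)

# for precomputed input the exact spanning tree is built regardless,
# so the flag should make no difference here
clusterer_pre = hdbscan.HDBSCAN(min_cluster_size=2,
                                metric='precomputed').fit(D)
```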
Hey, is the neg_exp branch still the best way to address this?
Unfortunately yes. I had intended to write a new clustering library to address this and many other issues, but time has so far not permitted this. I remain hopeful that I will get a chance eventually.
Thanks. It might be nice to have a note of this in the docs.