KeyError in `all_points_membership_vectors`
I'm getting a KeyError when calling `all_points_membership_vectors` on a clusterer that was fit with a NumPy array of shape (54, 2):
clusterer = hdbscan.HDBSCAN(
    min_samples=10,
    prediction_data=True,
    allow_single_cluster=True,
    core_dist_n_jobs=1,
).fit(data)
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-66-3cbe7b8f8ddd> in <module>
      5     core_dist_n_jobs=1,
      6 ).fit(data)
----> 7 soft_clusters = hdbscan.all_points_membership_vectors(clusterer)

~/miniconda3/envs/py38/lib/python3.8/site-packages/hdbscan/prediction.py in all_points_membership_vectors(clusterer)
    536         clusterer.prediction_data_.exemplars,
    537         clusterer.prediction_data_.dist_metric)
--> 538     outlier_vecs = all_points_outlier_membership_vector(
    539         clusters,
    540         clusterer.condensed_tree_._raw_tree,

hdbscan/_prediction_utils.pyx in hdbscan._prediction_utils.all_points_outlier_membership_vector()

hdbscan/_prediction_utils.pyx in hdbscan._prediction_utils.all_points_outlier_membership_vector()

hdbscan/_prediction_utils.pyx in hdbscan._prediction_utils.all_points_per_cluster_scores()

KeyError: 54
When I changed min_samples from 10 to 5 I didn't get the error. Here's the data for reproduction.
Upon further inspection, both clusterer._prediction_data.leaf_max_lambdas and clusterer.prediction_data_.cluster_tree are empty.
There's indeed a single cluster:
In [19]: clusterer.labels_
Out[19]:
array([-1, -1, -1, -1, -1, -1, -1, -1,  0, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1,  0, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1,  0, -1])
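To make the single-cluster observation explicit, here is a minimal sketch that counts clusters and noise points from a label array. The helper name `summarize_labels` is hypothetical (it is not part of hdbscan), and the label list is rebuilt by hand from the output printed above rather than taken from a fitted clusterer.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (number of clusters, number of noise points, cluster sizes).

    Assumes the HDBSCAN convention that noise points are labelled -1.
    """
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # drop the noise label before counting clusters
    return len(counts), n_noise, dict(counts)


# Labels as printed above: 54 points, with label 0 at indices 8, 19, 29, 30 and 52.
labels = [-1] * 54
for i in (8, 19, 29, 30, 52):
    labels[i] = 0

n_clusters, n_noise, sizes = summarize_labels(labels)
print(n_clusters, n_noise, sizes)  # 1 49 {0: 5}
```

With a fitted clusterer, the same helper could be called as `summarize_labels(clusterer.labels_)`.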
Is there any plan to fix this? This should work, even for a single cluster!
I don't have the time to fix this these days, so I think it is safest to assume the soft clustering is unmaintained at this stage.
Any ideas/directions how to make this work?
I think there may be an issue with duplicated code that has fallen out of sync: the membership / prediction data code needs to get a cluster tree, and perhaps the error lies there. The alternative is that single-cluster cluster trees need special handling, in which case the membership vector code might simply require a special case for them.
On Sat, Aug 29, 2020 at 2:16 PM Simon-Martin Schröder wrote:
Any ideas/directions how to make this work?
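Until the single-cluster case is handled inside the library, one possible user-side workaround is to guard the call. The sketch below is only an illustration under two assumptions: that the failure is tied to fits with fewer than two clusters, as the report above suggests, and that `clusterer.probabilities_` is an acceptable stand-in for a one-column membership vector. The names `count_clusters` and `safe_membership_vectors` are hypothetical, and the fallback is not what hdbscan's soft clustering would compute.

```python
def count_clusters(labels):
    """Number of distinct non-noise labels (noise is labelled -1)."""
    return len({int(label) for label in labels if label != -1})


def safe_membership_vectors(clusterer, soft_fn):
    """Call soft_fn(clusterer) only when the fit produced at least two clusters.

    soft_fn is expected to be hdbscan.all_points_membership_vectors.
    Fallback (an assumption, not hdbscan behaviour): one column per cluster,
    holding clusterer.probabilities_ for cluster members and 0.0 for noise.
    """
    labels = list(clusterer.labels_)
    n_clusters = count_clusters(labels)
    if n_clusters >= 2:
        return soft_fn(clusterer)
    probabilities = list(clusterer.probabilities_)
    # n_clusters is 0 or 1 here, so each row has zero or one entries.
    return [
        [float(p) if label != -1 else 0.0 for _ in range(n_clusters)]
        for label, p in zip(labels, probabilities)
    ]
```

Usage would look like `soft_clusters = safe_membership_vectors(clusterer, hdbscan.all_points_membership_vectors)`. The fallback returns a nested list; wrapping it in `numpy.asarray` gives an array like the one the hdbscan function returns. The guard does not rule out the error when two or more clusters are found.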
@lmcinnes I noted that in #410 you commented that the soft clustering is mostly deprecated at this point. As that was about 8 months ago, I just wanted to check in again and see if that is still the case. If it is, do you have any recommendations on how to achieve a similar probability vector approach, either through code corrections here or with another library? We very much need it for a large project we're working on, so any recommendations would be helpful. We'd also be pleased to help fix it up via PRs here if that makes the most sense; we just need some updated guidance on what you think the best path forward would be.