
KeyError in `all_points_membership_vectors`

apcamargo opened this issue 5 years ago • 6 comments

I'm getting a `KeyError` while trying to use `all_points_membership_vectors` on a clusterer that was fit with a (54, 2)-shaped numpy array:

clusterer = hdbscan.HDBSCAN(
    min_samples=10,
    prediction_data=True,
    allow_single_cluster=True,
    core_dist_n_jobs=1,
).fit(data)
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-66-3cbe7b8f8ddd> in <module>
      5     core_dist_n_jobs=1,
      6 ).fit(data)
----> 7 soft_clusters = hdbscan.all_points_membership_vectors(clusterer)

~/miniconda3/envs/py38/lib/python3.8/site-packages/hdbscan/prediction.py in all_points_membership_vectors(clusterer)
    536         clusterer.prediction_data_.exemplars,
    537         clusterer.prediction_data_.dist_metric)
--> 538     outlier_vecs = all_points_outlier_membership_vector(
    539         clusters,
    540         clusterer.condensed_tree_._raw_tree,

hdbscan/_prediction_utils.pyx in hdbscan._prediction_utils.all_points_outlier_membership_vector()

hdbscan/_prediction_utils.pyx in hdbscan._prediction_utils.all_points_outlier_membership_vector()

hdbscan/_prediction_utils.pyx in hdbscan._prediction_utils.all_points_per_cluster_scores()

KeyError: 54

When I changed `min_samples` from 10 to 5, I didn't get the error. Here's the data for reproduction.

apcamargo avatar May 30 '20 22:05 apcamargo

Upon further inspection, both `clusterer.prediction_data_.leaf_max_lambdas` and `clusterer.prediction_data_.cluster_tree` are empty.

There's indeed a single cluster:

In [19]: clusterer.labels_                                                                                                                                                                                                                     
Out[19]: 
array([-1, -1, -1, -1, -1, -1, -1, -1,  0, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1,  0, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1,  0, -1])
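As a stopgap for this degenerate single-cluster case, a soft-membership matrix can be assembled directly from `labels_` and `probabilities_` instead of calling `all_points_membership_vectors`. This is only a workaround sketch (the helper name is made up, and it does not reproduce the library's outlier-based soft-clustering math); noise points simply get an all-zero row:

```python
import numpy as np

def membership_from_labels(labels, probabilities):
    """Build an (n_samples, n_clusters) soft-membership matrix from
    hard labels and per-point probabilities. Assumes labels are
    0..k-1 with -1 marking noise; noise rows stay all-zero."""
    labels = np.asarray(labels)
    probabilities = np.asarray(probabilities)
    n_clusters = labels.max() + 1  # -1-only input yields 0; clamp below
    out = np.zeros((labels.shape[0], max(int(n_clusters), 1)))
    mask = labels >= 0
    # Each clustered point's membership in its own cluster is its
    # cluster-membership probability; all other entries remain 0.
    out[mask, labels[mask]] = probabilities[mask]
    return out
```

With the labels above this would yield a 54-by-1 matrix whose five non-zero rows correspond to the points assigned to cluster 0.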

apcamargo avatar Jun 02 '20 19:06 apcamargo

Is there any plan to fix this? This should work, even for a single cluster!

moi90 avatar Aug 27 '20 13:08 moi90

I don't have the time to fix this these days, so I think it is safest to assume the soft clustering is unmaintained at this stage.

lmcinnes avatar Aug 27 '20 14:08 lmcinnes

Any ideas/directions how to make this work?

moi90 avatar Aug 29 '20 18:08 moi90

I think there may be an issue with duplicated code that has fallen out of sync -- the membership / prediction data code needs to get a cluster tree, and perhaps the error lies there. The other possibility is that single-cluster cluster trees genuinely need special handling, in which case the fix might simply be to special-case them in the membership vector computation.


lmcinnes avatar Aug 29 '20 18:08 lmcinnes

@lmcinnes I noticed in #410 that you described the soft clustering as mostly deprecated. Since that was about 8 months ago, I wanted to check whether that is still the case. If it is, do you have any recommendations for achieving a similar probability-vector approach, either through code corrections here or via another library? We very much need it for a large project we're working on, so any guidance would be helpful. We'd also be glad to help fix it up via PRs here if that makes the most sense; we just need some updated direction on what you think the best path forward would be!

emigre459 avatar Dec 09 '21 18:12 emigre459