approximate_predict assigns unforeseen labels

Open vlavorini opened this issue 4 years ago • 3 comments

Hi! I have trained a model whose cluster labels go up to 7 (plus -1 for noise):

set(clusterer.labels_)
# result: {-1, 0, 1, 2, 3, 4, 5, 6, 7}

but if I use the model with unseen data I obtain other labels:

selected_reds = np.array([[-0.03070436,  7.75012684],
                          [-0.14502905,  7.75161695],
                          [ 0.0684749 ,  7.80699873],
                          [ 0.03331913,  7.69571781]])

hdbscan.approximate_predict(clusterer, selected_reds)

# result: (array([8, 8, 8, 8]), array([0.61420229, 0.61254361, 0.56375918, 0.68376323]))

And I have 'new labels' up to 35, while the maximum should be 7 (right?).

I'm aware that the approximate_predict function should be used with care, but is this behavior expected?

And if so, how can I avoid getting unforeseen labels for this new data?
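To make the mismatch easy to spot, here is a minimal sketch that compares predicted labels against the labels seen during training. The helper name `flag_unforeseen_labels` is hypothetical and not part of hdbscan. Relabelling the unexpected values as noise (-1) is only an assumption about how one might handle them defensively; it does not address whatever causes the behavior.

```python
import numpy as np

def flag_unforeseen_labels(train_labels, predicted_labels, noise_label=-1):
    """Hypothetical helper: mark predicted labels that never appeared in training.

    Returns (cleaned_labels, unseen_mask), where unseen_mask is True for every
    predicted label absent from train_labels, and cleaned_labels has those
    entries replaced by noise_label.
    """
    train_labels = np.asarray(train_labels)
    predicted_labels = np.asarray(predicted_labels)
    known = np.unique(train_labels)
    # True wherever the predicted label is not one of the training labels
    unseen_mask = ~np.isin(predicted_labels, known)
    # Assumption: treat unforeseen labels as noise rather than as clusters
    cleaned = np.where(unseen_mask, noise_label, predicted_labels)
    return cleaned, unseen_mask

# Label values taken from the outputs shown above
train_labels = np.array([-1, 0, 1, 2, 3, 4, 5, 6, 7])
predicted = np.array([8, 8, 8, 8])

cleaned, unseen = flag_unforeseen_labels(train_labels, predicted)
print(unseen)   # [ True  True  True  True]
print(cleaned)  # [-1 -1 -1 -1]
```

With a fitted model, the inputs would be `clusterer.labels_` and the first array returned by `hdbscan.approximate_predict(clusterer, selected_reds)`.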

EDIT: Someone else has the same issue; see the question on Stack Overflow.

vlavorini, Jan 04 '21 14:01

I have the exact same issue: approximate_predict always generates new labels (usually doubling the number of labels). Also, the membership strengths for these new labels are very high (between 0.8 and 1.0).

If that's a feature, it would be great to have it documented, but it seems like a bug to me.

nbeuchat, Jan 19 '21 18:01

See also https://github.com/scikit-learn-contrib/hdbscan/issues/361 (same as the SO link from above)

DWFlanagan, Apr 21 '21 14:04

Same here, thanks for bringing this up. Is there any explanation / solution for this?

horsto, Jun 10 '21 16:06