hdbscan
approximate_predict assigns unforeseen labels
Hi! I have trained a model whose cluster labels run from 0 to 7 (plus -1 for noise):
set(clusterer.labels_)
# result: {-1, 0, 1, 2, 3, 4, 5, 6, 7}
but if I use the model with unseen data I obtain other labels:
selected_reds = np.array([[-0.03070436, 7.75012684],
[-0.14502905, 7.75161695],
[ 0.0684749 , 7.80699873],
[ 0.03331913, 7.69571781]])
hdbscan.approximate_predict(clusterer, selected_reds)
# result: (array([8, 8, 8, 8]), array([0.61420229, 0.61254361, 0.56375918, 0.68376323]))
And I have 'new labels' up to 35, while the maximum should be 7 (right?).
I'm aware that the approximate_predict function should be used with care, but is this behavior expected?
And if so, how can I avoid getting unforeseen labels for this new data?
EDIT: Someone else has the same issue, check it on StackOverflow
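For anyone who wants to check whether they are affected, here is a minimal sketch that lists the predicted labels which never occur in the training labels. The helper name `unseen_labels` is hypothetical (it is not part of hdbscan), and the arrays are just the values from this report. With a fitted model you would pass `clusterer.labels_` and the first array returned by `hdbscan.approximate_predict` instead.

```python
import numpy as np


def unseen_labels(train_labels, predicted_labels):
    """Return the sorted predicted labels that never occur in the training labels.

    Hypothetical helper for diagnosis only; it is not part of the hdbscan API.
    """
    return np.setdiff1d(np.unique(predicted_labels), np.unique(train_labels))


# Values copied from the report above (stand-ins for clusterer.labels_ and
# the labels returned by hdbscan.approximate_predict).
train_labels = np.array([-1, 0, 1, 2, 3, 4, 5, 6, 7])
predicted = np.array([8, 8, 8, 8])

print(unseen_labels(train_labels, predicted))  # prints [8]
```

An empty result would mean every predicted label already exists in the trained model. This only detects the problem; it does not explain or fix it.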
I have the exact same issue: approximate_predict always generates new labels (usually doubling the number of labels). Also, the membership strengths for these new labels are very high (between 0.8 and 1.0).
If that's a feature, it would be great to have it documented, but it seems like a bug to me.
See also https://github.com/scikit-learn-contrib/hdbscan/issues/361 (same as the SO link from above)
Same here, thanks for bringing this up. Is there any explanation / solution for this?