
Closest clusters are not consistent with the cluster labels

Open AndrewNg opened this issue 7 years ago • 27 comments

It seems that the closest_clusters are labeled with a different numbering system than cluster.labels_. I ran the example code on some data and expected closest_cluster to match the label whenever a data point had a label (i.e., the point was not noise, label -1). However, the labels did not match up.

The expectation is that a data point with label 4 will also have closest cluster 4. After going through the rest of the data, label 4 and closest cluster 3 map to the same cluster, but they are numbered inconsistently.

AndrewNg avatar Aug 01 '17 15:08 AndrewNg

After some more testing, it looks like the label numbers are reversed. Here's a mapping of the closest_cluster label to label:

defaultdict(None, {3: 4, 5: 2, 6: 1, 2: 5, 0: 7, 4: 3, 1: 6, 7: 0})
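Until the numbering is fixed, a mapping like the one above can be recovered programmatically by majority vote between the two label sets. The sketch below is a workaround, not part of the hdbscan API; `derive_label_mapping` and the toy arrays are illustrative.

```python
from collections import Counter, defaultdict

def derive_label_mapping(hard_labels, soft_assignments):
    """For each soft-cluster index, find the hard label that the
    majority of its points carry, skipping noise points (hard label -1)."""
    votes = defaultdict(Counter)
    for hard, soft in zip(hard_labels, soft_assignments):
        if hard >= 0:
            votes[soft][hard] += 1
    return {soft: counts.most_common(1)[0][0] for soft, counts in votes.items()}

# Toy data where the soft indices are the reverse of the hard labels,
# mimicking the reversed numbering reported above.
hard = [0, 0, 1, 1, 2, 2, -1]
soft = [2, 2, 1, 1, 0, 0, 0]
print(derive_label_mapping(hard, soft))  # {2: 0, 1: 1, 0: 2}
```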

AndrewNg avatar Aug 01 '17 17:08 AndrewNg

That's a little disconcerting. I'll see if I can track down where things are getting reversed -- I suspect there is some duplicated/copy-pasted code on my part that got updated in one place and not the other. Sorry about that.

lmcinnes avatar Aug 01 '17 20:08 lmcinnes

I noticed something similar when using the soft clustering. Say the algorithm finds 2 clusters: if you print the probabilities computed using hdbscan.all_points_membership_vectors(hdbscan_clusterer), there are scenarios where the probabilities are inverted. For instance, I noticed something like the following:

| datapoint_id | label | probs |
| --- | --- | --- |
| 1 | 1 | [0.8, 0.2] |
| 2 | 0 | [0.3, 0.7] |
| 3 | 1 | [0.7, 0.3] |
| 4 | 0 | [0.1, 0.9] |

Working with my data I noticed something else that might help in finding the problem. If I print out the cluster ids using hdbscan_clusterer.condensed_tree_._select_clusters() the ids are not sorted when I have the problem while when the labels correspond to the right probabilities the select_cluster method outputs a sorted array of ids. I hope this could help.
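That observation can be turned into a quick automated check: verify that the ids coming back from _select_clusters() are already in ascending order. The helper below is a minimal sketch that works on any plain sequence of ids; relying on a private method like _select_clusters() is of course fragile.

```python
def selection_is_sorted(selected_ids):
    """True when the selected cluster ids are in ascending order --
    the case where, per the observation above, hard labels and soft
    probabilities were seen to line up."""
    return all(a <= b for a, b in zip(selected_ids, selected_ids[1:]))

print(selection_is_sorted([12, 15, 19, 23]))  # True: labels should match
print(selection_is_sorted([15, 12, 23, 19]))  # False: expect permuted labels
```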

I can also save some toy data I have in a csv to reproduce this problem if you like.

CoffeRobot avatar Aug 11 '17 12:08 CoffeRobot

Hey @lmcinnes @AndrewNg . Any updates on this issue? I'm faced with a similar problem on my real-world dataset, with no apparent relation between probabilities and cluster labels (the probability indices neither match the cluster labels nor are simply inverted). Is there a workaround to derive a cluster label mapping based on the soft clustering probabilities?

However, I couldn't reproduce the issue in the example code (the digits dataset in the link above). The cluster index from the membership vector (via numpy.argmax(all_points_membership_vectors)) is now equal to the cluster label for all non-noise data points in the digits dataset, so I assume some work has been done on this? Puzzling!?!

vivekbharadwaj avatar Oct 13 '17 02:10 vivekbharadwaj

It's worth mentioning that when I ran the clustering algorithm on a different market segment, the cluster labels did match the index of the membership vector probabilities.

After some more analysis on the previous dataset with the mismatch problem, I discovered a cyclic pattern in the sorting: the indices of the highest probabilities in the membership vectors start at 29 instead of 0.

Happy to pm you a pickle if it helps you reproduce the problem...

vivekbharadwaj avatar Oct 13 '17 02:10 vivekbharadwaj

Sorry, I have been very busy with a number of other projects, and this was relatively low on the priority list (I was hoping to significantly overhaul the soft clustering at some point, and get to this then). I probably won't have time to get to this that soon either. I believe the problem should be a relatively easy fix -- one needs to compare the cluster selection code from _hdbscan_tree.pyx with the soft clustering code and make sure they actually align properly. I would be more than happy to accept a PR, but can't promise to get to this myself for a little while.

lmcinnes avatar Oct 13 '17 14:10 lmcinnes

Thanks for your prompt reply Leland. I'm unable to work on it since I'm travelling until next week. Might give it a go once I'm back.

vivekbharadwaj avatar Oct 19 '17 03:10 vivekbharadwaj

@lmcinnes @AndrewNg Any progress on this? I just ran into this issue as well...

gilgtc avatar May 09 '18 23:05 gilgtc

I believe this did get fixed actually, but due to some other patches elsewhere that intersected with this. What version of hdbscan are you running?

lmcinnes avatar May 10 '18 00:05 lmcinnes

I have the latest version I believe: hdbscan-0.8.13

I just ran it again to make sure and it still has the same behavior as described by OP

gilgtc avatar May 10 '18 00:05 gilgtc

Hmm, let me take a look again.

lmcinnes avatar May 10 '18 00:05 lmcinnes

@lmcinnes thanks for taking a look at this Leland. I am very eager to use this functionality and appreciate your time and effort.

gilgtc avatar May 11 '18 18:05 gilgtc

I have a proposed fix -- let me know if the current master resolves the issue for you.

lmcinnes avatar May 15 '18 00:05 lmcinnes

@lmcinnes I only had a short time to try but it seems that I still get the same behavior. I will try it again tonight on a simpler case and let you know. In the meantime, if anyone @AndrewNg could try it as well it would be helpful.

gilgtc avatar May 16 '18 22:05 gilgtc

@lmcinnes @AndrewNg

I ran the soft clustering example and still got some mixed results. Most of the cluster labels from clusterer.labels_ match the index of the top probability in hdbscan.all_points_membership_vectors(clusterer), but there are still a few which don't. Specifically, out of 814 data points, 798 are correctly identified but 16 are incorrect which is a bit disconcerting. See full example below:

import hdbscan
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.manifold import TSNE

digits = datasets.load_digits()
data = digits.data
projection = TSNE().fit_transform(data)

plot_kwds = {'alpha': 0.25, 's': 50, 'linewidth': 0}
plt.scatter(*projection.T, **plot_kwds)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0 else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p)
                         for x, p in zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
cluster_colors = [color_palette[np.argmax(x)] for x in soft_clusters]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_colors, alpha=0.25)

num_wrong = 0
num_right = 0
for c, sc in zip(clusterer.labels_, soft_clusters):
    if c > 0:  # skips noise (-1); note this also skips cluster 0
        if c != np.argmax(sc):
            num_wrong += 1
            print('(%d, %d)' % (c, np.argmax(sc)))
        else:
            num_right += 1

print('num_right = %d, num_wrong = %d' % (num_right, num_wrong))

The result is as follows (it only shows the pairs which didn't match correctly, and at the end the total counts of correct and incorrect):

(8, 7) (1, 3) (5, 11) (1, 3) (1, 6) (1, 6) (3, 8) (1, 6) (10, 11) (6, 10) (4, 2) (3, 6) (4, 9) (9, 10) (4, 9) (1, 6)

num_right = 798, num_wrong = 16

gilgtc avatar May 18 '18 18:05 gilgtc

Thanks for the example. Unfortunately it looks like I'm not going to have time to dig into this until Tuesday. Hopefully it can wait until then, at which point I'll try to get into this properly and see if I can figure out what on earth is going astray.

lmcinnes avatar May 18 '18 19:05 lmcinnes

@lmcinnes no worries, thanks for taking a look.

gilgtc avatar May 21 '18 17:05 gilgtc

Digging in to this, I think the answer (unfortunately?) is that this is "just how it works". The soft clustering considers the distance from exemplars, and the merge height in the tree between the point and each of the clusters. The points that end up "wrong" are points that sit on a split in the tree -- they have the same merge height to their own cluster as to the neighbouring one (perhaps that is a bug, I'll look into it further). That means tree-wise we don't distinguish them, and in terms of pure ambient distance to exemplars they are closer to the "wrong" cluster, and so get misclassified. This is a little weird, but the soft clustering is ultimately a little different from the hard clustering, so corner cases like this can theoretically occur.
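Given that explanation, the points at risk are the ones whose top two membership probabilities are nearly tied. A small numpy sketch to flag them (the `margin` threshold is an arbitrary illustrative choice, not anything hdbscan exposes):

```python
import numpy as np

def ambiguous_points(membership, margin=0.1):
    """Return the indices of points whose top two membership
    probabilities differ by less than `margin` -- the split-sitting
    points described above, which may take a different soft label."""
    ordered = np.sort(membership, axis=1)
    gap = ordered[:, -1] - ordered[:, -2]
    return np.flatnonzero(gap < margin)

m = np.array([[0.90, 0.10],
              [0.52, 0.48],   # near-tie: sits on a split in the tree
              [0.20, 0.80]])
print(ambiguous_points(m))  # [1]
```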

lmcinnes avatar May 22 '18 13:05 lmcinnes

@lmcinnes Thanks for looking at it, that makes sense. I'll keep an eye on this thread but unfortunately, at least as it is now, I don't think I will be able to use it because in my data set the number of "wrong" clusters is pretty high.

gilgtc avatar May 22 '18 17:05 gilgtc

I understand. I have plans for a different clustering algorithm that is more amenable to producing soft clustering via something along these lines, but likely rather more robustly. Sorry I couldn't be of more help at this time.

lmcinnes avatar May 22 '18 18:05 lmcinnes

@lmcinnes Cool, i look forward to that. Best of luck.

gilgtc avatar May 22 '18 20:05 gilgtc

Any updates on this algorithm, @lmcinnes? Is this inconsistent labeling still an issue? Thanks!

ricsinaruto avatar Mar 08 '19 23:03 ricsinaruto

Unfortunately my time has been rapidly soaked up by other projects (largely UMAP), so I haven't had the opportunity to sit down and code up the new algorithm as I would like it to be yet. I believe some fixes were put in place that *should* address the inconsistent labelling, but I haven't actually checked, so I can't make any promises.

lmcinnes avatar Mar 09 '19 22:03 lmcinnes

Hello, any updates on the labelling issue in soft clustering? @lmcinnes @gilgtc @AndrewNg Thank you!

mik1904 avatar Jan 03 '20 12:01 mik1904

Not as yet, sorry.

lmcinnes avatar Jan 03 '20 17:01 lmcinnes

I ran into a dataset where nearly 90% of the points have different soft and hard cluster labels... do we have any update since last year?

fgg1991 avatar Mar 01 '21 05:03 fgg1991

Since it appears this method isn't getting worked on, is there another method people are using to determine the next-best cluster for HDBSCAN data?

irvintim avatar Jul 13 '21 23:07 irvintim