
Strange behaviour clustering 1D data

Open SergioG-M opened this issue 3 years ago • 5 comments

Hi,

I am getting a strange clustering when trying to cluster 1D data. As you can see in the plot below, sometimes I get a cluster with noise inside it. I can't see any reason for this to happen. Is this a bug, or am I missing something?

[image: 1D clustering result with noise points appearing inside a cluster]

You can reproduce this result with the csv attached and

import pandas as pd
from hdbscan import HDBSCAN

df = pd.read_csv("dummy.csv")  # attached below
labels = HDBSCAN(min_cluster_size=100,
                 min_samples=10,
                 allow_single_cluster=True,
                 metric='l1',
                 core_dist_n_jobs=-1).fit_predict(df[['b']])

dummy.csv

SergioG-M avatar Nov 26 '21 15:11 SergioG-M

Have you looked at the condensed tree plots or the single_linkage_tree plots? Those are the places I'd start to explore what is happening.

https://hdbscan.readthedocs.io/en/latest/advanced_hdbscan.html


jc-healy avatar Nov 26 '21 21:11 jc-healy

Yes, but I don't know what to make of them; I don't see any reason for this to happen.

[image: condensed tree / single-linkage tree plot with cluster legend]

SergioG-M avatar Nov 26 '21 21:11 SergioG-M

Thanks for adding that plot. It confirms that there was indeed a single cluster with a single label. I know that is what your legend indicated but your results looked weird enough to my eyes that I really wanted to verify that. That inspired me to download your example and spend a bit of my weekend having a look in slightly greater depth.

The way to diagnose what is happening is to look at the minimum spanning tree which is being generated by this data. hdbscan works by constructing the mst on the mutual reachability distance space and then repeatedly cutting edges (think single linkage clustering). I dumped the mst to a graphml file and loaded it into gephi for a bit of interactive exploration (you can use whatever). I've attached an image of what the mst looks like (with a customized layout for emphasis). The blue points are labelled as noise by hdbscan. The red points are all in the same single cluster. The thing to note is that the mst isn't just a single chain. It has small spurs branching off from it. Some of those spurs are locally non-dense enough that they are pruned from the dominant cluster earlier than expected.
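For reference, the mutual reachability distance is d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)). A small self-contained sketch for 1D l1 data (the function name and toy points are illustrative, not from the library):

```python
# Sketch of mutual reachability distance for 1D points under l1.
import numpy as np

def mutual_reachability(X, k):
    """d_mreach(a, b) = max(core_k(a), core_k(b), |a - b|),
    where core_k(x) is the distance from x to its k-th nearest neighbour."""
    d = np.abs(X[:, None] - X[None, :])   # pairwise l1 distances
    core = np.sort(d, axis=1)[:, k]       # k-th NN distance (column 0 is self)
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

# A near-chain of points with one slightly offset neighbour:
X = np.array([0.0, 1.0, 2.0, 2.1, 3.0])
M = mutual_reachability(X, k=2)
# Even though 2.0 and 2.1 are only 0.1 apart, their mutual reachability
# distance is inflated up to the larger of their core distances.
```

This inflation is what lets a locally sparse spur detach from the chain earlier than the raw distances would suggest.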

I don't think it's a bug in the code. I do think that it is an odd (and perhaps undesirable) quirk of the minimum spanning tree. I've included some code for grabbing the mst if you'd like to dive into this in more depth.

Thanks for bringing this to our attention. It's certainly something to think about.

import pandas as pd
import networkx
import hdbscan

df = pd.read_csv("/Users/jchealy/Downloads/dummy.csv")
model = hdbscan.HDBSCAN(min_cluster_size=100,
                        min_samples=10,
                        allow_single_cluster=True,
                        metric='l1',
                        core_dist_n_jobs=-1,
                        approx_min_span_tree=False,
                        gen_min_span_tree=True)
labels = model.fit_predict(df)

mst = model.minimum_spanning_tree_.to_networkx()
label_map = {i: label for i, label in enumerate(labels)}
value_map = {i: value for i, value in enumerate(df.b)}

networkx.set_node_attributes(mst, label_map, "cluster")
networkx.set_node_attributes(mst, value_map, "data")

networkx.write_graphml(mst, 'univariate_cluster_mst.graphml')

[image: one_dimensional_mst — the minimum spanning tree, noise points in blue, cluster points in red, with small spurs branching off the main chain]
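If you'd rather skip gephi, a mutual-reachability mst can also be rebuilt and probed programmatically. A hedged sketch on synthetic 1D data (the points, k, and variable names are illustrative, not the attached dataset):

```python
# Rebuild a mutual-reachability MST and look for spur (leaf) nodes.
import numpy as np
import networkx as nx
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(1)
X = np.sort(rng.normal(0.0, 1.0, 200))

k = 10                                   # plays the role of min_samples
d = np.abs(X[:, None] - X[None, :])      # pairwise l1 distances
core = np.sort(d, axis=1)[:, k]          # core distances
mreach = np.maximum(d, np.maximum(core[:, None], core[None, :]))
np.fill_diagonal(mreach, 0)              # no self-edges

# Minimum spanning tree over the mutual reachability graph.
G = nx.from_scipy_sparse_array(minimum_spanning_tree(mreach))

# In a pure chain only the two endpoints are leaves; any extra leaves
# are spurs that can be pruned from the cluster earlier than expected.
leaves = [n for n in G if G.degree(n) == 1]
```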

jc-healy avatar Nov 27 '21 18:11 jc-healy

Hi,

Since I first encountered this, I have found more datasets where it happens, and I don't think it makes any sense to get disconnected clusters in 1D. (Moreover, I don't see any reason why this shouldn't happen in higher dimensions as well, although there it would be much harder to notice.)

I would like to understand why it does this, as I don't see how the mutual reachability distance can produce this difference.

Has anyone else ever encountered this? Could it be solved by trying a different metric or a different set of parameters? I think this must happen in higher dimensions as well, but in 1D it is clearly undesirable.

SergioG-M avatar Dec 22 '21 19:12 SergioG-M

Not the same as before, but here is another example that something is not right: I am trying to cluster a sample that takes only 68 distinct values, yet I get different clusters for the same values.

[images: cluster assignments showing the same input value mapped to different clusters]

I attach the data used for this example, but I see this every time. I just checked other data that I cluster in 3 dimensions and also get identical points assigned to two different clusters. strange_cluster.csv
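A quick way to confirm this symptom on any dataset is to count distinct labels per distinct value. A sketch with made-up toy values (the column names are illustrative, not from the attached csv):

```python
# Hypothetical sanity check: do identical input values ever get
# different cluster labels?
import numpy as np
import pandas as pd

values = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 3.0])
labels = np.array([0,   1,   0,   0,   0,  -1])

df = pd.DataFrame({"value": values, "label": labels})
per_value = df.groupby("value")["label"].nunique()
inconsistent = per_value[per_value > 1]   # values split across labels
print(inconsistent)
```

If `inconsistent` is non-empty, equal points received different labels, which is the symptom shown in the plots above.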

SergioG-M avatar Dec 23 '21 07:12 SergioG-M