
Where is _leaf_cluster_label?

Open jgonggrijp opened this issue 4 years ago • 5 comments

Quote https://hdbscan.readthedocs.io/en/latest/faq.html#q-i-mostly-just-get-one-large-cluster-i-want-smaller-clusters:

If you are getting a single large cluster and a few small outlying clusters that means your data is essentially a large glob with some small outlying clusters – there may be structure to the glob, but compared to how well separated those other small clusters are, it doesn’t really show up. You may, however, want to get at that more fine grained structure. You can do that, and what you are looking for is leaf clustering _leaf_cluster_label .

Neither my HDBSCAN instance nor any of its attributes has a _leaf_cluster_label attribute. In fact, as far as I can tell, there is nothing in the hdbscan package with this name. Searching for this name on the web, I only find references back to the FAQ page. Is it outdated documentation? Planned but unrealized functionality? Something that should be imported from another package?

Any help would be greatly appreciated. Thanks in advance!

jgonggrijp avatar Feb 17 '21 16:02 jgonggrijp

It looks like a broken link in the docs; what you want is something like:

HDBSCAN(cluster_selection_method="leaf")

See this section of the docs: https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#leaf-clustering

lmcinnes avatar Feb 17 '21 16:02 lmcinnes

Thanks for the fast response, that is really helpful. Is there a way to extract the leaf clusters from an already constructed tree, though? We are clustering 1.4M 100-dimensional w2v vectors, and clustering takes a whole week.

jgonggrijp avatar Feb 17 '21 16:02 jgonggrijp

If you have the model and/or tree saved off then yes, you can extract leaf clusters directly, but it takes a bit of code. Probably the easiest way to get there would be something like:

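import hdbscan

# Reuse the condensed tree from the already fitted model (no refit needed).
original_tree = model.condensed_tree_
raw_tree = original_tree.to_numpy()
# Per-cluster stabilities, required by the cluster-selection step.
stability = hdbscan._hdbscan_tree.compute_stability(raw_tree)
# Select the leaves of the tree instead of the default "eom" clusters.
(
    cluster_labels,
    cluster_probs,
    cluster_stabs,
) = hdbscan._hdbscan_tree.get_clusters(raw_tree, stability, "leaf")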

Although that is largely from memory, so if it doesn't quite work I'm sure we can patch it up -- but I do believe that should do the job.

lmcinnes avatar Feb 17 '21 22:02 lmcinnes

Yes, that worked. Thanks again!

jgonggrijp avatar Feb 18 '21 14:02 jgonggrijp

@lmcinnes Hey, thank you for the code to generate the leaf cluster labels. I tried it and it worked. My concern is this: the original clustering of my roughly 1.5M samples looked like this.

[[     -1   63832]
 [      0 1495752]
 [      1     273]
 [      2     432]]

I used the code:

model = hdbscan.HDBSCAN(
    min_samples=1,
    min_cluster_size=200,
)
model.fit(data)

After using your code to generate the leaf clusters, the noise in my data increased drastically, from about 63k to 1.4M points:

[[     -1 1411527]
 [      0     273]
 [      1     432]
 [      2     204]
 [      3     267].....
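For reference, label/count tables like the two above can be built with NumPy. This is a minimal sketch; `label_counts` is a hypothetical helper name, and the toy array stands in for a model's label array:

```python
import numpy as np

def label_counts(labels):
    """Return an (n_labels, 2) array of [label, count] rows, sorted by label."""
    values, counts = np.unique(labels, return_counts=True)
    return np.column_stack((values, counts))

# Toy labels standing in for real cluster labels; -1 marks noise.
toy_labels = np.array([-1, 0, 0, 1, -1, 0])
table = label_counts(toy_labels)
# Fraction of samples labelled as noise.
noise_fraction = np.mean(toy_labels == -1)
print(table)
print(noise_fraction)
```

Running the same helper on the original labels and on the leaf labels makes the change in the noise row easy to compare.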

Sure, the number of clusters went from 3 to 343, but how can I deal with the noise in this case? (FYI: I am using hdbscan with CUDA.)

Thank you.

preet2312 avatar Jan 27 '23 22:01 preet2312