Improve the condense hierarchy step of HDBSCAN
Closes https://github.com/rapidsai/cuml/issues/7377
This PR optimizes the `build_condensed_hierarchy` step of HDBSCAN.
Our previous implementation runs a top-down BFS traversal of the dendrogram, launching a GPU kernel for every level of the tree. This is very slow when the tree is unbalanced: a deep, skewed tree means many kernel launches, each doing very little work.
This PR introduces a bottom-up approach that pointer-chases from each point up through its parents on the CPU using OpenMP threads. This is much faster, with no accuracy loss in the final result.
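For illustration, here is a minimal sketch of the bottom-up pass, assuming a parent-pointer representation of the dendrogram; the names (`parent`, `lambda_of`, `is_persistent`) are hypothetical and this is not the actual cuML code:

```cpp
#include <cstdint>
#include <vector>

// Bottom-up condense pass (illustrative sketch). `parent[i]` is the parent of
// node i in the single-linkage dendrogram, `lambda_of[i]` is 1/distance at
// which node i merges into its parent, and `is_persistent[i]` marks nodes
// that survive condensation (the root must be persistent so climbs terminate).
void condense_bottom_up(const std::vector<int64_t>& parent,
                        const std::vector<float>& lambda_of,
                        const std::vector<uint8_t>& is_persistent,
                        int64_t n_points,
                        std::vector<int64_t>& out_cluster,  // persistent ancestor per point
                        std::vector<float>& out_lambda) {   // lambda at which the point falls out
  out_cluster.resize(n_points);
  out_lambda.resize(n_points);
  // Dynamic scheduling helps when climb lengths vary a lot across points.
#pragma omp parallel for schedule(dynamic, 1024)
  for (int64_t i = 0; i < n_points; ++i) {
    int64_t node = i;
    // Pointer-chase upward until the parent is a persistent node.
    while (!is_persistent[parent[node]]) {
      node = parent[node];
    }
    out_cluster[i] = parent[node];
    out_lambda[i]  = lambda_of[node];
  }
}
```

Since each point climbs independently, the loop parallelizes trivially; the cost per point is the number of collapsed ancestors above it, which is why the shape of the tree matters (see the discussion below).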
The table below shows timings for the two main parts of our HDBSCAN implementation (linkage build and condense).
`adjusted_rand_score` is computed against the results of our brute-force graph build + the original GPU condense implementation.
- **BF + orig**: brute-force mutual-reachability (MR) graph build + original top-down GPU condense
- **NND + orig**: nn-descent MR graph build + original top-down GPU condense
- **BF + new**: brute-force MR graph build + new bottom-up CPU condense in this PR
- **NND + new**: nn-descent MR graph build + new bottom-up CPU condense in this PR
The number of OpenMP threads affects performance, but the runtime doesn't always scale linearly with the thread count; I believe this depends on the shape of the tree.
I think the number of persistent nodes matters: the more persistent nodes there are, the fewer levels of the tree each thread is likely to climb (since a climb stops as soon as it reaches a persistent node).
The ratio of persistent nodes to total internal nodes may therefore be the relevant quantity.
The dispatch heuristics were chosen after verifying that the persistent/internal node ratio does affect the performance of this CPU implementation; a hypothetical sketch of such a dispatch follows below.
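As a rough illustration of the kind of ratio-based dispatch this suggests (a hypothetical sketch; the function name and the threshold are invented for illustration and are not the values used in this PR):

```cpp
#include <cstdint>

// Hypothetical dispatch sketch; the threshold is invented for illustration
// and is not the value used in this PR.
bool use_cpu_bottom_up(int64_t n_persistent, int64_t n_internal) {
  // Many persistent nodes => each point's climb stops early => little work
  // per OpenMP thread, so the CPU bottom-up pass tends to win.
  const double ratio =
    static_cast<double>(n_persistent) / static_cast<double>(n_internal);
  return ratio >= 0.05;  // illustrative threshold only
}
```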
Changing the target branch to `main` to target the 26.02 release.
Force-pushed because of issues while rebasing onto the `main` branch.
The latest version follows scikit-learn's condense implementation. This is faster than both our optimized GPU and CPU versions.
The numbers below are in seconds for the condense hierarchy step; green marks the latest changes in this PR. This implementation doesn't depend on the number of threads (no parallelism, just a single for loop over the nodes in BFS order). The top-down CPU approach is much faster because it never explores or descends into the subtree of an internal node whose parent has already been collapsed: such a subtree is emitted once as points and skipped thereafter. A sketch of the approach follows after the benchmark notes.
All runs use the same MST built from the dataset, and the final `adjusted_rand_score` against scikit-learn's result is 1.0 (a perfect match).
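For reference, here is a minimal C++ sketch of this scikit-learn-style condense pass under the standard SciPy dendrogram layout (leaves are `0..n_points-1`; internal node `k >= n_points` is row `k - n_points` with children `left[row]`, `right[row]` merged at distance `dist[row]`, subtree size `sz[row]`). All names are hypothetical; this is not the actual cuML or scikit-learn source:

```cpp
#include <cstdint>
#include <limits>
#include <queue>
#include <vector>

struct CondensedEdge {
  int64_t parent;  // condensed cluster label
  int64_t child;   // point id (< n_points) or cluster label (>= n_points)
  float lambda;    // 1 / merge distance
  int64_t size;    // number of points under `child`
};

// BFS order of all nodes under `root`.
static std::vector<int64_t> bfs_from(int64_t root, int64_t n_points,
                                     const std::vector<int64_t>& left,
                                     const std::vector<int64_t>& right) {
  std::vector<int64_t> order;
  std::queue<int64_t> q;
  q.push(root);
  while (!q.empty()) {
    int64_t n = q.front();
    q.pop();
    order.push_back(n);
    if (n >= n_points) {
      q.push(left[n - n_points]);
      q.push(right[n - n_points]);
    }
  }
  return order;
}

std::vector<CondensedEdge> condense(const std::vector<int64_t>& left,
                                    const std::vector<int64_t>& right,
                                    const std::vector<float>& dist,  // merge distance per row
                                    const std::vector<int64_t>& sz,  // subtree size per row
                                    int64_t n_points,
                                    int64_t min_cluster_size) {
  const int64_t root = 2 * n_points - 2;
  std::vector<CondensedEdge> out;
  std::vector<int64_t> relabel(2 * n_points - 1, -1);
  std::vector<uint8_t> ignore(2 * n_points - 1, 0);
  int64_t next_label = n_points;  // cluster labels start after the point ids
  relabel[root] = next_label++;

  auto count = [&](int64_t n) { return n < n_points ? int64_t{1} : sz[n - n_points]; };
  // Emit every point under `n` as falling out of cluster `label` at `lam`,
  // and mark the whole subtree so the main loop skips it in O(1) per node.
  auto collapse = [&](int64_t n, int64_t label, float lam) {
    for (int64_t s : bfs_from(n, n_points, left, right)) {
      if (s < n_points) out.push_back({label, s, lam, 1});
      ignore[s] = 1;
    }
  };

  for (int64_t node : bfs_from(root, n_points, left, right)) {
    if (node < n_points || ignore[node]) continue;  // leaf, or inside a collapsed subtree
    const int64_t row = node - n_points;
    const int64_t l = left[row], r = right[row];
    const float lam =
      dist[row] > 0.0f ? 1.0f / dist[row] : std::numeric_limits<float>::max();
    const bool l_big = count(l) >= min_cluster_size;
    const bool r_big = count(r) >= min_cluster_size;
    if (l_big && r_big) {           // true split: both children become new clusters
      relabel[l] = next_label++;
      out.push_back({relabel[node], relabel[l], lam, count(l)});
      relabel[r] = next_label++;
      out.push_back({relabel[node], relabel[r], lam, count(r)});
    } else if (!l_big && !r_big) {  // both sides fall out as individual points
      collapse(l, relabel[node], lam);
      collapse(r, relabel[node], lam);
    } else if (l_big) {             // cluster persists through the left child
      relabel[l] = relabel[node];
      collapse(r, relabel[node], lam);
    } else {                        // cluster persists through the right child
      relabel[r] = relabel[node];
      collapse(l, relabel[node], lam);
    }
  }
  return out;
}
```

The key property is the `ignore` array: once a subtree is collapsed, its points are emitted exactly once and every node inside it is skipped in constant time by the main loop, so no work is spent re-traversing collapsed structure.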
@csadorf @divyegala Can I get another round of reviews on this PR? The latest version has changed substantially since the previous reviews.
In the previous version, we dispatched between the CPU bottom-up and the GPU top-down implementations based on heuristics.
- Problems with the previous version:
  - It was difficult to find reliable heuristics.
  - Even the optimized bottom-up CPU version was slower than scikit-learn's condense implementation.
The new implementation follows scikit-learn's condense approach. It doesn't exploit parallelism, but it is still faster than both of our previous versions because it never explores or descends into the subtree of an internal node whose parent has already been collapsed (new benchmarks, and a sketch of the approach, in the comment above).
Game-changing performance improvement :)
/merge