Improve the condense hierarchy step of HDBSCAN
Closes https://github.com/rapidsai/cuml/issues/7377
This PR optimizes the `build_condensed_hierarchy` step of HDBSCAN.
Our previous implementation runs a top-down BFS traversal of the dendrogram, launching a GPU kernel for every level of the tree. This is very slow when the tree is unbalanced: a deep, skewed tree means many kernel launches, each doing very little work.
This PR introduces a bottom-up approach that pointer-chases from each point up through its parents on the CPU using OpenMP threads. This is much faster, with no accuracy loss in the final result.
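For illustration, here is a minimal sketch of the bottom-up pass, assuming a parent-pointer representation of the dendrogram; the names (`parent`, `lambda_of`, `is_persistent`) are hypothetical and this is not the actual cuML code:

```cpp
#include <cstdint>
#include <vector>

// Bottom-up condense pass (illustrative sketch). `parent[i]` is the parent of
// node i in the single-linkage dendrogram, `lambda_of[i]` is 1/distance at
// which node i merges into its parent, and `is_persistent[i]` marks nodes
// that survive condensation (the root must be persistent so climbs terminate).
void condense_bottom_up(const std::vector<int64_t>& parent,
                        const std::vector<float>& lambda_of,
                        const std::vector<uint8_t>& is_persistent,
                        int64_t n_points,
                        std::vector<int64_t>& out_cluster,  // persistent ancestor per point
                        std::vector<float>& out_lambda) {   // lambda at which the point falls out
  out_cluster.resize(n_points);
  out_lambda.resize(n_points);
  // Dynamic scheduling helps when climb lengths vary a lot across points.
#pragma omp parallel for schedule(dynamic, 1024)
  for (int64_t i = 0; i < n_points; ++i) {
    int64_t node = i;
    // Pointer-chase upward until the parent is a persistent node.
    while (!is_persistent[parent[node]]) {
      node = parent[node];
    }
    out_cluster[i] = parent[node];
    out_lambda[i]  = lambda_of[node];
  }
}
```

Since each point climbs independently, the loop parallelizes trivially; the cost per point is the number of collapsed ancestors above it, which is why the shape of the tree matters (see the discussion below).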
The table below shows timings for the two main parts of our HDBSCAN implementation (linkage build and condense).
`adjusted_rand_score` is computed against the results of our brute-force graph build + the original GPU condense implementation.
- **BF + orig**: brute-force mutual-reachability (MR) graph build + original top-down GPU condense
- **NND + orig**: nn-descent MR graph build + original top-down GPU condense
- **BF + new**: brute-force MR graph build + new bottom-up CPU condense in this PR
- **NND + new**: nn-descent MR graph build + new bottom-up CPU condense in this PR
The number of OpenMP threads affects performance, but the runtime doesn't always scale linearly with the thread count; I believe this depends on the shape of the tree.
I think the number of persistent nodes matters: the more persistent nodes there are, the fewer levels of the tree each thread is likely to climb (since a climb stops as soon as it reaches a persistent node).
The ratio of persistent nodes to total internal nodes may therefore be the relevant quantity.
The dispatch heuristics were chosen after verifying that the persistent/internal node ratio does affect the performance of this CPU implementation; a hypothetical sketch of such a dispatch follows below.
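As a rough illustration of the kind of ratio-based dispatch this suggests (a hypothetical sketch; the function name and the threshold are invented for illustration and are not the values used in this PR):

```cpp
#include <cstdint>

// Hypothetical dispatch sketch; the threshold is invented for illustration
// and is not the value used in this PR.
bool use_cpu_bottom_up(int64_t n_persistent, int64_t n_internal) {
  // Many persistent nodes => each point's climb stops early => little work
  // per OpenMP thread, so the CPU bottom-up pass tends to win.
  const double ratio =
    static_cast<double>(n_persistent) / static_cast<double>(n_internal);
  return ratio >= 0.05;  // illustrative threshold only
}
```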
Changing the target branch to `main` to target the 26.02 release.
Force-pushed because of issues while rebasing onto the `main` branch.
The latest version follows scikit-learn's condense implementation. This is faster than both our optimized GPU and CPU versions.
The numbers below are in seconds for the condense hierarchy step; green marks the latest changes in this PR. This implementation doesn't depend on the number of threads (no parallelism, just a single for loop over the nodes in BFS order). The top-down CPU approach is much faster because it never explores or descends into the subtree of an internal node whose parent has already been collapsed: such a subtree is emitted once as points and skipped thereafter. A sketch of the approach follows after the benchmark notes.
All runs use the same MST built from the dataset, and the final `adjusted_rand_score` against scikit-learn's result is 1.0 (a perfect match).
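For reference, here is a minimal C++ sketch of this scikit-learn-style condense pass under the standard SciPy dendrogram layout (leaves are `0..n_points-1`; internal node `k >= n_points` is row `k - n_points` with children `left[row]`, `right[row]` merged at distance `dist[row]`, subtree size `sz[row]`). All names are hypothetical; this is not the actual cuML or scikit-learn source:

```cpp
#include <cstdint>
#include <limits>
#include <queue>
#include <vector>

struct CondensedEdge {
  int64_t parent;  // condensed cluster label
  int64_t child;   // point id (< n_points) or cluster label (>= n_points)
  float lambda;    // 1 / merge distance
  int64_t size;    // number of points under `child`
};

// BFS order of all nodes under `root`.
static std::vector<int64_t> bfs_from(int64_t root, int64_t n_points,
                                     const std::vector<int64_t>& left,
                                     const std::vector<int64_t>& right) {
  std::vector<int64_t> order;
  std::queue<int64_t> q;
  q.push(root);
  while (!q.empty()) {
    int64_t n = q.front();
    q.pop();
    order.push_back(n);
    if (n >= n_points) {
      q.push(left[n - n_points]);
      q.push(right[n - n_points]);
    }
  }
  return order;
}

std::vector<CondensedEdge> condense(const std::vector<int64_t>& left,
                                    const std::vector<int64_t>& right,
                                    const std::vector<float>& dist,  // merge distance per row
                                    const std::vector<int64_t>& sz,  // subtree size per row
                                    int64_t n_points,
                                    int64_t min_cluster_size) {
  const int64_t root = 2 * n_points - 2;
  std::vector<CondensedEdge> out;
  std::vector<int64_t> relabel(2 * n_points - 1, -1);
  std::vector<uint8_t> ignore(2 * n_points - 1, 0);
  int64_t next_label = n_points;  // cluster labels start after the point ids
  relabel[root] = next_label++;

  auto count = [&](int64_t n) { return n < n_points ? int64_t{1} : sz[n - n_points]; };
  // Emit every point under `n` as falling out of cluster `label` at `lam`,
  // and mark the whole subtree so the main loop skips it in O(1) per node.
  auto collapse = [&](int64_t n, int64_t label, float lam) {
    for (int64_t s : bfs_from(n, n_points, left, right)) {
      if (s < n_points) out.push_back({label, s, lam, 1});
      ignore[s] = 1;
    }
  };

  for (int64_t node : bfs_from(root, n_points, left, right)) {
    if (node < n_points || ignore[node]) continue;  // leaf, or inside a collapsed subtree
    const int64_t row = node - n_points;
    const int64_t l = left[row], r = right[row];
    const float lam =
      dist[row] > 0.0f ? 1.0f / dist[row] : std::numeric_limits<float>::max();
    const bool l_big = count(l) >= min_cluster_size;
    const bool r_big = count(r) >= min_cluster_size;
    if (l_big && r_big) {           // true split: both children become new clusters
      relabel[l] = next_label++;
      out.push_back({relabel[node], relabel[l], lam, count(l)});
      relabel[r] = next_label++;
      out.push_back({relabel[node], relabel[r], lam, count(r)});
    } else if (!l_big && !r_big) {  // both sides fall out as individual points
      collapse(l, relabel[node], lam);
      collapse(r, relabel[node], lam);
    } else if (l_big) {             // cluster persists through the left child
      relabel[l] = relabel[node];
      collapse(r, relabel[node], lam);
    } else {                        // cluster persists through the right child
      relabel[r] = relabel[node];
      collapse(l, relabel[node], lam);
    }
  }
  return out;
}
```

The key property is the `ignore` array: once a subtree is collapsed, its points are emitted exactly once and every node inside it is skipped in constant time by the main loop, so no work is spent re-traversing collapsed structure.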
@csadorf @divyegala Can I get another round of reviews on this PR? The latest version has changed substantially since the previous reviews.
In the previous version, we dispatched between the CPU bottom-up and the GPU top-down implementations based on heuristics.
- Problems with the previous version:
  - It was difficult to find reliable heuristics.
  - Even the optimized bottom-up CPU version was slower than scikit-learn's condense implementation.
The new implementation follows scikit-learn's condense approach. It doesn't exploit parallelism, but it is still faster than both of our previous versions because it never explores or descends into the subtree of an internal node whose parent has already been collapsed (new benchmarks, and a sketch of the approach, in the comment above).
Game-changing performance improvement :)
/merge