Clustering.jl
Add HDBSCAN from HorseML.jl
Add HDBSCAN
I found this issue. I wanted my code to be useful, so I compiled it into hdbscan.jl so that it works as is. I changed some of my original code to bring it closer to the style of this repo.
Abstract of functions and structures
- HDBSCANGraph: used to build the minimum spanning tree
- HDBSCANCluster: used to build clusters based on the minimum spanning tree
- HDBSCANResult: used to return the result
- hdbscan!: the main function that performs HDBSCAN
As I wrote in the comments, many of the utility functions were converted from numpy by hand, so I don't know much about their internals.
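Since the role of HDBSCANGraph is to support building the minimum spanning tree, here is a minimal, self-contained sketch of that step using Prim's algorithm over a dense weight matrix. This is only an illustration of the idea; `prim_mst` and its internals are my own names, not code from this PR.

```julia
# Minimal sketch of building a minimum spanning tree with Prim's algorithm
# on a dense weight matrix (e.g. mutual-reachability distances).
function prim_mst(W::AbstractMatrix{<:Real})
    n = size(W, 1)
    intree = falses(n)
    dist = fill(Inf, n)     # cheapest edge weight connecting each vertex to the tree
    parent = zeros(Int, n)  # the tree-side endpoint of that cheapest edge
    dist[1] = 0.0
    edges = Tuple{Int,Int,Float64}[]
    for _ in 1:n
        # pick the cheapest vertex not yet in the tree
        u = 0
        for v in 1:n
            if !intree[v] && (u == 0 || dist[v] < dist[u])
                u = v
            end
        end
        intree[u] = true
        parent[u] != 0 && push!(edges, (parent[u], u, dist[u]))
        # relax edges from u to the remaining vertices
        for v in 1:n
            if !intree[v] && W[u, v] < dist[v]
                dist[v] = W[u, v]
                parent[v] = u
            end
        end
    end
    return edges  # the n-1 edges of the MST
end
```

For example, `prim_mst([0 1 4; 1 0 2; 4 2 0.0])` returns two edges with total weight 3.0.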
Usage
This is the usage of main function.
hdbscan!(points::AbstractMatrix, k::Int64, min_cluster_size::Int64; gen_mst::Bool=true, mst=nothing)
Parameters
- `points`: the d×n matrix, where each column is the d-dimensional coordinate of a point
- `k`: the "core distance" of point A is defined as the distance between point A and the k-th nearest neighbor of point A
- `min_cluster_size`: the minimum number of points in a cluster
- `gen_mst`: whether or not to generate the minimum spanning tree
- `mst`: when `mst` is specified and `gen_mst` is false, a new MST won't be generated
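To illustrate the role of the `k` parameter, here is a hedged sketch of computing the core distance of every point. `core_distance` is an illustrative helper, not a function from this PR; it assumes Euclidean distance and excludes the point itself when counting neighbors.

```julia
# Sketch: core distance = distance from each point to its k-th nearest neighbor.
# `points` follows the hdbscan! convention above: columns are points.
function core_distance(points::AbstractMatrix, k::Int)
    n = size(points, 2)
    cdist = zeros(n)
    for i in 1:n
        dists = [sqrt(sum(abs2, points[:, i] .- points[:, j])) for j in 1:n]
        sort!(dists)
        cdist[i] = dists[k + 1]  # dists[1] is the zero distance to the point itself
    end
    return cdist
end
```

For three collinear points at x = 0, 1, 2, `core_distance` with k = 1 gives `[1.0, 1.0, 1.0]`, and with k = 2 gives `[2.0, 1.0, 2.0]`.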
Example
I checked that the following code works:
# include hdbscan.jl before running this code
using CSV
using DataFrames
using Plots

data = CSV.read("/home/watasu/Documents/code/HorseML.jl/test/data/clustering2.csv", DataFrame) |> Matrix
result = hdbscan!(data, 5, 3)

plot(title = "Clustering by HDBSCAN")
labels = result.labels
for i in -1:maximum(labels)
    X = data[findall(labels .== i), :]
    plot!(X[:, 1], X[:, 2], st = :scatter)
end
plot!()
Codecov Report
:x: Patch coverage is 97.22222% with 4 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 95.56%. Comparing base (b4df21a) to head (dc5dd40).
:warning: Report is 20 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/hdbscan.jl | 97.39% | 3 Missing :warning: |
| src/unionfind.jl | 96.55% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #273 +/- ##
==========================================
+ Coverage 95.40% 95.56% +0.15%
==========================================
Files 20 22 +2
Lines 1503 1647 +144
==========================================
+ Hits 1434 1574 +140
- Misses 69 73 +4
@MommaWatasu Thanks for the PR! I think it would be a useful addition to the package. I will try to review it soon. Meanwhile -- it looks like there are no unit tests for the method. Could you please add some?
I added some tests for hdbscan. But I'm sorry, I wasn't able to write unit tests for the utility functions. If you want to write tests for them, you should check scipy for their specification. They come from:
- erf: scipy.special.erf
- logpdf: I think the original code is this
@MommaWatasu Thanks for adding unit tests. Generally we don't test internal utility functions, we only test the public API, but aim to cover both meaningful examples (small datasets with nontrivial clusterings) and corner cases (e.g. a single data point). As for the utility functions: erf/erfc are available in Distributions.jl, which Clustering.jl already depends upon, so you should rather use these ones. pdf/logpdf are also declared there.
I noticed that erf and logpdf aren't for HDBSCAN (but for Xmeans). I deleted them and updated test. And also, I added simple docs since Documentation workflow failed without it.
Great, erf is provided by SpecialFunctions.jl, but as we want to be conservative about the number of package dependencies, it is convenient that we don't need it.
@alyst I applied all your suggestions. Is there anything else I have to fix?
@MommaWatasu Thank you for adjusting the PR! We are getting close to being able to merge it, but we would need one more iteration.
Note that I have pushed some adjustments to the code directly to your branch, so please make sure to pull them first.
TODO items:
- at the moment the tests that you have do not really test whether the clusters are correct. In fact, at the moment all the points are assigned to the noise cluster (I think it was also the case before my changes). Please add test(s) that the assignments/clusters are correct. Ideally, we need to test a more complex clustering (more than 2 clusters); it would also be nice to test that changing `ncore` or `min_cluster_size` affects the result
- Now that I understand the algorithm a bit more, it looks like `HdbscanCluster` is an internal structure, and it is rather a `HdbscanTree` node than the cluster you return to the user. It contains fields like `stability`, `children` etc.: some of them we should not expose to the user, the others you don't really set when you are preparing the resulting clusters. I suggest that you rename this structure into `HdbscanNode` (a non-exported structure for the algorithm) and create a new one, `HdbscanCluster` (the exported one returned to the user). If there are any properties of the cluster, such as `stability` or being noise, please make sure to add the relevant fields to `HdbscanCluster` and properly initialize them when you are generating the result.
- `HdbscanResult` should inherit from `ClusteringResult` and support its API (in particular, `counts` is missing)
- move `UnionFind` to a separate source file, unionfind.jl. I'm not 100% sure it belongs to Clustering.jl. We may potentially use DataStructures.jl, but this code looks rather compact, so I think I prefer keeping it over depending on another big package.
- clean up the `UnionFind` terminology. I think `set_id`/`issameset`/`items` would be an optimal choice.
- add unit tests for `UnionFind`, e.g. that finding the root, `issameset`, and `unite!` work as expected
- add the newlines in the docstrings that separate the declaration from the description
- add newlines that separate struct fields from the inner constructor
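The UnionFind items above can be sketched as follows. This is a minimal illustration using the set_id/issameset/items terminology suggested in the review, not the code actually merged in the PR; the field layout and `unite!` return convention are my own assumptions.

```julia
# Minimal union-find sketch with the suggested terminology.
mutable struct UnionFind
    parent::Vector{Int}

    UnionFind(n::Int) = new(collect(1:n))
end

# root of the set containing x, with path compression
function set_id(uf::UnionFind, x::Int)
    if uf.parent[x] != x
        uf.parent[x] = set_id(uf, uf.parent[x])
    end
    return uf.parent[x]
end

# are x and y in the same set?
issameset(uf::UnionFind, x::Int, y::Int) = set_id(uf, x) == set_id(uf, y)

# merge the sets containing x and y; returns false if already united
function unite!(uf::UnionFind, x::Int, y::Int)
    rx, ry = set_id(uf, x), set_id(uf, y)
    rx == ry && return false
    uf.parent[ry] = rx
    return true
end

# all elements in the same set as x
items(uf::UnionFind, x::Int) =
    [i for i in eachindex(uf.parent) if issameset(uf, i, x)]
```

A unit test along the lines the review asks for would, e.g., build `UnionFind(5)`, call `unite!(uf, 1, 2)` and `unite!(uf, 2, 3)`, and then check that `issameset(uf, 1, 3)` holds, `issameset(uf, 1, 4)` does not, and `items(uf, 1) == [1, 2, 3]`.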
@alyst I have one thing to apologize for. I found that the reason the clustering result went wrong was a serious mistake I made in the algorithm. I fixed the algorithm and checked that the result is correct. In addition, all the TODO items have been done (but I couldn't add a unit test for ncore because I don't know how it affects the result).
I would appreciate it if you could check for any performance issues regarding the fixed algorithm.
Hi, thanks for the effort of bringing HDBSCAN to Clustering.jl! I wonder what is the current status of this issue?
I would also be keen to get this merged soon, it looks great :)