Clustering.jl
Add HDBSCAN from HorseML.jl
Add HDBSCAN
I found this issue. I wanted my code to be useful, so I compiled it into hdbscan.jl so that it works as is. I changed some of my original code to bring it closer to the style of this repo.
Abstract of functions and structures
- HDBSCANGraph: used to build the minimum spanning tree
- HDBSCANCluster: used to build clusters based on the minimum spanning tree
- HDBSCANResult: used to return the result
- hdbscan!: the main function that performs HDBSCAN
As I wrote in the comments, many of the utility functions were converted from numpy by hand, so I don't know much about their internals.
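Since the role of HDBSCANGraph is to support building the minimum spanning tree, here is a minimal, self-contained sketch of that step using Prim's algorithm over a dense weight matrix. This is only an illustration of the idea; `prim_mst` and its internals are my own names, not code from this PR.

```julia
# Minimal sketch of building a minimum spanning tree with Prim's algorithm
# on a dense weight matrix (e.g. mutual-reachability distances).
function prim_mst(W::AbstractMatrix{<:Real})
    n = size(W, 1)
    intree = falses(n)
    dist = fill(Inf, n)     # cheapest edge weight connecting each vertex to the tree
    parent = zeros(Int, n)  # the tree-side endpoint of that cheapest edge
    dist[1] = 0.0
    edges = Tuple{Int,Int,Float64}[]
    for _ in 1:n
        # pick the cheapest vertex not yet in the tree
        u = 0
        for v in 1:n
            if !intree[v] && (u == 0 || dist[v] < dist[u])
                u = v
            end
        end
        intree[u] = true
        parent[u] != 0 && push!(edges, (parent[u], u, dist[u]))
        # relax edges from u to the remaining vertices
        for v in 1:n
            if !intree[v] && W[u, v] < dist[v]
                dist[v] = W[u, v]
                parent[v] = u
            end
        end
    end
    return edges  # the n-1 edges of the MST
end
```

For example, `prim_mst([0 1 4; 1 0 2; 4 2 0.0])` returns two edges with total weight 3.0.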
Usage
This is the usage of main function.
hdbscan!(points::AbstractMatrix, k::Int64, min_cluster_size::Int64; gen_mst::Bool=true, mst=nothing)
Parameters
- `points`: the d×n matrix, where each column is the d-dimensional coordinate of a point
- `k`: the "core distance" of point A is defined as the distance between point A and the k-th nearest neighbor of point A
- `min_cluster_size`: the minimum number of points in a cluster
- `gen_mst`: whether or not to generate the minimum spanning tree
- `mst`: when `mst` is specified and `gen_mst` is false, a new MST won't be generated
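To illustrate the role of the `k` parameter, here is a hedged sketch of computing the core distance of every point. `core_distance` is an illustrative helper, not a function from this PR; it assumes Euclidean distance and excludes the point itself when counting neighbors.

```julia
# Sketch: core distance = distance from each point to its k-th nearest neighbor.
# `points` follows the hdbscan! convention above: columns are points.
function core_distance(points::AbstractMatrix, k::Int)
    n = size(points, 2)
    cdist = zeros(n)
    for i in 1:n
        dists = [sqrt(sum(abs2, points[:, i] .- points[:, j])) for j in 1:n]
        sort!(dists)
        cdist[i] = dists[k + 1]  # dists[1] is the zero distance to the point itself
    end
    return cdist
end
```

For three collinear points at x = 0, 1, 2, `core_distance` with k = 1 gives `[1.0, 1.0, 1.0]`, and with k = 2 gives `[2.0, 1.0, 2.0]`.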
Example
I checked that the following code works:
# include hdbscan.jl before running this code
using CSV
using DataFrames
using Plots

data = CSV.read("/home/watasu/Documents/code/HorseML.jl/test/data/clustering2.csv", DataFrame) |> Matrix
result = hdbscan!(data, 5, 3)

plot(title = "Clustering by HDBSCAN")
labels = result.labels
for i in -1:maximum(labels)
    X = data[findall(labels .== i), :]
    plot!(X[:, 1], X[:, 2], st = :scatter)
end
plot!()
Codecov Report
:x: Patch coverage is 97.22222% with 4 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 95.56%. Comparing base (b4df21a) to head (dc5dd40).
:warning: Report is 20 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/hdbscan.jl | 97.39% | 3 Missing :warning: |
| src/unionfind.jl | 96.55% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #273 +/- ##
==========================================
+ Coverage 95.40% 95.56% +0.15%
==========================================
Files 20 22 +2
Lines 1503 1647 +144
==========================================
+ Hits 1434 1574 +140
- Misses 69 73 +4
@MommaWatasu Thanks for the PR! I think it would be a useful addition to the package. I will try to review it soon. Meanwhile -- it looks like there are no unit tests for the method. Could you please add some?
I added some tests for hdbscan. But I'm sorry, I wasn't able to write unit tests for the utility functions. If you want to write tests for them, you should check scipy for their specification. They come from:
- erf: scipy.special.erf
- logpdf: I think the original code is this
@MommaWatasu Thanks for adding unit tests. Generally we don't test internal utility functions, we only test the public API, but aim to cover both meaningful examples (small datasets with nontrivial clusterings) and corner cases (e.g. a single data point). As for the utility functions: erf/erfc are available in Distributions.jl, which Clustering.jl already depends upon, so you should rather use these ones. pdf/logpdf are also declared there.
I noticed that erf and logpdf aren't for HDBSCAN (but for Xmeans). I deleted them and updated test. And also, I added simple docs since Documentation workflow failed without it.
Great, erf is provided by SpecialFunctions.jl, but as we want to be conservative about the number of package dependencies, it is convenient that we don't need it.
@alyst I applied all your suggestions. Is there anything else I have to fix?
@MommaWatasu Thank you for adjusting the PR! We are getting close to being able to merge it, but we would need one more iteration.
Note that I have pushed some adjustments to the code directly to your branch, so please make sure to pull them first.
TODO items:
- at the moment the tests that you have do not really test whether the clusters are correct. In fact, at the moment all the points are assigned to the noise cluster (I think it was also the case before my changes). Please add test(s) that the assignments/clusters are correct. Ideally, we need to test a more complex clustering (more than 2 clusters); it would also be nice to test that changing `ncore` or `min_cluster_size` affects the result
- Now that I understand the algorithm a bit more, it looks like `HdbscanCluster` is an internal structure, and it is rather a `HdbscanTree` node than the cluster you return to the user. It contains fields like `stability`, `children` etc.: some of them we should not expose to the user, the others you don't really set when you are preparing the resulting clusters. I suggest that you rename this structure into `HdbscanNode` (a non-exported structure for the algorithm) and create a new one, `HdbscanCluster` (the exported one returned to the user). If there are any properties of the cluster, such as `stability` or being noise, please make sure to add the relevant fields to `HdbscanCluster` and properly initialize them when you are generating the result.
- `HdbscanResult` should inherit from `ClusteringResult` and support its API (in particular, `counts` is missing)
- move `UnionFind` to a separate source file, unionfind.jl. I'm not 100% sure it belongs to Clustering.jl. We may potentially use DataStructures.jl, but this code looks rather compact, so I think I prefer keeping it over depending on another big package.
- clean up the `UnionFind` terminology. I think `set_id`/`issameset`/`items` would be an optimal choice.
- add unit tests for `UnionFind`, e.g. that finding the root, `issameset`, and `unite!` work as expected
- add the newlines in the docstrings that separate the declaration from the description
- add newlines that separate struct fields from the inner constructor
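The UnionFind items above can be sketched as follows. This is a minimal illustration using the set_id/issameset/items terminology suggested in the review, not the code actually merged in the PR; the field layout and `unite!` return convention are my own assumptions.

```julia
# Minimal union-find sketch with the suggested terminology.
mutable struct UnionFind
    parent::Vector{Int}

    UnionFind(n::Int) = new(collect(1:n))
end

# root of the set containing x, with path compression
function set_id(uf::UnionFind, x::Int)
    if uf.parent[x] != x
        uf.parent[x] = set_id(uf, uf.parent[x])
    end
    return uf.parent[x]
end

# are x and y in the same set?
issameset(uf::UnionFind, x::Int, y::Int) = set_id(uf, x) == set_id(uf, y)

# merge the sets containing x and y; returns false if already united
function unite!(uf::UnionFind, x::Int, y::Int)
    rx, ry = set_id(uf, x), set_id(uf, y)
    rx == ry && return false
    uf.parent[ry] = rx
    return true
end

# all elements in the same set as x
items(uf::UnionFind, x::Int) =
    [i for i in eachindex(uf.parent) if issameset(uf, i, x)]
```

A unit test along the lines the review asks for would, e.g., build `UnionFind(5)`, call `unite!(uf, 1, 2)` and `unite!(uf, 2, 3)`, and then check that `issameset(uf, 1, 3)` holds, `issameset(uf, 1, 4)` does not, and `items(uf, 1) == [1, 2, 3]`.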
@alyst I have one thing to apologize for. I found that the reason the clustering result went wrong was a serious mistake I made in the algorithm. I fixed the algorithm and checked that the result is correct. In addition, all the TODO items have been done (but I couldn't add a unit test for ncore because I don't know how it affects the result).
I would appreciate it if you could check for any performance issues regarding the fixed algorithm.
Hi, thanks for the effort of bringing HDBSCAN to Clustering.jl! I wonder what is the current status of this issue?
I would also be keen to get this merged soon, it looks great :)