Introduce sample weighting in HDBSCAN
Like for DBSCAN, it would be very useful to also have the possibility to use a "sample_weight" parameter when performing HDBSCAN clustering through the fit method. This would take into account the possible presence of duplicate elements in the data.
DBSCAN is just using sample_weight in the following way:
n_neighbors = np.array([np.sum(sample_weight[neighbors]) for neighbors in neighborhoods])
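(For context, a minimal standalone sketch of what that line amounts to; the radius query and the eps / min_samples names here are illustrative, not the exact scikit-learn internals.)
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_core_mask(X, sample_weight, eps=0.5, min_samples=5):
    # Find all neighbours within eps of each point (each point is its own neighbour).
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    # A point is a core point if the summed weight of its neighbourhood,
    # rather than the plain neighbour count, reaches min_samples.
    n_neighbors = np.array([np.sum(sample_weight[neighbors]) for neighbors in neighborhoods])
    return n_neighbors >= min_samples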
Do you know if this is already planned, or not?
It's certainly a feature that has been discussed and planned in some sense, but we have not implemented it yet, and at the moment the developers are committed to other projects, so time to work on this right now is at a premium. We hope at some point in the not too distant future to get back and do an overhaul of hdbscan, including features like this. In the meantime, I don't believe this is too difficult to achieve, and I would gladly accept a pull request if you or someone else wanted to put in the work.
OK, sure; in that case it is our intention to implement it. Could you help us by indicating where (in which script/function) you think it would be ideal to make that modification, so that it is as effective and general as possible (e.g. algorithm independent) while avoiding any possible unwanted side effects?
Unfortunately there are a bunch of places where amendments would be required, since the current library supports a number of approaches to the algorithm for the sake of flexibility and completeness. At heart, what needs to be done is to allow for different core distance computations that can account for the weighting -- that, in turn, can be done with the weight vector and the relevant knn information. The relevant places where changes would be required are:
In hdbscan/hdbscan_.py:
- Lines 172-175 or so
- Lines 204-207 or so
In hdbscan/_hdbscan_reachability.pyx:
- Lines 44 to 51 or so
- Lines 77 to 83 or so
In hdbscan/_hdbscan_boruvka.pyx:
- Lines 397 to 436 or so
- Lines 1005 to 1033 or so
I believe that should cover all the cases that need to be worried about. Of course there is also the necessity of passing the weight vector all the way through to these functions, which will require modifying the call chain down to that point as well, but that should be traceable.
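To make the weighted core distance idea a little more concrete, here is a rough numpy sketch of what it could look like for the generic (distance matrix) path; the function name and the fallback behaviour are guesses for illustration, not the library's actual code:
import numpy as np

def weighted_mutual_reachability(distance_matrix, sample_weight, min_samples):
    # Weighted core distance: the smallest radius at which the cumulative
    # weight of a point's neighbours (sorted by distance, including itself)
    # reaches min_samples, instead of the plain min_samples-th neighbour distance.
    n = distance_matrix.shape[0]
    core = np.empty(n)
    for i in range(n):
        order = np.argsort(distance_matrix[i])
        cumulative_weight = np.cumsum(sample_weight[order])
        idx = np.searchsorted(cumulative_weight, min_samples)
        core[i] = distance_matrix[i, order[min(idx, n - 1)]]
    # The usual mutual reachability combination: max(core[i], core[j], d(i, j)).
    return np.maximum(np.maximum.outer(core, core), distance_matrix)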
@lmcinnes , just to let you know, those other bug reports come from @prighini and me starting to dig into the code to implement this, at least for our isolated use case. Cython is a bit of a blocker here, but we're slowly progressing.
From what I understand, the weights need to be used when calculating the core distances/mutual reachability distances (for the radius?) and once again when condensing the cluster tree, so that min_cluster_size is met. Does this sound like a good direction?
For _hdbscan_generic this should be straightforward, as we have a distance_matrix and we can modify _hdbscan_reachability.pyx around the lines you gave, but I'm a bit puzzled by _hdbscan_boruvka.pyx: are the _core_dist_query in line 412 and core_dist_tree.query in line 420 good places to modify the core distances?
It looks like we will focus on _hdbscan_boruvka.pyx first, and maybe try to generalise later.
The boruvka stuff is just messy for a couple of reasons. First I just kept hacking away 'til I could get something working that ran as fast as possible, so it isn't really the cleanest. Second, and to be honest this is most of it, the algorithm itself is just plain hard when you get to the nitty gritty details, so the whole thing is rather complicated. In practice there isn't that much you need to worry about, just the core distance computation; the rest should take care of itself (I hope).
Here's a rough outline of what's going on there: one way or another (line 412 or line 420) we are just getting a pair of arrays, knn_indices and knn_dist. The ith row of knn_indices is an array of the indices of the nearest neighbors of the ith data point. Similarly, the ith row of knn_dist is the distances to the nearest neighbors of the ith data point. Given those, computing the core distance in the normal way is simply a matter of picking the min_samples-th distance from each row of the knn_dist array (that's what line 426 is doing). For the weighted case, on the other hand, I think you want to make use of the knn_indices array and the weight array to work out which index (rather than min_samples) into the knn_dist array you want to look at, for each and every row.
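To illustrate that per-row weighted index selection in plain numpy (the real code is Cython and the names below are made up, but the idea should carry over):
import numpy as np

def weighted_core_distances_from_knn(knn_indices, knn_dist, sample_weight, min_samples):
    # For each row, accumulate the weights of the neighbours in distance order;
    # the core distance is the distance at which the accumulated weight first
    # reaches min_samples (falling back to the furthest neighbour in the row
    # if the total weight never gets there).
    n_rows, n_neighbors = knn_dist.shape
    core_distances = np.empty(n_rows)
    for i in range(n_rows):
        cumulative_weight = np.cumsum(sample_weight[knn_indices[i]])
        idx = np.searchsorted(cumulative_weight, min_samples)
        core_distances[i] = knn_dist[i, min(idx, n_neighbors - 1)]
    return core_distances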
Is that of some help?
@lmcinnes , I believe I have a working Boruvka KD-Tree path, could you have a look at: https://github.com/m-dz/hdbscan/tree/sample_weighting ?
The underlying objects are correct; just the plots are not picking up the right sizes and colours, but before even looking at those I would like to confirm this is the right direction.
Also, I am not very experienced with Python, let alone Cython, so I bet everything can be further optimised...
I am using the following toy example:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from hdbscan import HDBSCAN
palette = sns.color_palette("tab10", 4)
# Generate data
test_data = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0], [1.0, 0.8], [0.8, 0.8], [0.0, 1.0], [0.0, 0.8], [0.2, 1.0], [0.2, 0.8]])
sample_weights = np.array([5, 1, 1, 1, 1, 1, 1, 1, 1, 1])
sizes = [3*80 if n > 1 else 80 for n in sample_weights]
# Unweighted HDBSCAN
# Just to check for errors
np.random.seed(1)
hdb_unweighted = HDBSCAN(min_cluster_size=3, gen_min_span_tree=True, allow_single_cluster=False)
hdb_unweighted.fit(test_data)
# Weighted HDBSCAN
np.random.seed(1)
hdb_weighted = HDBSCAN(min_cluster_size=3, gen_min_span_tree=True, allow_single_cluster=False, algorithm='boruvka_kdtree')
hdb_weighted.fit(test_data, sample_weights=sample_weights)
# Colours and sizes are off, but clusters are selected correctly
cluster_colours = [palette[col] if col >= 0 else (0.0, 0.0, 0.0) for col in hdb_weighted.labels_]
fig = plt.figure()
plt.scatter(test_data.T[0], test_data.T[1], s=sizes, c=cluster_colours)
fig.suptitle('Weighted HDBSCAN'); plt.show()
fig = plt.figure()
hdb_weighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.6, node_size=80, edge_linewidth=5)
fig.suptitle('Weighted HDBSCAN Minimum spanning tree'); plt.show()
fig = plt.figure()
hdb_weighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True, vary_line_width=True)
fig.suptitle('Weighted HDBSCAN Single linkage tree'); plt.show()
fig = plt.figure()
hdb_weighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=False)
fig.suptitle('Weighted HDBSCAN condensed tree plot'); plt.show()
Edit: Added algorithm='boruvka_kdtree'.
Thanks for doing this! I am pretty busy at the moment with some other projects, but I will endeavour to take a look at this as soon as I can -- I'm quite keen to get such extra features merged in. I really appreciate the work you have been putting in on this, so please don't mistake my slowness in reviewing it for a lack of enthusiasm about the project or your efforts.
Sure, it was a really good exercise! I will do my best to get all of the "paths" covered, but your comments and amendments might be necessary to finish this feature. Not to mention they will be extremely valuable for me as well.
I looked through the code added to the boruvka_kdtree core distance computation and that all looks good. First of all it is good to see that performance for the non-weighted case is not altered. For the weighted case I suspect you will find performance is (possibly) lower than you would like. The main thing from a Cython point of view to improve it would be to type declare some of the variables you are using. Something like
cdef np.intp_t idx
cdef np.ndarray[np.intp_t, ndim=1] idx_row
cdef np.double_t cs
cdef np.intp_t weighted_idx
would do the job. You might also want to consider making the generator/next into an actual for loop with a break (which I suspect Cython will find easier). On the other hand, it may make little difference, and the generator approach is certainly cleaner -- profile/benchmark to be sure.
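For illustration, and only guessing at the exact shape of the code in the branch, the two variants might look roughly like this:
import numpy as np

cumulative_weight = np.cumsum(np.array([5.0, 1.0, 1.0, 1.0]))  # e.g. weights of one knn row
min_samples = 5

# Generator/next version (compact, but opaque to Cython's optimiser):
weighted_idx = next((i for i, cs in enumerate(cumulative_weight) if cs >= min_samples),
                    len(cumulative_weight) - 1)

# Equivalent explicit for loop with a break (typically easier for Cython to compile tightly):
weighted_idx = len(cumulative_weight) - 1
for i, cs in enumerate(cumulative_weight):
    if cs >= min_samples:
        weighted_idx = i
        break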
In short it looks great, and all I have are minor nitpicks for the sake of potential performance. Hopefully the other code paths are easier (I believe they will be very similar for the most part).
@lmcinnes , is there a better channel to discuss how we should PR the weighting? There are some "design" decisions to be made which would be good to discuss, e.g. whether you would be happy to implement weighting for only some of the algorithms (the KD-tree and ball tree Boruvka in particular) with an internal check like if sample_weights is not None and algorithm not in weighting_supported_algorithms: throw_an_error, etc.?
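(Spelled out, such a guard might look something like the sketch below; the WEIGHTING_SUPPORTED_ALGORITHMS tuple and the error message are placeholders, not existing library code.)
WEIGHTING_SUPPORTED_ALGORITHMS = ('boruvka_kdtree', 'boruvka_balltree')

def check_sample_weights(sample_weights, algorithm):
    # Hypothetical guard to run inside fit() before dispatching to an algorithm.
    if sample_weights is not None and algorithm not in WEIGHTING_SUPPORTED_ALGORITHMS:
        raise ValueError(
            "sample_weights is currently only supported for: "
            + ", ".join(WEIGHTING_SUPPORTED_ALGORITHMS)
        )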
I don't have a whole lot of experience with this, but ultimately the best thing to do is make a [WIP] labelled pull request and use that as the place to review and discuss.
Hi, just a quick update, we're waiting for a company internal approval of our OSS contribution, which should happen soon. After that a [WIP] PR will arrive!
Excellent, I'm definitely looking forward to this. I haven't had much time for hdbscan work lately so it has been lagging a little -- it will be great to get some new features up and running, and see some (much needed) refactoring done. Definitely greatly appreciated.
Hi, is there any new status on introducing sample weighting in HDBSCAN?
Hi, I am afraid not; we have been using the "work in progress" branch from https://github.com/scikit-learn-contrib/hdbscan/pull/165 for some time already without any issues, but it is definitely not supported within the module itself.
Hi, have there been any updates on weighted HDBSCAN since then?
We want weights in HDBSCAN master! 🙌 Is there a way we can help you with this? It seems there is an implementation that only needs merging.
As far as I know it is still a work in progress, and given the changes to HDBSCAN since it was first proposed it is not so easy to merge anymore either. At this point I would actually encourage any new features for HDBSCAN to be added to the HDBSCAN implementation in sklearn, so you could potentially propose a PR there. Notably, that implementation doesn't (yet) contain the dual-tree Boruvka implementation, so adding sample weights would be a lot easier at this point.
Another option would be to make a PR for sample weights to fast_hdbscan, which is a much simpler pure Python implementation and so likely to be easier to add a feature like this to.