Introduce sample weighting in HDBSCAN
Like for DBSCAN, it would be very useful to also have the possibility to use a "sample_weight" parameter when performing HDBSCAN clustering through the fit method. This would take into account the possible presence of duplicate elements in the data.
DBSCAN is just using sample_weight in the following way:
n_neighbors = np.array([np.sum(sample_weight[neighbors]) for neighbors in neighborhoods])
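(For context, a minimal standalone sketch of what that line amounts to; the radius query and the eps / min_samples names here are illustrative, not the exact scikit-learn internals.)
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_core_mask(X, sample_weight, eps=0.5, min_samples=5):
    # Find all neighbours within eps of each point (each point is its own neighbour).
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    # A point is a core point if the summed weight of its neighbourhood,
    # rather than the plain neighbour count, reaches min_samples.
    n_neighbors = np.array([np.sum(sample_weight[neighbors]) for neighbors in neighborhoods])
    return n_neighbors >= min_samples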
Do you know if this is already planned, or not?
It's certainly a feature that has been discussed and planned in some sense, but we have not implemented it yet, and at the moment the developers are committed to other projects, so time to work on this right now is at a premium. We hope at some point in the not too distant future to get back and do an overhaul of hdbscan, including features like this. In the meantime, I don't believe this is too difficult to achieve, and I would gladly accept a pull request if you or someone else wanted to put in the work.
OK, sure; in that case it is our intention to implement it. Could you help us by indicating where (in which script/function) you think it would be ideal to make that modification, so that it is as effective and general as possible (e.g. algorithm independent) while avoiding any possible unwanted side effects?
Unfortunately there are a bunch of places where amendments would be required, since the current library supports a number of approaches to the algorithm for the sake of flexibility and completeness. At heart, what needs to be done is to allow for different core distance computations that can account for the weighting -- that, in turn, can be done with the weight vector and the relevant knn information. The relevant places where changes would be required are:
In hdbscan/hdbscan_.py:
- Lines 172-175 or so
- Lines 204-207 or so
In hdbscan/_hdbscan_reachability.pyx:
- Lines 44 to 51 or so
- Lines 77 to 83 or so
In hdbscan/_hdbscan_boruvka.pyx:
- Lines 397 to 436 or so
- Lines 1005 to 1033 or so
I believe that should cover all the cases that need to be worried about. Of course there is also the necessity of passing the weight vector all the way through to these functions, which will require modifying the call chain down to that point as well, but that should be traceable.
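To make the weighted core distance idea a little more concrete, here is a rough numpy sketch of what it could look like for the generic (distance matrix) path; the function name and the fallback behaviour are guesses for illustration, not the library's actual code:
import numpy as np

def weighted_mutual_reachability(distance_matrix, sample_weight, min_samples):
    # Weighted core distance: the smallest radius at which the cumulative
    # weight of a point's neighbours (sorted by distance, including itself)
    # reaches min_samples, instead of the plain min_samples-th neighbour distance.
    n = distance_matrix.shape[0]
    core = np.empty(n)
    for i in range(n):
        order = np.argsort(distance_matrix[i])
        cumulative_weight = np.cumsum(sample_weight[order])
        idx = np.searchsorted(cumulative_weight, min_samples)
        core[i] = distance_matrix[i, order[min(idx, n - 1)]]
    # The usual mutual reachability combination: max(core[i], core[j], d(i, j)).
    return np.maximum(np.maximum.outer(core, core), distance_matrix)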
@lmcinnes , just to let you know, those other bug reports come from @prighini and me starting to dig into the code to implement this, at least for our isolated use case. Cython is a bit of a blocker here, but we're slowly progressing.
From what I understand, the weights need to be used when calculating the core distances/mutual reachability distances (for the radius?) and once again when condensing the cluster tree, so that min_cluster_size is met. Does this sound like a good direction?
For _hdbscan_generic this should be straightforward, as we have a distance_matrix and we can modify _hdbscan_reachability.pyx around the lines you gave, but I'm a bit puzzled by _hdbscan_boruvka.pyx: are the _core_dist_query in line 412 and core_dist_tree.query in line 420 good places to modify the core distances?
It looks like we will focus on _hdbscan_boruvka.pyx first, and maybe try to generalise later.
The boruvka stuff is just messy for a couple of reasons. First I just kept hacking away 'til I could get something working that ran as fast as possible, so it isn't really the cleanest. Second, and to be honest this is most of it, the algorithm itself is just plain hard when you get to the nitty gritty details, so the whole thing is rather complicated. In practice there isn't that much you need to worry about, just the core distance computation; the rest should take care of itself (I hope).
Here's a rough outline of what's going on there: one way or another (line 412 or line 420) we are just getting a pair of arrays, knn_indices and knn_dist. The ith row of knn_indices is an array of the indices of the nearest neighbors of the ith data point. Similarly, the ith row of knn_dist is the distances to the nearest neighbors of the ith data point. Given those, computing the core distance in the normal way is simply a matter of picking the min_samples-th distance from each row of the knn_dist array (that's what line 426 is doing). For the weighted case, on the other hand, I think you want to make use of the knn_indices array and the weight array to work out which index (rather than min_samples) into the knn_dist array you want to look at, for each and every row.
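To illustrate that per-row weighted index selection in plain numpy (the real code is Cython and the names below are made up, but the idea should carry over):
import numpy as np

def weighted_core_distances_from_knn(knn_indices, knn_dist, sample_weight, min_samples):
    # For each row, accumulate the weights of the neighbours in distance order;
    # the core distance is the distance at which the accumulated weight first
    # reaches min_samples (falling back to the furthest neighbour in the row
    # if the total weight never gets there).
    n_rows, n_neighbors = knn_dist.shape
    core_distances = np.empty(n_rows)
    for i in range(n_rows):
        cumulative_weight = np.cumsum(sample_weight[knn_indices[i]])
        idx = np.searchsorted(cumulative_weight, min_samples)
        core_distances[i] = knn_dist[i, min(idx, n_neighbors - 1)]
    return core_distances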
Is that of some help?
@lmcinnes , I believe I have a working Boruvka KD-Tree path, could you have a look at: https://github.com/m-dz/hdbscan/tree/sample_weighting ?
The underlying objects are correct; just the plots are not picking up the right sizes and colours, but before even looking at those I would like to confirm this is the right direction.
Also, I am not very experienced with Python, let alone Cython, so I bet everything can be further optimised...
I am using the following toy example:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from hdbscan import HDBSCAN
palette = sns.color_palette("tab10", 4)
# Generate data
test_data = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0], [1.0, 0.8], [0.8, 0.8], [0.0, 1.0], [0.0, 0.8], [0.2, 1.0], [0.2, 0.8]])
sample_weights = np.array([5, 1, 1, 1, 1, 1, 1, 1, 1, 1])
sizes = [3*80 if n > 1 else 80 for n in sample_weights]
# Unweighted HDBSCAN
# Just to check for errors
np.random.seed(1)
hdb_unweighted = HDBSCAN(min_cluster_size=3, gen_min_span_tree=True, allow_single_cluster=False)
hdb_unweighted.fit(test_data)
# Weighted HDBSCAN
np.random.seed(1)
hdb_weighted = HDBSCAN(min_cluster_size=3, gen_min_span_tree=True, allow_single_cluster=False, algorithm='boruvka_kdtree')
hdb_weighted.fit(test_data, sample_weights=sample_weights)
# Colours and sizes are off, but clusters are selected correctly
cluster_colours = [palette[col] if col >= 0 else (0.0, 0.0, 0.0) for col in hdb_weighted.labels_]
fig = plt.figure()
plt.scatter(test_data.T[0], test_data.T[1], s=sizes, c=cluster_colours)
fig.suptitle('Weighted HDBSCAN'); plt.show()
fig = plt.figure()
hdb_weighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.6, node_size=80, edge_linewidth=5)
fig.suptitle('Weighted HDBSCAN Minimum spanning tree'); plt.show()
fig = plt.figure()
hdb_weighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True, vary_line_width=True)
fig.suptitle('Weighted HDBSCAN Single linkage tree'); plt.show()
fig = plt.figure()
hdb_weighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=False)
fig.suptitle('Weighted HDBSCAN condensed tree plot'); plt.show()
Edit: Added algorithm='boruvka_kdtree'.
Thanks for doing this! I am pretty busy at the moment with some other projects, but I will endeavour to take a look at this as soon as I can -- I'm quite keen to get such extra features merged in. I really appreciate the work you have been putting in on this, so please don't mistake my slowness in reviewing it for a lack of enthusiasm about the project or your efforts.
Sure, it was a really good exercise! I will do my best to get all of the "paths" covered, but your comments and amendments might be necessary to finish this feature. Not to mention they will be extremely valuable for me as well.
I looked through the code added to the boruvka_kdtree core distance computation and that all looks good. First of all it is good to see that performance for the non-weighted case is not altered. For the weighted case I suspect you will find performance is (possibly) lower than you would like. The main thing from a Cython point of view to improve it would be to type declare some of the variables you are using. Something like
cdef np.intp_t idx
cdef np.ndarray[np.intp_t, ndim=1] idx_row
cdef np.double_t cs
cdef np.intp_t weighted_idx
would do the job. You might also want to consider making the generator/next into an actual for loop with a break (which I suspect Cython will find easier). On the other hand, it may make little difference, and the generator approach is certainly cleaner -- profile/benchmark to be sure.
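For illustration, and only guessing at the exact shape of the code in the branch, the two variants might look roughly like this:
import numpy as np

cumulative_weight = np.cumsum(np.array([5.0, 1.0, 1.0, 1.0]))  # e.g. weights of one knn row
min_samples = 5

# Generator/next version (compact, but opaque to Cython's optimiser):
weighted_idx = next((i for i, cs in enumerate(cumulative_weight) if cs >= min_samples),
                    len(cumulative_weight) - 1)

# Equivalent explicit for loop with a break (typically easier for Cython to compile tightly):
weighted_idx = len(cumulative_weight) - 1
for i, cs in enumerate(cumulative_weight):
    if cs >= min_samples:
        weighted_idx = i
        break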
In short it looks great, and all I have are minor nitpicks for the sake of potential performance. Hopefully the other code paths are easier (I believe they will be very similar for the most part).
@lmcinnes , is there a better channel to discuss how we should PR the weighting? There are some "design" decisions to be made which would be good to discuss, e.g. whether you would be happy to implement weighting for only some of the algorithms (the KD-tree and ball tree Boruvka in particular) with an internal check like if sample_weights is not None and algorithm not in weighting_supported_algorithms: throw_an_error, etc.?
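(Spelled out, such a guard might look something like the sketch below; the WEIGHTING_SUPPORTED_ALGORITHMS tuple and the error message are placeholders, not existing library code.)
WEIGHTING_SUPPORTED_ALGORITHMS = ('boruvka_kdtree', 'boruvka_balltree')

def check_sample_weights(sample_weights, algorithm):
    # Hypothetical guard to run inside fit() before dispatching to an algorithm.
    if sample_weights is not None and algorithm not in WEIGHTING_SUPPORTED_ALGORITHMS:
        raise ValueError(
            "sample_weights is currently only supported for: "
            + ", ".join(WEIGHTING_SUPPORTED_ALGORITHMS)
        )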
I don't have a whole lot of experience with this, but ultimately the best thing to do is make a [WIP] labelled pull request and use that as the place to review and discuss.
Hi, just a quick update, we're waiting for a company internal approval of our OSS contribution, which should happen soon. After that a [WIP] PR will arrive!
Excellent, I'm definitely looking forward to this. I haven't had much time for hdbscan work lately so it has been lagging a little -- it will be great to get some new features up and running, and see some (much needed) refactoring done. Definitely greatly appreciated.
Hi, is there any new status on introducing sample weighting in HDBSCAN?
Hi, I am afraid not; we have been using the "work in progress" branch from https://github.com/scikit-learn-contrib/hdbscan/pull/165 for some time already without any issues, but it is definitely not supported within the module itself.
Hi, have there been any updates on weighted HDBSCAN since then?
We want weights in HDBSCAN master! 🙌 Is there a way we can help you with this? It seems there is an implementation that only needs merging.
As far as I know it is still a work in progress, and given the changes to HDBSCAN since it was first proposed it is not so easy to merge anymore either. At this point I would actually encourage any new features for HDBSCAN to be added to the HDBSCAN implementation in sklearn, so you could potentially propose a PR there. Notably, that implementation doesn't (yet) contain the dual-tree Boruvka implementation, so adding sample weights would be a lot easier at this point.
Another option would be to make a PR for sample weights to fast_hdbscan, which is a much simpler pure Python implementation and so likely to be easier to add a feature like this to.