Soft Clustering with precomputed distance matrix

Open IlyaOrson opened this issue 6 years ago • 12 comments

Hello! First of all, thanks a lot for this clustering method and the implementation, both are super cool!

I am trying to use Soft Clustering with the precomputed distance matrix since I am using an unconventional distance. There appears to be no method implemented for this right now. I understand this is a new experimental feature and wondered if this limitation is just temporary. Is it possible to add this functionality?

Just for reference, the following code, built from the manual, produces this warning:

import hdbscan
from sklearn.datasets import make_blobs
import pandas as pd
blobs, labels = make_blobs(n_samples=2000, n_features=10)
pd.DataFrame(blobs).head()

from sklearn.metrics.pairwise import pairwise_distances
distance_matrix = pairwise_distances(blobs)
clusterer = hdbscan.HDBSCAN(metric='precomputed',
                             prediction_data=True)
clusterer.fit(distance_matrix)
clusterer.labels_

UserWarning: Cannot generate prediction data for non-vectorspace inputs -- access to the source data ratherthan mere distances is required!

IlyaOrson • Aug 16 '17 16:08

I believe it is a relatively fundamental obstruction at the present time. There may be some cases where it could be made to work, but I would have to think carefully about how best to build an API that would allow for that without being confusing for all the other cases. Sorry that I can't provide any better answers at this time.

lmcinnes • Aug 16 '17 16:08

No rush at all, I will stay tuned. Thanks for this again!

IlyaOrson • Aug 16 '17 17:08

Hi! Thank you for all your work!

Is it the same with callables? Because I tried to execute the following code:

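import hdbscan
from vincenty import vincenty  # assumed: the PyPI "vincenty" package, whose function accepts miles=True

def userdist(x, y):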
    distance = vincenty((x[0], x[1]), (y[0], y[1]), miles=True)
    return distance

clusterer = hdbscan.HDBSCAN(min_cluster_size=6,
                            min_samples=3,
                            metric=userdist,
                            prediction_data=True).fit(data[['latitude', 'longitude']]) 

I don't get any warning, but when I call all_points_membership_vectors(clusterer) on it, I find that clusterer.prediction_data_ is None.

The error I get is the following:

/Users/nlassaux/hdbscan-clustering/env/lib/python2.7/site-packages/hdbscan/prediction.pyc in all_points_membership_vectors(clusterer)
    514     clusters = np.array(list(clusterer.condensed_tree_._select_clusters()
    515                              )).astype(np.intp)
--> 516     all_points = clusterer.prediction_data_.raw_data
    517 
    518     distance_vecs = all_points_dist_membership_vector(all_points,
AttributeError: 'NoneType' object has no attribute 'raw_data'

Can you explain why a custom metric is a special case for getting a soft clustering?
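
A defensive check along these lines would avoid the AttributeError above. This is only a sketch: has_prediction_data and soft_cluster_or_none are hypothetical helpers, not part of hdbscan. It assumes that a clusterer fitted without prediction data either exposes prediction_data_ as None (as reported here) or raises AttributeError when the attribute is accessed.

```python
def has_prediction_data(clusterer):
    """Return True if the fitted clusterer carries usable prediction data.

    Hypothetical helper, not part of hdbscan. getattr with a default covers
    both an attribute that raises AttributeError and one that is set to None.
    """
    return getattr(clusterer, "prediction_data_", None) is not None


def soft_cluster_or_none(clusterer, membership_fn):
    """Call membership_fn(clusterer) only when prediction data exists.

    membership_fn is expected to be something like
    hdbscan.all_points_membership_vectors; it is passed in so that this
    sketch does not depend on hdbscan being installed.
    """
    if not has_prediction_data(clusterer):
        # e.g. fitted with a callable or precomputed metric
        return None
    return membership_fn(clusterer)
```

With a real clusterer this would be called as soft_cluster_or_none(clusterer, hdbscan.all_points_membership_vectors), returning None instead of raising when soft clustering is unavailable.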

nlassaux • Sep 08 '17 00:09

The soft clustering is still fairly new, and I haven't pushed everything through properly. For now I'm making heavy use of sklearn's KDTree and BallTree, and while they support custom metrics, callables aren't explicitly listed among the allowed metrics, and checking that list is the easiest way to tell whether the trees can reasonably be used. That means the algorithm falls back to other approaches, which don't support soft clustering at this time.
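
The name-based check described above can be pictured roughly as follows. This is a simplified sketch, not hdbscan's actual code: TREE_METRICS is an illustrative subset of the metric names that scikit-learn's tree indexes accept, and can_build_prediction_data is a hypothetical name.

```python
# Illustrative subset only; the real lists live on scikit-learn's
# KDTree and BallTree classes and are longer.
TREE_METRICS = {"euclidean", "manhattan", "chebyshev", "minkowski"}


def can_build_prediction_data(metric):
    """Mimic a name-based lookup: only string metrics found in the list pass.

    A callable is never "in" a list of names, and "precomputed" is not a
    tree metric, so both end up on the fallback path without soft clustering.
    """
    return isinstance(metric, str) and metric in TREE_METRICS


def userdist(x, y):
    # Stand-in custom metric for the example (Manhattan distance in 2D).
    return abs(x[0] - y[0]) + abs(x[1] - y[1])


for metric in ("euclidean", "precomputed", userdist):
    name = metric if isinstance(metric, str) else metric.__name__
    print(name, can_build_prediction_data(metric))
```

Under this kind of check a callable fails the lookup even when it computes a distance the trees could handle, which matches the behaviour reported earlier in the thread.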

lmcinnes • Sep 08 '17 02:09

If you could add an issue with a feature request to ensure that callable metrics are supported for soft clustering I would appreciate it -- it will help stop this falling through the cracks later.

lmcinnes • Sep 08 '17 02:09

Hello,

Thank you for developing such a great clustering library.

It would be really useful to have this feature available in the next release of hdbscan.

It would be great to be able to use approximate_predict or membership_vector with a custom distance measure, or to use the same methods with a precomputed pairwise distance matrix as input.

Could I ask if there are any plans for this functionality to be added?

Thank you, Elena

elena-sharova • Nov 09 '18 11:11

My current priorities are in developing a follow-on clustering library that benefits from some newer theory and a lot of lessons learned from this library, particularly when it comes to soft clustering. That means that in practice I do not have any near-term plans to add such functionality myself. I would be more than happy to accept pull requests that add it.

lmcinnes • Nov 09 '18 13:11

Hi @lmcinnes,

Has there been any progress on getting prediction working for precomputed distances?

danielgeiszler • May 10 '20 21:05

Hello,

Any progress on getting prediction working with precomputed distances? I precomputed a cosine distance matrix, since cosine is not supported as a metric, but prediction does not work on it.

warrior-galaxy • May 27 '20 21:05

It is unlikely to be available for precomputed distances any time soon. Sorry.

lmcinnes • May 28 '20 03:05

I get this same warning

UserWarning: Cannot generate prediction data for non-vectorspace inputs -- access to the source data ratherthan mere distances is required!

when using sqeuclidean as my distance metric. Is that to be expected, @lmcinnes? I'm guessing that under the hood any of the scipy distances end up doing the same thing, namely computing a precomputed distance matrix?

kr-hansen • Mar 14 '22 23:03

Checking again on the status (hopefully progress) of this thread, namely using fuzzy/soft clustering with a precomputed distance matrix...

MH8775 • Oct 12 '22 11:10