hdbscan
HDBSCAN gives different clustering results if the order of the dataframe rows changes
import pandas as pd
import hdbscan

arr = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]
df = pd.DataFrame(arr, columns=["A", "B", "C"])
clusterer = hdbscan.HDBSCAN(metric='jaccard', min_cluster_size=2, min_samples=2).fit(df)
clusterer.labels_
This gives the result: array([0, 0, 0, 1, 1, 1, 0, 0])
However, if I change the order of the rows:
import pandas as pd
import numpy as np
import hdbscan

arr = [(0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]
df = pd.DataFrame(arr, columns=["A", "B", "C"])
clusterer = hdbscan.HDBSCAN(metric='jaccard', min_cluster_size=2, min_samples=2).fit(df)
clusterer.labels_
This gives a different result: array([-1, -1, -1, -1, -1, -1, -1, -1])
Can anyone help me? Thanks.
+1. I just observed the same behavior while migrating code to another environment, where the exact same source data is loaded by a different method (CSV vs. SQL). I didn't realize the ordering of the dataframe would matter (that much) and produce different clustering results.
If that's the case, is there any advice on how to order the dataframe in order to achieve consistent results? Thanks.
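One possible workaround for the ordering question above, offered as a sketch rather than a fix: put the rows into a canonical (sorted) order before fitting, so the clusterer always sees the same input no matter how the data was loaded. The helper names `canonical_order` and `restore_order` are hypothetical, and this assumes that HDBSCAN returns the same labels when given identical input, which this thread does not confirm. It does not explain why the row order matters in the first place.

```python
def canonical_order(rows):
    """Return the rows sorted lexicographically, plus each sorted row's original position."""
    order = sorted(range(len(rows)), key=lambda i: tuple(rows[i]))
    return [rows[i] for i in order], order


def restore_order(sorted_labels, order):
    """Map labels computed on the sorted rows back to the original row positions."""
    labels = [None] * len(order)
    for sorted_pos, original_pos in enumerate(order):
        labels[original_pos] = sorted_labels[sorted_pos]
    return labels


# The two row orderings from the original post.
arr_a = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]
arr_b = [(0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]

sorted_a, order_a = canonical_order(arr_a)
sorted_b, order_b = canonical_order(arr_b)

# Both orderings collapse to the same canonical input.
print(sorted_a == sorted_b)
```

With pandas, the equivalent canonical ordering would be `df.sort_values(by=list(df.columns))`. After fitting on the sorted rows, `restore_order(labels, order)` puts the labels back in the original row order.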
Tested with DBSCAN which works fine. Can anyone help? Thanks.
I've seen the same behavior.
@lmcinnes Is this plausible?
Bump. I'm experiencing the same behaviour. Has anyone found out why this happens, or is there a conceptual reason for it?
That is a bit strange. I tried to reproduce your error and wasn't able to in Python 3.8 on a fresh install of hdbscan on a MacBook. For my first run I got: array([0, 0, 0, 1, 1, 1, 0, 0])
For your second block of code I got: array([0, 0, 1, 1, 1, 0, 1, 1])
Though these results differ, they differ in the expected way and are perfectly consistent with each other (apply the appropriate permutation and flip the cluster ids). I'm not entirely certain why you are seeing all your points labelled as noise (-1) in your second run. Can folks who can reproduce this problem indicate which versions of Python and hdbscan, and which OS, they are using?
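To make the "same up to permutation and label flip" check concrete, here is a minimal sketch. It groups the rows by cluster label, drops the label ids, and compares the resulting groups. The helper name `partition_signature` is hypothetical and not part of hdbscan; noise (-1) is deliberately kept distinct from real clusters, so an all-noise result does not count as equivalent.

```python
from collections import Counter


def partition_signature(rows, labels):
    """Group rows by cluster label, then forget the label ids themselves."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, Counter())[tuple(row)] += 1
    # Keep only whether a group is noise (-1) and which rows it holds.
    return Counter(
        (label == -1, frozenset(members.items()))
        for label, members in groups.items()
    )


# Rows and labels as reported in this thread.
rows_1 = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]
rows_2 = [(0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]
labels_1 = [0, 0, 0, 1, 1, 1, 0, 0]
labels_2_maintainer = [0, 0, 1, 1, 1, 0, 1, 1]
labels_2_reporter = [-1, -1, -1, -1, -1, -1, -1, -1]

reference = partition_signature(rows_1, labels_1)
print(partition_signature(rows_2, labels_2_maintainer) == reference)  # consistent clustering
print(partition_signature(rows_2, labels_2_reporter) == reference)    # all noise, not consistent
```

Comparing groups of rows rather than positions sidesteps the fact that the duplicate rows make the row permutation ambiguous.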
MacBook running macOS Catalina 10.15.7, hdbscan 0.8.26, scipy 1.5.3, scikit-learn 0.23.2, Python 3.7.9.
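For anyone else reporting their setup, a small sketch like the following collects the same details in one go. The function name `environment_report` and the default package list are assumptions for illustration. It relies on `importlib.metadata`, which is in the standard library from Python 3.8 onward, so on Python 3.7 the versions would have to be gathered another way.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("hdbscan", "scikit-learn", "scipy", "numpy", "pandas")):
    """Collect the OS, the Python version, and installed package versions."""
    report = {"os": platform.platform(), "python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```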
I checked OP's example and get correct results. However, for my own data I do not get the correct results.
Here is some simple example code and data that cause this issue for me.
Using the arguments provided in the code results in one of the samples being assigned differently. When using default arguments (no kwargs) when initialising the hdbscan clusterer, all data from sample_2.csv becomes unclassified (-1), much like OP's issue.
bump