HDBSCAN gives different clustering results if the order of the dataframe rows changes

    import pandas as pd
    import hdbscan

    arr = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]
    df = pd.DataFrame(arr, columns=["A", "B", "C"])

    clusterer = hdbscan.HDBSCAN(metric='jaccard', min_cluster_size=2, min_samples=2).fit(df)
    clusterer.labels_

This gives the result: array([0, 0, 0, 1, 1, 1, 0, 0])

However, if I change the order of the rows:

    import pandas as pd
    import hdbscan

    arr = [(0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]
    df = pd.DataFrame(arr, columns=["A", "B", "C"])

    clusterer = hdbscan.HDBSCAN(metric='jaccard', min_cluster_size=2, min_samples=2).fit(df)
    clusterer.labels_

This gives a different result: array([-1, -1, -1, -1, -1, -1, -1, -1])

Can anyone help me? Thanks.

Dec 12 '18 22:12 wenzhong-zhao

+1. I just observed the same behavior while migrating code to another environment, where I get the exact same source data by different methods (CSV vs SQL). I didn't realize the ordering of the dataframe would matter (that much) and produce different clustering results.

If that's the case, is there any advice on how to order the dataframe in order to achieve consistent results? Thanks.

Dec 17 '18 07:12 foxan

Tested with DBSCAN, which works fine. Can anyone help? Thanks.

Dec 19 '18 01:12 wenzhong-zhao

I've seen the same behavior.

@lmcinnes Is this plausible?

Aug 17 '20 18:08 ekerazha

Bump. I'm experiencing the same behaviour. Has anyone found out why this happens, or is there a conceptual reason for it?

Jan 18 '21 13:01 joloppo

That is a bit strange. I tried to reproduce your error and wasn't able to do it in Python 3.8 on a fresh install of hdbscan on a MacBook. For my first run I got: array([0, 0, 0, 1, 1, 1, 0, 0])

For your second block of code I got: array([0, 0, 1, 1, 1, 0, 1, 1])

Though these results differ, they differ in the expected way and are perfectly consistent with each other (apply the appropriate permutation and flip the cluster ids). I'm not entirely certain why you are seeing all your points labelled as noise (-1) in your second run. Can folks who can reproduce this problem indicate which versions of Python and hdbscan, and which OS, they are using?
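To make the "permute and flip the ids" check concrete, here is a minimal sketch that compares two runs by grouping the rows under each label and ignoring the label numbers themselves. The helper names `partition` and `same_partition` are made up for this illustration and are not part of hdbscan; the label arrays are the ones quoted in this thread. Note that the sketch treats noise (-1) as just another group.

```python
from collections import defaultdict

def partition(rows, labels):
    """Group rows by label, then drop the label ids (hypothetical helper)."""
    clusters = defaultdict(list)
    for row, label in zip(rows, labels):
        clusters[label].append(tuple(row))
    # Sorting makes the result independent of row order and label numbering.
    return sorted(sorted(members) for members in clusters.values())

def same_partition(rows_a, labels_a, rows_b, labels_b):
    """True if both runs group the same rows together (hypothetical helper)."""
    return partition(rows_a, labels_a) == partition(rows_b, labels_b)

rows_first = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 1),
              (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]
rows_second = [(0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0),
               (0, 0, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0)]

labels_first = [0, 0, 0, 1, 1, 1, 0, 0]       # first run, as reported above
labels_second = [0, 0, 1, 1, 1, 0, 1, 1]      # second run on my machine
labels_all_noise = [-1] * 8                   # second run as reported by OP

print(same_partition(rows_first, labels_first, rows_second, labels_second))
print(same_partition(rows_first, labels_first, rows_second, labels_all_noise))
```

The first comparison is the consistent case described above; the second is the all-noise result from the original report, which is a genuinely different clustering rather than a relabelling.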


Jan 18 '21 21:01 jc-healy

MacBook running macOS Catalina 10.15.7, hdbscan 0.8.26, scipy 1.5.3, scikit-learn 0.23.2, Python 3.7.9.
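For anyone else reporting their setup, the details above can be gathered with a short script along these lines. This is only a sketch: `describe_environment` is a hypothetical helper name, the package list is an assumption about what is relevant here, and `importlib.metadata` needs Python 3.8 or newer (on 3.7 the `importlib_metadata` backport would be needed instead).

```python
import platform
import sys
from importlib import metadata  # standard library from Python 3.8 onwards

def describe_environment(packages=("hdbscan", "scipy", "scikit-learn", "numpy")):
    """Return OS, Python and package versions as printable lines."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    for name in packages:
        try:
            lines.append(f"{name}: {metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            lines.append(f"{name}: not installed")
    return lines

print("\n".join(describe_environment()))
```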

I checked OP's example and get correct results. However, for my own data I do not get the correct results.

Here is some simple example code and data which is causing this issue for me.

With the arguments provided in the code, one of the samples is assigned differently. With default arguments (no kwargs) when initialising the HDBSCAN clusterer, all data from sample_2.csv becomes unclassified (-1), much like OP's issue.

sample_files.zip

Feb 01 '21 01:02 joloppo

bump

Feb 16 '21 15:02 joloppo
