Difference in clustering 0.8.27 vs 0.8.28

Open SergioG-M opened this issue 3 years ago • 9 comments

I have noticed a difference in clustering between the new 0.8.28 and 0.8.27. I ran the same code on seven files, and the results are different (sometimes quite different). Has any default parameter changed in the new version? I can't find any documentation of the changes. If not, has anything else changed that could explain this?

hbdscan_version.docx

SergioG-M avatar Feb 10 '22 14:02 SergioG-M

Nothing obvious leaps to mind. Can you give a small test case or specific example?

lmcinnes avatar Feb 10 '22 14:02 lmcinnes

Using

from hdbscan import HDBSCAN

# `sample` holds the data from the attached sample.csv
HDBSCAN(
    min_cluster_size=150,
    min_samples=100,
    allow_single_cluster=True,
    cluster_selection_method="leaf",
    cluster_selection_epsilon=0.1385159128651575,
    core_dist_n_jobs=-1,
).fit_predict(sample)

With the attached file, I get one cluster fewer than with the previous version (basically, two clusters are now merged, although other files show further differences).

sample.csv

BTW, I consider the result of the previous version to be better for this file.
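To put a number on "one cluster fewer", the label arrays from the two versions can be saved in each environment and compared afterwards. The following is only a minimal sketch using the standard library: the function names `summarize` and `adjusted_rand_index` are hypothetical helpers, and it assumes the usual HDBSCAN convention that noise points carry the label -1.

```python
from collections import Counter
from math import comb


def summarize(labels):
    """Return (number of clusters, number of noise points).

    Assumes noise points are labelled -1, as HDBSCAN does.
    """
    labels = [int(l) for l in labels]
    clusters = {l for l in labels if l != -1}
    noise = sum(1 for l in labels if l == -1)
    return len(clusters), noise


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same points.

    1.0 means the partitions are identical up to renaming of labels.
    Noise (-1) is treated here as just another group.
    """
    a = [int(l) for l in a]
    b = [int(l) for l in b]
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    total = comb(len(a), 2)
    if total == 0:
        return 1.0
    # Pairs of points grouped together in both, in a, and in b.
    together_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    together_a = sum(comb(c, 2) for c in Counter(a).values())
    together_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = together_a * together_b / total
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

In practice the labels returned by fit_predict under 0.8.27 and 0.8.28 would be written to disk (for example with numpy.save) and then passed to both helpers; a value well below 1.0 indicates that the partitions genuinely differ rather than just being renumbered.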

SergioG-M avatar Feb 10 '22 15:02 SergioG-M

I'll try to look into this when I get some time.

lmcinnes avatar Feb 11 '22 15:02 lmcinnes

Just to add to this, I have found that changing the scikit-learn version also changes the clustering (I'm switching between 0.23.2 and 1.0.2).
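When several packages change at once, it helps to record the installed versions next to each clustering result. A minimal sketch, assuming Python 3.8 or later; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the dependencies that pip reports for hdbscan:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version (None if absent)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["hdbscan", "scikit-learn", "numpy", "scipy", "joblib", "cython"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver}")
```

Saving this output together with the labels makes it possible to tell later which version change coincided with which change in the clustering.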

SergioG-M avatar Feb 15 '22 09:02 SergioG-M

I have had the same problem with 0.8.28. The scikit-learn version does not seem to have an impact, as I have tried both with the same version of scikit-learn (1.0.2).

In my case, the number of clusters went from 19 (in 0.8.27) to 2 (in 0.8.28). The items in the clusters no longer made as much sense. The dependencies in my setup are the same (see logs below), so it should be something in the code base that changed.

Install log for version 0.8.27

Requirement already satisfied: joblib>=1.0 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.1.0)
Requirement already satisfied: scikit-learn>=0.20 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.0.2)
Requirement already satisfied: numpy>=1.16 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.22.2)
Requirement already satisfied: six in c:\users\danie\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.16.0)
Requirement already satisfied: scipy>=1.0 in c:\users\danie\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.8.0)
Requirement already satisfied: cython>=0.27 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (0.29.28)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users...\miniconda3\envs\nap\lib\site-packages (from scikit-learn>=0.20->hdbscan==0.8.27) (3.1.0)
Building wheels for collected packages: hdbscan
Building wheel for hdbscan (PEP 517) ... done
Created wheel for hdbscan: filename=hdbscan-0.8.27-cp38-cp38-win_amd64.whl size=599380 sha256=7de3bd7b2bbde5f42d1cffd896bc053c0b8b7c019c7ba86d406c50274364f453
Stored in directory: c:\users\danie\appdata\local\pip\cache\wheels\26\f2\c2\eab587fff76dc9ffc9a9bf3ca0e44e26d2ef6425264492df65
Successfully built hdbscan
Installing collected packages: hdbscan
Successfully installed hdbscan-0.8.27

Install log for 0.8.28

Collecting hdbscan==0.8.28
Using cached hdbscan-0.8.28-cp38-cp38-win_amd64.whl
Requirement already satisfied: joblib>=1.0 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (1.1.0)
Requirement already satisfied: scikit-learn>=0.20 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (1.0.2)
Requirement already satisfied: scipy>=1.0 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (1.8.0)
Requirement already satisfied: cython>=0.27 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (0.29.28)
Requirement already satisfied: numpy>=1.20 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (1.22.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:...\danie\miniconda3\envs\nap\lib\site-packages (from scikit-learn>=0.20->hdbscan==0.8.28) (3.1.0)
Installing collected packages: hdbscan
Successfully installed hdbscan-0.8.28

ResearchDaniel avatar Feb 28 '22 08:02 ResearchDaniel

@ResearchDaniel I ran into the same problem as you. Do you know what the cause is? I fell back to version 0.8.27, but it was incompatible with my NumPy version.

chengkong2work avatar Mar 05 '22 07:03 chengkong2work

> @ResearchDaniel I ran into the same problem as you. Do you know what the cause is? I fell back to version 0.8.27, but it was incompatible with my NumPy version.

Unfortunately, I do not know the cause of the problem. I had no issues with the NumPy version when switching to 0.8.27, so it is probably due to your specific setup.

ResearchDaniel avatar Mar 05 '22 12:03 ResearchDaniel

It is highly likely that the change in clustering results is related to this patch, which fixed a small but long-standing bug to bring certain Boruvka algorithms in line with the standard algorithm. If that is the case, it means that the "desired" clustering is the result of a small bug. Since fixing that bug ensures consistent results among all the different internal algorithms (and the reference implementation), I don't foresee reverting it.
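One way to check whether a given dataset is affected is to run the same parameters through different internal algorithms and see whether the partitions agree. This is only a sketch under assumptions: the thread does not say which Boruvka variants were patched, `same_partition` and `compare_algorithms` are hypothetical helpers, and "generic" and "boruvka_kdtree" are used as two of the documented values of the `algorithm` parameter of HDBSCAN.

```python
def same_partition(a, b):
    """True if two labelings group the points identically.

    Cluster numbering is ignored, but noise (-1) must match noise.
    """
    a = [int(x) for x in a]
    b = [int(x) for x in b]
    if len(a) != len(b):
        return False
    if any((x == -1) != (y == -1) for x, y in zip(a, b)):
        return False
    # The label mapping is one-to-one exactly when the number of distinct
    # (a, b) pairs equals the number of distinct labels on each side.
    return len(set(zip(a, b))) == len(set(a)) == len(set(b))


def compare_algorithms(data, algorithms=("generic", "boruvka_kdtree"), **params):
    """Fit HDBSCAN once per algorithm and report whether all partitions agree."""
    from hdbscan import HDBSCAN  # imported lazily; requires hdbscan installed

    results = [
        HDBSCAN(algorithm=name, **params).fit_predict(data)
        for name in algorithms
    ]
    return all(same_partition(results[0], other) for other in results[1:])
```

Calling compare_algorithms(sample, min_cluster_size=150, min_samples=100) in each installed version would show whether the algorithms disagree under the older release and agree under the newer one. Note that the "generic" algorithm computes a full distance matrix, so it may be impractical for large datasets.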

lmcinnes avatar Mar 08 '22 13:03 lmcinnes

Thanks for looking into it! I have done some comparisons using leaf clusters (which is what we mainly use), and there the results are almost the same (a difference of 1 cluster or so, and the clusters that are extracted seem to be basically identical).

As far as I am concerned, the issue can be closed, but I understand that this can be confusing for other users who upgrade and use the default excess of mass (eom) method.

ResearchDaniel avatar Mar 13 '22 08:03 ResearchDaniel