Difference in clustering 0.8.27 vs 0.8.28
I have noticed a difference in clustering between 0.8.27 and the new 0.8.28. I ran the same code on seven files, and the results differ (sometimes considerably). Has any default parameter changed in the new version? I can't find any documentation of the changes. If not, has anything else changed that could explain this?
Nothing obvious leaps to mind. Can you give a small test case or specific example?
Using

from hdbscan import HDBSCAN

# `sample` is the data loaded from the attached file
HDBSCAN(
    min_cluster_size=150,
    min_samples=100,
    allow_single_cluster=True,
    cluster_selection_method="leaf",
    cluster_selection_epsilon=0.1385159128651575,
    core_dist_n_jobs=-1,
).fit_predict(sample)
With the attached file I get one cluster fewer than with the previous version (essentially, two clusters are now merged; other files show further differences).
BTW, I consider the previous version's result to be better for this file.
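For anyone who wants to quantify a difference like this rather than eyeball it, here is a minimal sketch that compares two label arrays, for example the `fit_predict` output saved from each version. It uses only the standard library. The function names `count_clusters` and `rand_index` are made up for illustration and are not part of hdbscan. The sketch assumes noise points carry the label -1 and treats them as one group of their own.

```python
from collections import Counter


def count_clusters(labels):
    """Number of distinct clusters, ignoring the noise label -1."""
    return len({label for label in labels if label != -1})


def _pairs_within_groups(counter):
    """Number of point pairs that share a group, summed over all groups."""
    return sum(size * (size - 1) // 2 for size in counter.values())


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree.

    A pair agrees if both labelings put the two points together, or both
    put them apart. Cluster numbering does not matter, so relabelled but
    otherwise identical clusterings score 1.0.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    if n < 2:
        return 1.0
    total_pairs = n * (n - 1) // 2
    together_in_both = _pairs_within_groups(Counter(zip(labels_a, labels_b)))
    together_in_a = _pairs_within_groups(Counter(labels_a))
    together_in_b = _pairs_within_groups(Counter(labels_b))
    # Pairs apart in both = all pairs minus pairs together in at least one.
    apart_in_both = total_pairs - together_in_a - together_in_b + together_in_both
    return (together_in_both + apart_in_both) / total_pairs
```

A value close to 1.0 together with a cluster count that differs by one would point to a merge like the one described above, rather than a wholesale reshuffle of points.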
I'll try to look into this when I get some time.
Just to add to this, I have found that changing the scikit-learn version also changes the clustering (I'm switching between 0.23.2 and 1.0.2).
I have had the same problem with 0.8.28. The scikit-learn version does not seem to have an impact, as I tried both hdbscan versions with the same version of scikit-learn (1.0.2).
In my case, the number of clusters went from 19 (in 0.8.27) to 2 (in 0.8.28), and the items in the clusters no longer made as much sense. The dependencies in my setup are the same (see logs below), so it should be something in the code base that changed.
Install log for version 0.8.27
Requirement already satisfied: joblib>=1.0 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.1.0)
Requirement already satisfied: scikit-learn>=0.20 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.0.2)
Requirement already satisfied: numpy>=1.16 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.22.2)
Requirement already satisfied: six in c:\users\danie\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.16.0)
Requirement already satisfied: scipy>=1.0 in c:\users\danie\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (1.8.0)
Requirement already satisfied: cython>=0.27 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.27) (0.29.28)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users...\miniconda3\envs\nap\lib\site-packages (from scikit-learn>=0.20->hdbscan==0.8.27) (3.1.0)
Building wheels for collected packages: hdbscan
Building wheel for hdbscan (PEP 517) ... done
Created wheel for hdbscan: filename=hdbscan-0.8.27-cp38-cp38-win_amd64.whl size=599380 sha256=7de3bd7b2bbde5f42d1cffd896bc053c0b8b7c019c7ba86d406c50274364f453
Stored in directory: c:\users\danie\appdata\local\pip\cache\wheels\26\f2\c2\eab587fff76dc9ffc9a9bf3ca0e44e26d2ef6425264492df65
Successfully built hdbscan
Installing collected packages: hdbscan
Successfully installed hdbscan-0.8.27
Install log for 0.8.28
Collecting hdbscan==0.8.28
Using cached hdbscan-0.8.28-cp38-cp38-win_amd64.whl
Requirement already satisfied: joblib>=1.0 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (1.1.0)
Requirement already satisfied: scikit-learn>=0.20 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (1.0.2)
Requirement already satisfied: scipy>=1.0 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (1.8.0)
Requirement already satisfied: cython>=0.27 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (0.29.28)
Requirement already satisfied: numpy>=1.20 in c:\users...\miniconda3\envs\nap\lib\site-packages (from hdbscan==0.8.28) (1.22.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:...\danie\miniconda3\envs\nap\lib\site-packages (from scikit-learn>=0.20->hdbscan==0.8.28) (3.1.0)
Installing collected packages: hdbscan
Successfully installed hdbscan-0.8.28
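When comparing environments like this, it can help to record the installed versions programmatically next to each clustering result, so dependency drift can be ruled out later. The following is a small sketch using the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_versions` and the default package list are assumptions chosen to match the packages in the logs above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(
    packages=("hdbscan", "scikit-learn", "numpy", "scipy", "joblib", "cython"),
):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver}")
```

Running this in both environments and diffing the output should show at a glance whether anything other than hdbscan itself differs.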
@ResearchDaniel I ran into the same problem as you. Do you know what causes it? I fell back to version 0.8.27, but it was incompatible with my NumPy version.
Unfortunately, I do not know the cause of the problem. I had no issues with the NumPy version when switching to 0.8.27, so it is probably due to your specific setup.
It is highly likely that the change in clustering results is related to this patch, which fixed a small but long-standing bug to bring certain Boruvka algorithms in line with the standard algorithm. If that is the case, it means that the "desired" clustering is the result of a small bug. Since fixing that bug ensures consistent results among all the different internal algorithms (and the reference implementation), I don't foresee reverting it.
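One way to probe this explanation on your own data is to force each internal algorithm through the `algorithm` parameter of `hdbscan.HDBSCAN` and compare the resulting cluster counts. This is only a sketch: the helper name `cluster_counts_by_algorithm` is made up, and whether the counts actually diverge in a given version depends on the data and on which code path the patch touched.

```python
# Values accepted by the `algorithm` parameter of hdbscan.HDBSCAN
# (besides the default "best", which picks one of these automatically).
ALGORITHMS = (
    "generic",
    "prims_kdtree",
    "prims_balltree",
    "boruvka_kdtree",
    "boruvka_balltree",
)


def cluster_counts_by_algorithm(data, algorithms=ALGORITHMS, **hdbscan_params):
    """Fit HDBSCAN once per algorithm and return {algorithm: cluster count}.

    Extra keyword arguments (min_cluster_size, min_samples, ...) are passed
    through unchanged. Note that "generic" builds a full distance matrix,
    so it may be too memory-hungry for large inputs.
    """
    # Imported lazily so this module loads even where hdbscan is missing.
    from hdbscan import HDBSCAN

    counts = {}
    for name in algorithms:
        labels = HDBSCAN(algorithm=name, **hdbscan_params).fit_predict(data)
        # Noise points are labelled -1 and are not counted as a cluster.
        counts[name] = len({int(label) for label in labels if label != -1})
    return counts
```

If the Boruvka variants disagree with the others under 0.8.27 but all agree under 0.8.28, that would be consistent with the bug fix described above.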
Thanks for looking into it! I have done some comparisons using leaf clusters (which is what we mainly use), and there the results are almost the same (a difference of one cluster or so, and the extracted clusters seem to be basically identical).
As far as I'm concerned, the issue can be closed, but I understand that this can be confusing for other users who upgrade and use the default excess of mass (eom) method.