Different exemplars from same clusters with Numpy 2 on different platforms
What
After upgrading from numpy 1 to numpy 2, I found that the clustering results differ across platforms. This did not happen with numpy 1. I also tried setting the numpy seed and PYTHONHASHSEED, and neither helped.
How to reproduce
Poetry dependencies:
# poetry.toml
[tool.poetry.dependencies]
python = "^3.12"
pandas = "^2.2.2"
numpy = "^1.26.4"
hdbscan = ">=0.8.38"
scikit-learn = "^1.5.1"
The issue appeared when I upgraded from numpy 1.26.4 to numpy 2.1.1 while keeping all other packages the same.
You can reproduce it with this data by reading it into a dataframe and then running HDBSCAN.fit(df) with cluster_selection_epsilon = 0.15 plus the parameters in the json file.
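The reproduction steps above could be sketched roughly as follows. This is only a sketch: the helper names `build_hdbscan_kwargs` and `fit_hdbscan` are hypothetical, and it assumes the json file holds a flat object of `hdbscan.HDBSCAN` constructor keyword arguments and that `df` is the dataframe read from the data.

```python
import json


def build_hdbscan_kwargs(params_json, cluster_selection_epsilon=0.15):
    """Merge the JSON parameters with the epsilon value used in this report.

    Assumption: params_json is the text of a flat JSON object whose keys
    are hdbscan.HDBSCAN constructor arguments.
    """
    kwargs = dict(json.loads(params_json))
    kwargs["cluster_selection_epsilon"] = cluster_selection_epsilon
    return kwargs


def fit_hdbscan(df, kwargs):
    """Fit HDBSCAN on the dataframe and return the fitted clusterer."""
    import hdbscan  # imported lazily so the helper above works without it

    clusterer = hdbscan.HDBSCAN(**kwargs)
    clusterer.fit(df)
    return clusterer
```

With the data and the parameter file loaded, `fit_hdbscan(df, build_hdbscan_kwargs(params_text))` would return a clusterer whose `labels_` and `exemplars_` can then be inspected on each platform.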
The platform name is printed with platform.platform().
- On Linux-6.5.11-linuxkit-x86_64-with-glibc2.36, the exemplars for cluster 4 have 10 items (this is running on Apple M2).
- On Linux-5.10.223-212.873.amzn2.x86_64-x86_64-with-glibc2.36, the exemplars for cluster 4 have only 5 items (this is running on one of the AWS machines, but it seems to happen on all the EC2 instances we have).
Both returned the same clusters; only the exemplars are different. Also, with numpy 1 they returned the same exemplars.
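To compare the two machines more precisely than by counts alone, something like the following could be run on each platform and the output diffed. It is a sketch, not part of hdbscan: `exemplar_summary` and `environment_info` are hypothetical helpers, the rounding to 6 decimals is an arbitrary choice, and it assumes the index into `clusterer.exemplars_` corresponds to the cluster label.

```python
import hashlib
import json
import platform


def environment_info():
    """Collect the platform string and, if importable, the numpy version."""
    info = {"platform": platform.platform()}
    try:
        import numpy
        info["numpy"] = numpy.__version__
    except ImportError:
        info["numpy"] = None
    return info


def exemplar_summary(exemplars, decimals=6):
    """Per-cluster exemplar count plus a short digest of the exemplar points.

    Rows are rounded and sorted first, so the digest ignores row order and
    tiny floating-point noise. Assumption: list index == cluster label.
    """
    summary = {}
    for cluster_id, points in enumerate(exemplars):
        rows = sorted(
            tuple(round(float(x), decimals) for x in row) for row in points
        )
        digest = hashlib.sha256(json.dumps(rows).encode()).hexdigest()[:12]
        summary[cluster_id] = {"count": len(rows), "digest": digest}
    return summary
```

Printing `environment_info()` together with `exemplar_summary(clusterer.exemplars_)` on both machines would show which clusters have differing exemplar sets, and whether the larger set contains the smaller one would still need a direct comparison of the points.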