
Different exemplars from same clusters with Numpy 2 on different platforms

Open changhsinlee opened this issue 5 months ago • 0 comments

What

I found that after upgrading from numpy 1 to numpy 2, the clustering results (specifically the exemplars) differ across platforms. This did not happen on numpy 1. I also tried setting numpy seeds and PYTHONHASHSEED, and neither helped.

How to reproduce

Poetry dependencies:

# pyproject.toml
[tool.poetry.dependencies]
python = "^3.12"
pandas = "^2.2.2"
numpy = "^1.26.4"
hdbscan = ">=0.8.38"
scikit-learn = "^1.5.1"

The issue appeared when I upgraded from numpy 1.26.4 to numpy 2.1.1 while keeping all other packages the same.

You can reproduce it by reading this data into a dataframe, then running HDBSCAN(...).fit(df) with cluster_selection_epsilon = 0.15 plus the parameters in the JSON file.

data.json
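A minimal sketch of the reproduction steps described above. The layout of data.json is an assumption: the top-level keys "data" and "params" are hypothetical and need to be adjusted to the actual attachment. It also assumes that position i in exemplars_ corresponds to cluster label i.

```python
import json


def load_inputs(path):
    """Load the attachment. The "data" and "params" keys are hypothetical."""
    with open(path) as f:
        payload = json.load(f)
    return payload["data"], payload.get("params", {})


def build_kwargs(params, epsilon=0.15):
    """Combine the parameters from the JSON file with cluster_selection_epsilon."""
    kwargs = dict(params)
    kwargs["cluster_selection_epsilon"] = epsilon
    return kwargs


def exemplar_counts(exemplars):
    """Map each position in the exemplar list to its number of exemplar points."""
    return {i: len(points) for i, points in enumerate(exemplars)}


def run(path="data.json"):
    """Fit HDBSCAN on the data and return the exemplar count per cluster."""
    # Imported lazily so the helpers above work without these packages installed.
    import hdbscan
    import pandas as pd

    records, params = load_inputs(path)
    df = pd.DataFrame(records)
    clusterer = hdbscan.HDBSCAN(**build_kwargs(params)).fit(df)
    return exemplar_counts(clusterer.exemplars_)
```

Calling run() on each machine and comparing the returned dictionaries should show whether the count for cluster 4 differs.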

The platform name is printed with platform.platform()

  • On Linux-6.5.11-linuxkit-x86_64-with-glibc2.36, the exemplars for cluster 4 have 10 items (this is running on an Apple M2)
  • On Linux-5.10.223-212.873.amzn2.x86_64-x86_64-with-glibc2.36, the exemplars for cluster 4 have only 5 items (this is running on one of the AWS machines, but it seems to happen on all EC2 instances we have)
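To make the comparison across machines easier, something like the following could be run on each platform. It is only a sketch: exemplar_digest and environment_report are hypothetical helper names, and the rounding precision is an arbitrary choice meant to reduce sensitivity to tiny floating-point noise. It expects exemplars as a list of clusters, each a sequence of points, such as the list returned by the exemplars_ attribute.

```python
import hashlib
import json
import platform


def exemplar_digest(exemplars, decimals=6):
    """Hash the exemplars after rounding and sorting the points in each cluster."""
    normalized = [
        sorted([round(float(v), decimals) for v in point] for point in cluster)
        for cluster in exemplars
    ]
    blob = json.dumps(normalized).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def environment_report(exemplars):
    """Collect the platform, numpy version, and an exemplar summary in one dict."""
    try:
        import numpy
        numpy_version = numpy.__version__
    except ImportError:
        numpy_version = None
    return {
        "platform": platform.platform(),
        "numpy": numpy_version,
        "exemplar_counts": [len(cluster) for cluster in exemplars],
        "exemplar_digest": exemplar_digest(exemplars),
    }
```

If two machines report the same numpy version but different exemplar_counts or exemplar_digest values, the difference comes from the platform rather than the package versions.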

Both returned the same clusters; only the exemplars are different. Also, on numpy 1 they returned the same exemplars.

changhsinlee, Sep 04 '24 15:09