
pypi version throws ValueError

Open • FinnHuelsbusch opened this issue on Aug 01, 2023 • 27 comments

To reproduce the bug:

  1. Create a new Python 3.11.x environment (tested with Python 3.11.4).
  2. Install the following dependencies:
  • scipy 1.11.1
  • scikit-learn 1.3.0
  • cython 0.29.36
  • hdbscan 0.8.33
  3. Create a minimal example:
from sklearn.datasets import make_blobs
import hdbscan
blobs, labels = make_blobs(n_samples=2000, n_features=10)
clusterer = hdbscan.HDBSCAN()
clusterer.fit(blobs)
print(clusterer.labels_)
  4. Execute it and get the following error:
Traceback (most recent call last):
File "/home/***/Desktop/hdbscan_test.py", line 5, in <module>
    clusterer.fit(blobs)
  File "/home/***/micromamba/envs/hdbscan3/lib/python3.11/site-packages/hdbscan/hdbscan_.py", line 1205, in fit
    ) = hdbscan(clean_data, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/micromamba/envs/hdbscan3/lib/python3.11/site-packages/hdbscan/hdbscan_.py", line 884, in hdbscan
    _tree_to_labels(
  File "/home/***/micromamba/envs/hdbscan3/lib/python3.11/site-packages/hdbscan/hdbscan_.py", line 80, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(
                                         ^^^^^^^^^^^^^
  File "hdbscan/_hdbscan_tree.pyx", line 659, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
TypeError: 'numpy.float64' object cannot be interpreted as an integer

Workaround:

  1. Clone the repo.
  2. Uninstall hdbscan from the environment.
  3. Execute python setup.py install while the environment is active.
  4. Execute the minimal example again.
  5. It works.

This was also tested with commit 813636b2eda63739c9fc081f2ef78ad4c98444a1 (the commit for version 0.8.33).

It would be nice to get instructions on how to fix this (if the error is on my side), or to have it fixed in general.

Tested on Windows and Linux. This error only occurs under python 3.11.x.

FinnHuelsbusch avatar Aug 01 '23 10:08 FinnHuelsbusch

The error message seems similar to one mentioned in the comments of #600, and to its fix in #602, though both concern the condense_tree function.

FinnHuelsbusch avatar Aug 01 '23 11:08 FinnHuelsbusch

I have the same error with both 0.8.29 and 0.8.33.

empowerVictor avatar Aug 02 '23 12:08 empowerVictor

Absolutely, my version of Python is also 3.11.x. I have the same error, but after trying this method I get another error: ModuleNotFoundError: No module named 'hdbscan._hdbscan_linkage'

Using python setup.py develop instead of python setup.py install solved this problem for me.

LoveFishoO avatar Aug 08 '23 07:08 LoveFishoO

Maybe #606 helps with this error.

FinnHuelsbusch avatar Aug 08 '23 09:08 FinnHuelsbusch

I also replicated the bug on Windows. Packages were installed from PyPI; the base virtual environment was created with miniconda.

Bug occurs:

  • Python 3.11.x
  • scikit-learn 1.3.0
  • hdbscan 0.8.33
  • numpy 1.24.4
from sklearn.datasets import make_blobs
import hdbscan
blobs, labels = make_blobs(n_samples=2000, n_features=10)
clusterer = hdbscan.HDBSCAN()
clusterer.fit(blobs)
print(clusterer.labels_)

Error:

File hdbscan\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

The bug can be avoided by switching to the (slower) Python 3.10.x and downgrading scikit-learn, while keeping the same hdbscan and numpy versions.

No errors:

  • Python 3.10.x
  • scikit-learn 1.2.1
  • hdbscan 0.8.33
  • numpy 1.24.4

Revised 15 August, 2023

jkmackie avatar Aug 10 '23 03:08 jkmackie

I am also getting this error on Windows builds. This seems like a pretty urgent issue. @lmcinnes or @gclendenning, forgive the @, but you may want to take a look at this.

RichieHakim avatar Aug 14 '23 23:08 RichieHakim

So this line: https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/_hdbscan_tree.pyx#L733

is_cluster = {cluster: True for cluster in node_list}

node_list is constructed above:

    if allow_single_cluster:
        node_list = sorted(stability.keys(), reverse=True)
    else:
        node_list = sorted(stability.keys(), reverse=True)[:-1]
        # (exclude root)

and stability is from https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/_hdbscan_tree.pyx#L164, see return https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/_hdbscan_tree.pyx#L237-L241

    result_pre_dict = np.vstack((np.arange(smallest_cluster,
                                           condensed_tree['parent'].max() + 1),
                                 result_arr)).T

    return dict(result_pre_dict)

np.arange should have an integer dtype, I think; result_arr has dtype np.double.

I am not sure whether the np.vstack might be casting the integer keys to floats due to the result_arr dtype (I might check this later); I can't see anything obvious in numpy that would have changed this behaviour.
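
For what it's worth, here is a minimal sketch of that promotion, with made-up cluster ids and stability values: stacking an integer arange with a float64 array upcasts both, so the dict built from the stacked result ends up with numpy.float64 keys.

import numpy as np

keys = np.arange(378, 381)               # integer cluster ids (made up)
stabilities = np.array([0.5, 1.2, 0.9])  # float64 stability values (made up)
stacked = np.vstack((keys, stabilities)).T  # int64 is promoted to float64 here
stability = dict(stacked)                # so the dict keys are numpy.float64
print(stacked.dtype)                     # float64
print(list(stability.keys()))            # float keys: 378.0, 379.0, 380.0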

johnlees avatar Aug 16 '23 08:08 johnlees

@jkmackie thanks for the solution mate! appreciate it.

JanElbertMDavid avatar Aug 16 '23 11:08 JanElbertMDavid

At least some of the issues seem to be related to the wheel built for windows (and python 3.11). I have deleted that from PyPI. The downside is that installing on windows will require you to build from source; the upside is that hopefully installing from PyPI might work now.

lmcinnes avatar Aug 16 '23 15:08 lmcinnes

Just to confirm, I am also seeing this on an Ubuntu 22.04 CI with:

  • hdbscan 0.8.33
  • python 3.10.12
  • scikit-learn 1.3.0
  • numpy 1.22.4

johnlees avatar Aug 16 '23 15:08 johnlees

b .../lib/python3.10/site-packages/hdbscan/hdbscan_.py:80
p stability_dict.keys()
dict_keys([378.0, 379.0, 380.0, 381.0, 382.0, 383.0, 384.0, 385.0, 386.0, 387.0, 388.0, 389.0, 390.0, 391.0, 392.0, 393.0, 394.0])

Not sure if those keys being floats is the problem here.
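
For context, a numpy.float64 used anywhere Python expects an integer raises exactly the message from the traceback; a tiny sketch with a made-up key:

import numpy as np

cluster = np.float64(378.0)  # a float key like the ones above (made up)
try:
    range(cluster)           # range() requires a true integer
except TypeError as err:
    print(err)  # 'numpy.float64' object cannot be interpreted as an integer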

johnlees avatar Aug 16 '23 16:08 johnlees

@johnlees I suspect downgrading scikit-learn below 1.3 would fix this on Ubuntu. Numpy 1.22.4 is used in the successful Windows configuration below:

#Successful configuration - Windows 10.

(myvirtualenv) 
me@mypc MINGW64 ~/embedding_clustering
$ conda list | grep -w '^python\s\|scikit\|hdbscan\|numpy'
hdbscan                   0.8.33                   pypi_0    pypi
numpy                     1.24.4                   pypi_0    pypi
python                    3.10.9          h4de0772_0_cpython    conda-forge
scikit-learn              1.2.1                    pypi_0    pypi

Note hdbscan is imported separately from scikit-learn. I wonder why it isn't imported as a module like KMeans?

#from package.subpackage import module
from sklearn.cluster import KMeans

#in contrast, hdbscan cluster algo is imported directly
import hdbscan
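
As a side note, scikit-learn 1.3 added its own HDBSCAN estimator to sklearn.cluster, importable like KMeans; it is a separate implementation from the standalone hdbscan package discussed in this issue. A minimal sketch:

from sklearn.cluster import HDBSCAN   # requires scikit-learn >= 1.3
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, n_features=10)
labels = HDBSCAN().fit_predict(X)     # noise points are labelled -1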

jkmackie avatar Aug 16 '23 16:08 jkmackie

Same issue with scikit-learn 1.2.2 and 1.2.1, and other packages as above. I'm guessing this is a cython issue with the pyx files?

johnlees avatar Aug 16 '23 16:08 johnlees

This is really quirky, and I am having a great deal of trouble reproducing it in a way that I can actually debug it myself.

lmcinnes avatar Aug 16 '23 19:08 lmcinnes

Removing the pre-built wheel for Windows on PyPI was sufficient to get it working on my GitHub Actions Windows runners.

If it is helpful, here is an example of when it was failing: https://github.com/RichieHakim/ROICaT/actions/runs/5861440405/job/15891513454

Thank you for the quick fix.

RichieHakim avatar Aug 16 '23 19:08 RichieHakim

Removing the pre-built wheels and building from source didn't solve the bug for me

alxfgh avatar Aug 16 '23 20:08 alxfgh

Removing the pre-built wheels and building from source didn't solve the bug for me

Did you try a fresh environment?

conda create -n testenv python=3.11

pip install hdbscan==0.8.33 numpy==1.24.4 notebook==7.0.2 scikit-learn==1.3.0

Cython should be something like 0.29.26 not 3.0.

If there's a hdbscan error, try:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

jkmackie avatar Aug 16 '23 23:08 jkmackie

This is really quirky, and I am having a great deal of trouble reproducing it in a way that I can actually debug it myself.

Likewise: doing the install from source (rebuilding the Cython-generated .so libraries) makes the issue go away. I have floats in the line reported by the backtrace, and I'm not sure that's the correct erroring line anyway. I might try rebuilding the conda-forge version and see if that helps.

johnlees avatar Aug 22 '23 16:08 johnlees

We have a new azure-pipelines CI system that will automatically build wheels and publish them to PyPI, thanks to @gclendenning, so hopefully the next time we make a release this will all work a little better. It is definitely just quirks in exactly how things build on different platforms, etc., but the fine details of that are ... hard to sort out.

lmcinnes avatar Aug 22 '23 17:08 lmcinnes

Ah, maybe I should have been clearer: I am having issues with the conda version, not PyPI. Unfortunately the rebuild on conda-forge didn't sort out the CI issue; it's still the same error.

johnlees avatar Aug 23 '23 08:08 johnlees

The conda forge recipe might need to be changed. Potentially adding a version restriction to Cython in the recipe itself (since it may not use the build isolation that pip install does) might help.

lmcinnes avatar Aug 23 '23 14:08 lmcinnes

The conda forge recipe might need to be changed. Potentially adding a version restriction to Cython in the recipe itself (since it may not use the build isolation that pip install does) might help.

Thanks for the pointer, this seems to have fixed it! It looks like we can pin cython<3 at build time but leave the version unrestricted at run time, and it works. I also added a run test to the recipe, which I hope will flag such an issue in future releases.

johnlees avatar Aug 24 '23 10:08 johnlees

Hi all, having trouble understanding what to do here (I installed HDBSCAN 2 days ago through Conda and I'm currently experiencing this issue). Can I remove and reinstall HDBSCAN through Conda at this point to solve the problem? If so, do I also need to remove and reinstall anything else? Cython? Thank you.

Gr4dient avatar Aug 25 '23 04:08 Gr4dient

@Gr4dient I would reinstall HDBSCAN in that environment, or even just try a fresh conda environment. I hope to have fixed it in the _3 builds of 0.8.33 (when you run conda list, the hdbscan version should end in _3).

johnlees avatar Aug 25 '23 07:08 johnlees

Hi John, thanks for clarifying - it took several hours for Conda to find a solution to remove Cython and HDBSCAN from my NLP environment last night... not sure why it got so hung up. I'm not seeing '_3' on conda-forge; will that be available at some point soon? Thanks

Gr4dient avatar Aug 25 '23 18:08 Gr4dient

The new builds are on conda forge, e.g. in my working environment conda list shows:

hdbscan                    0.8.33        py310h1f7b6fc_3          conda-forge

If you are having trouble with time taken to resolve environments I would recommend using mamba instead of conda, or just starting over with a new environment, or both.

johnlees avatar Aug 29 '23 08:08 johnlees

I can also reproduce this with a from-source build on Fedora 39:

# dnf install python3-devel python3-Cython python3-numpy python3-scipy python3-scikit-learn python3-setuptools gcc
# curl -LO https://files.pythonhosted.org/packages/44/2c/b6bb84999f1c82cf0abd28595ff8aff2e495e18f8718b6b18bb11a012de4/hdbscan-0.8.33.tar.gz
# tar -xvzf hdbscan-0.8.33.tar.gz 
# (cd hdbscan-0.8.33 && python3 setup.py build -j8)
# cat <<END > test.py
import hdbscan
from sklearn.datasets import make_blobs
data, _ = make_blobs(1000)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)
assert len(cluster_labels) == 1000
END
# PYTHONPATH=hdbscan-0.8.33/build/lib.linux-x86_64-cpython-312/ python3 test.py
...
  File "//hdbscan-0.8.33/build/lib.linux-x86_64-cpython-312/hdbscan/hdbscan_.py", line 80, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(
                                         ^^^^^^^^^^^^^
  File "hdbscan/_hdbscan_tree.pyx", line 659, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
TypeError: 'numpy.float64' object cannot be interpreted as an integer

A hacky fix which works for me is to replace https://github.com/scikit-learn-contrib/hdbscan/blob/0.8.33/hdbscan/_hdbscan_tree.pyx#L726-L729 with

    if allow_single_cluster:
        node_list = sorted([int(x) for x in stability.keys()], reverse=True)
    else:
        node_list = sorted([int(x) for x in stability.keys()], reverse=True)[:-1]

benmwebb avatar Nov 14 '23 06:11 benmwebb