TypeError: 'numpy.float64' object cannot be interpreted as an integer

Open Gulfon opened this issue 1 year ago • 6 comments

Hi there,

When trying to run the example code I encounter the following:

from sklearn.datasets import fetch_20newsgroups
from top2vec import Top2Vec
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
2023-07-20 13:51:37,083 - top2vec - INFO - Pre-processing documents for training
2023-07-20 13:51:48,891 - top2vec - INFO - Creating joint document/word embedding
2023-07-20 14:01:43,811 - top2vec - INFO - Creating lower dimension embedding of documents
2023-07-20 14:02:09,146 - top2vec - INFO - Finding dense areas of documents

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/thedmitry/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.9/site-packages/top2vec/Top2Vec.py", line 666, in __init__
    self.compute_topics(umap_args=umap_args, hdbscan_args=hdbscan_args, topic_merge_delta=topic_merge_delta)
  File "/Users/thedmitry/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.9/site-packages/top2vec/Top2Vec.py", line 1266, in compute_topics
    cluster = hdbscan.HDBSCAN(**hdbscan_args).fit(umap_model.embedding_)
  File "/Users/thedmitry/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 1205, in fit
    ) = hdbscan(clean_data, **kwargs)
  File "/Users/thedmitry/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 884, in hdbscan
    _tree_to_labels(
  File "/Users/thedmitry/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 80, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(
  File "hdbscan/_hdbscan_tree.pyx", line 659, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters

TypeError: 'numpy.float64' object cannot be interpreted as an integer

All of the libraries are updated to the latest versions, and I have tried downgrading numpy and hdbscan, with no result.

I am fairly new to Python and not sure if there's something I am doing wrong here. I did see some discussion of this error on the hdbscan issues page, but their solution there was to upgrade to the most recent version, which did not help in my case.

Gulfon avatar Jul 20 '23 06:07 Gulfon
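
For readers unfamiliar with this message: it is the standard error Python raises when a float-typed value is passed where an integer index is required. The sketch below only illustrates that general class of error with `operator.index`. It does not reproduce what happens inside hdbscan's compiled `get_clusters` routine, and the helper name `as_index` is made up for this illustration.

```python
import operator

import numpy as np


def as_index(value):
    """Return value as a Python int if it supports __index__, else the error text.

    Hypothetical helper for illustration only; not part of numpy or hdbscan.
    """
    try:
        return operator.index(value)
    except TypeError as exc:
        return str(exc)


# NumPy integer scalars can be used wherever Python expects an integer.
print(as_index(np.int64(3)))

# NumPy float scalars cannot, even when they hold a whole number.
print(as_index(np.float64(3.0)))
```

The second call yields the same kind of message as the traceback above, which suggests that somewhere in the compiled code a float value is reaching a place that expects an integer.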

I am running into the same problem

BobTourne avatar Jul 21 '23 09:07 BobTourne

I have the same issue. All embedding models ran into this error. Using Python 3.10 right now!

sieu-tran avatar Jul 31 '23 20:07 sieu-tran

So, I switched to a different method, but encountered the same error there. I am using Python 3.11, so YMMV, but what helped me was installing older versions of a couple of libraries. Not sure if the second line is required for top2vec.

%pip install --user --no-warn-script-location --disable-pip-version-check Cython==0.29.34 numpy==1.23.5
%pip install --user --no-warn-script-location --disable-pip-version-check --no-build-isolation hdbscan==0.8.29

Gulfon avatar Jul 31 '23 23:07 Gulfon

Folks, I found the problem and a "fix"! It's actually a gcc and hdbscan problem: a C/C++ compiler seems to be a dependency for hdbscan. The fix for me was installing VC++ 2022 and adding the C++ Desktop Development package. pip install now works for hdbscan and enables top2vec to run properly. I hope this helps!

sieu-tran avatar Aug 01 '23 14:08 sieu-tran

For me this did not work. What did work was uninstalling hdbscan, then cloning the repository and installing it manually, as per https://github.com/scikit-learn-contrib/hdbscan/issues/607

jvanelteren avatar Aug 03 '23 20:08 jvanelteren

It is indeed a problem with HDBSCAN, related to this issue.

Updating HDBSCAN to 0.8.33 worked for me.

BobTourne avatar Aug 04 '23 08:08 BobTourne
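
Since the fixes reported in this thread all depend on which package versions ended up in the active environment, it may help to print them before and after reinstalling. Here is a minimal sketch using only the standard library; the function name `installed_versions` and the package list are illustrative choices, not anything provided by top2vec or hdbscan.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed.

    Hypothetical helper for illustration only.
    """
    result = {}
    for name in packages:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


# Packages mentioned in this thread; adjust the list as needed.
for name, version in installed_versions(["top2vec", "hdbscan", "numpy", "Cython"]).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Run this in the same interpreter or notebook kernel that raises the error, because a different environment (for example the r-reticulate conda environment in the original traceback) may have different versions installed.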