
TypeError: 'numpy.float64' object cannot be interpreted as an integer

Open · Cezary-Kuik opened this issue 2 years ago • 19 comments

Hey! I had the problem mentioned in this thread, and after the update that problem was solved. But another one appeared; now I get this error:

TypeError                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py](https://localhost:8080/#) in _cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
   3217             try:
-> 3218                 self.hdbscan_model.fit(umap_embeddings, y=y)
   3219             except TypeError:

9 frames
hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.condense_tree()

hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.condense_tree()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/hdbscan/hdbscan_.py](https://localhost:8080/#) in _tree_to_labels(X, single_linkage_tree, min_cluster_size, cluster_selection_method, allow_single_cluster, match_reference_implementation, cluster_selection_epsilon, max_cluster_size)
     76     set of labels and probabilities.
     77     """
---> 78     condensed_tree = condense_tree(single_linkage_tree, min_cluster_size)
     79     stability_dict = compute_stability(condensed_tree)
     80     labels, probabilities, stabilities = get_clusters(

hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.condense_tree()

hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.condense_tree()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

I checked the data that I put into the model and there is nothing in that format. What's more, I checked it on the file I was working on yesterday, which processed successfully then. Suddenly I am getting this error on it as well. Any ideas?

Cezary-Kuik avatar Jul 17 '23 19:07 Cezary-Kuik
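For context, this TypeError is raised whenever a numpy.float64 ends up somewhere Python requires an integer, which here happens inside hdbscan's compiled tree code rather than on your input data. A minimal illustration of the same error, reproduced outside hdbscan:

```python
import numpy as np

# The same TypeError, reproduced in isolation: a float64 cannot
# stand in for an integer (here, the stop argument of range()).
try:
    range(np.float64(5))
except TypeError as e:
    print(e)  # 'numpy.float64' object cannot be interpreted as an integer
```

So the error does not mean your documents or embeddings contain stray float64 values; it points at the installed hdbscan build.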

This is a problem with hdbscan, not BERTopic, and can be worked around with this method: https://github.com/scikit-learn-contrib/hdbscan/issues/600#issuecomment-1638837464

!pip install git+https://github.com/scikit-learn-contrib/hdbscan.git
!pip install BERTopic

jsalsman avatar Jul 17 '23 22:07 jsalsman

Did you solve it? Even after running !pip install git+https://github.com/scikit-learn-contrib/hdbscan.git and !pip install BERTopic, it is still not working.

shasha920 avatar Jul 18 '23 05:07 shasha920

You need to make sure to start from a completely fresh environment or uninstall BERTopic first.

MaartenGr avatar Jul 18 '23 05:07 MaartenGr

You need to make sure to start from a completely fresh environment or uninstall BERTopic first.

I want to say THANKS!!!!! That solved it!

shasha920 avatar Jul 18 '23 06:07 shasha920

This is a problem with hdbscan, not BERTopic, and can be worked around with this method: scikit-learn-contrib/hdbscan#600 (comment)

!pip install git+https://github.com/scikit-learn-contrib/hdbscan.git
!pip install BERTopic

Thanks too!

shasha920 avatar Jul 18 '23 06:07 shasha920

Mine still didn't work, even after !pip installing those two and opening a new notebook in Colab.

Got an error mainly on the line below:

topic_model = BERTopic().fit(corpus, corpus_embeddings)

Error: TypeError: 'numpy.float64' object cannot be interpreted as an integer

ssaee79 avatar Jul 18 '23 12:07 ssaee79

@ssaee79 The following is working for me in a fresh Google Colab:

First, you install BERTopic as follows:

!pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git
!pip install --upgrade BERTopic

Then, you restart the runtime to make sure that imports are refreshed.

Finally, the following code is working for me:

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Prepare embeddings
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Train our topic model using our pre-trained sentence-transformers embeddings
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)

MaartenGr avatar Jul 18 '23 13:07 MaartenGr

@ssaee79 The following is working for me in a fresh Google Colab: [...]

It works for me :) Thank you so much!

ssaee79 avatar Jul 18 '23 14:07 ssaee79

There is a new release of hdbscan on PyPI that will hopefully fix this now.

lmcinnes avatar Jul 19 '23 02:07 lmcinnes

👌 I can confirm that, thanks to the new release of hdbscan v0.8.33, BERTopic installation and execution both work fine.

Kcnarf avatar Jul 19 '23 07:07 Kcnarf
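To guard against regressing to a broken build, a quick numeric comparison of dotted version strings is enough. This is a sketch; `at_least` is a hypothetical helper, and real-world version parsing should use something like `packaging.version` instead:

```python
def at_least(installed: str, required: str) -> bool:
    """Numerically compare the first three components of dotted version strings."""
    parse = lambda s: tuple(int(p) for p in s.split(".")[:3])
    return parse(installed) >= parse(required)

# hdbscan v0.8.33 is the PyPI release reported above to contain the fix.
print(at_least("0.8.33", "0.8.33"))  # True
print(at_least("0.8.29", "0.8.33"))  # False
```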

@ssaee79 The following is working for me in a fresh Google Colab: [...]

Hey! I am still getting the same error. The same code works in Colab, but when I try it in Jupyter it does not. I have tried installing git+https://github.com/scikit-learn-contrib/hdbscan.git.

I am new to this. Do let me know if I'm doing something wrong. Thanks!

Rishi-Prakash-TS avatar Aug 07 '23 08:08 Rishi-Prakash-TS

@Rishi-Prakash-TS Did you try from a completely new environment? It often helps to start fresh and then do the installation of packages.

MaartenGr avatar Aug 07 '23 12:08 MaartenGr

Hi @MaartenGr, the issue got resolved. It was caused by a failure at the Building wheels for collected packages: hdbscan step during installation. Found the solution in your other replies. Thank you!

Rishi-Prakash-TS avatar Aug 10 '23 11:08 Rishi-Prakash-TS

I am experiencing the same issue now. I have tried the solutions listed here, but none have worked so far. For reference, I am using Jupyter Notebook. Here's the full output:


2023-08-17 15:23:13,260 - BERTopic - Transformed documents to Embeddings
2023-08-17 15:23:35,922 - BERTopic - Reduced dimensionality
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~\anaconda3\Lib\site-packages\bertopic\_bertopic.py:3218, in BERTopic._cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
   3217 try:
-> 3218     self.hdbscan_model.fit(umap_embeddings, y=y)
   3219 except TypeError:

File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:1205, in HDBSCAN.fit(self, X, y)
   1196 kwargs.update(self._metric_kwargs)
   1198 (
   1199     self.labels_,
   1200     self.probabilities_,
   1201     self.cluster_persistence_,
   1202     self._condensed_tree,
   1203     self._single_linkage_tree,
   1204     self._min_spanning_tree,
-> 1205 ) = hdbscan(clean_data, **kwargs)
   1207 if self.metric != "precomputed" and not self._all_finite:
   1208     # remap indices to align with original data in the case of non-finite entries.

File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:884, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    868             (single_linkage_tree, result_min_span_tree) = memory.cache(
    869                 _hdbscan_boruvka_balltree
    870             )(
   (...)
    880                 **kwargs
    881             )
    883 return (
--> 884     _tree_to_labels(
    885         X,
    886         single_linkage_tree,
    887         min_cluster_size,
    888         cluster_selection_method,
    889         allow_single_cluster,
    890         match_reference_implementation,
    891         cluster_selection_epsilon,
    892         max_cluster_size,
    893     )
    894     + (result_min_span_tree,)
    895 )

File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:80, in _tree_to_labels(X, single_linkage_tree, min_cluster_size, cluster_selection_method, allow_single_cluster, match_reference_implementation, cluster_selection_epsilon, max_cluster_size)
     79 stability_dict = compute_stability(condensed_tree)
---> 80 labels, probabilities, stabilities = get_clusters(
     81     condensed_tree,
     82     stability_dict,
     83     cluster_selection_method,
     84     allow_single_cluster,
     85     match_reference_implementation,
     86     cluster_selection_epsilon,
     87     max_cluster_size,
     88 )
     90 return (labels, probabilities, stabilities, condensed_tree, single_linkage_tree)

File hdbscan\\_hdbscan_tree.pyx:659, in hdbscan._hdbscan_tree.get_clusters()

File hdbscan\\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[4], line 7
      5 docs=df.text.tolist()
      6 docs
----> 7 topics, probabilities = model.fit_transform(docs)

File ~\anaconda3\Lib\site-packages\bertopic\_bertopic.py:389, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    386 umap_embeddings = self._reduce_dimensionality(embeddings, y)
    388 # Cluster reduced embeddings
--> 389 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
    391 # Sort and Map Topic IDs by their frequency
    392 if not self.nr_topics:

File ~\anaconda3\Lib\site-packages\bertopic\_bertopic.py:3220, in BERTopic._cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
   3218     self.hdbscan_model.fit(umap_embeddings, y=y)
   3219 except TypeError:
-> 3220     self.hdbscan_model.fit(umap_embeddings)
   3222 try:
   3223     labels = self.hdbscan_model.labels_

File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:1205, in HDBSCAN.fit(self, X, y)
   1195 kwargs.pop("prediction_data", None)
   1196 kwargs.update(self._metric_kwargs)
   1198 (
   1199     self.labels_,
   1200     self.probabilities_,
   1201     self.cluster_persistence_,
   1202     self._condensed_tree,
   1203     self._single_linkage_tree,
   1204     self._min_spanning_tree,
-> 1205 ) = hdbscan(clean_data, **kwargs)
   1207 if self.metric != "precomputed" and not self._all_finite:
   1208     # remap indices to align with original data in the case of non-finite entries.
   1209     self._condensed_tree = remap_condensed_tree(
   1210         self._condensed_tree, internal_to_raw, outliers
   1211     )

File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:884, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    867         else:
    868             (single_linkage_tree, result_min_span_tree) = memory.cache(
    869                 _hdbscan_boruvka_balltree
    870             )(
   (...)
    880                 **kwargs
    881             )
    883 return (
--> 884     _tree_to_labels(
    885         X,
    886         single_linkage_tree,
    887         min_cluster_size,
    888         cluster_selection_method,
    889         allow_single_cluster,
    890         match_reference_implementation,
    891         cluster_selection_epsilon,
    892         max_cluster_size,
    893     )
    894     + (result_min_span_tree,)
    895 )

File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:80, in _tree_to_labels(X, single_linkage_tree, min_cluster_size, cluster_selection_method, allow_single_cluster, match_reference_implementation, cluster_selection_epsilon, max_cluster_size)
     78 condensed_tree = condense_tree(single_linkage_tree, min_cluster_size)
     79 stability_dict = compute_stability(condensed_tree)
---> 80 labels, probabilities, stabilities = get_clusters(
     81     condensed_tree,
     82     stability_dict,
     83     cluster_selection_method,
     84     allow_single_cluster,
     85     match_reference_implementation,
     86     cluster_selection_epsilon,
     87     max_cluster_size,
     88 )
     90 return (labels, probabilities, stabilities, condensed_tree, single_linkage_tree)

File hdbscan\\_hdbscan_tree.pyx:659, in hdbscan._hdbscan_tree.get_clusters()

File hdbscan\\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

bray2016 avatar Aug 17 '23 20:08 bray2016

Hi @MaartenGr, the issue got resolved. It was caused by a failure at the Building wheels for collected packages: hdbscan step during installation. Found the solution in your other replies. Thank you!

Hi, how did you resolve it? I've been having the same issue and have now created multiple environments, but no luck.

firoznamaji avatar Aug 19 '23 21:08 firoznamaji

Hi @firoznamaji, I solved it by installing a C++ compiler. If you're on Windows, you might need the Microsoft Visual C++ Build Tools, which you can download from the official Microsoft website. This is a common requirement for building Python packages from source.


Rishi-Prakash-TS avatar Aug 21 '23 06:08 Rishi-Prakash-TS

@bray2016 Check if you are getting an error Building wheels for collected packages: hdbscan while installing BERTopic. If so, try following my previous reply about installing the C++ Build Tools.

Rishi-Prakash-TS avatar Aug 21 '23 06:08 Rishi-Prakash-TS

@bray2016 Check if you are getting an error Building wheels for collected packages: hdbscan while installing BERTopic. If so, try following my previous reply about installing the C++ Build Tools.

It's working now, but I'm not entirely sure what fixed it. I already had Microsoft Visual C++ installed, and I had been creating new environments, restarting my computer, etc. I assume one of the fixes here worked, so thank you! Sorry I can't point to one in particular.

bray2016 avatar Aug 21 '23 19:08 bray2016

This is a problem with hdbscan, not BERTopic, and can be worked around with this method: scikit-learn-contrib/hdbscan#600 (comment)

Thank you so much!

patrickstarhfg avatar Jan 30 '24 13:01 patrickstarhfg