BERTopic
TypeError: 'numpy.float64' object cannot be interpreted as an integer
Hey! I had the problem mentioned in this thread, and after the update it was solved, but now another one has appeared. I get this error:
```
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
   3217 try:
-> 3218     self.hdbscan_model.fit(umap_embeddings, y=y)
   3219 except TypeError:

9 frames

hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.condense_tree()
hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.condense_tree()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/hdbscan/hdbscan_.py in _tree_to_labels(X, single_linkage_tree, min_cluster_size, cluster_selection_method, allow_single_cluster, match_reference_implementation, cluster_selection_epsilon, max_cluster_size)
     76     set of labels and probabilities.
     77     """
---> 78     condensed_tree = condense_tree(single_linkage_tree, min_cluster_size)
     79     stability_dict = compute_stability(condensed_tree)
     80     labels, probabilities, stabilities = get_clusters(

hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.condense_tree()
hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.condense_tree()

TypeError: 'numpy.float64' object cannot be interpreted as an integer
```
I checked the data I put into the model and there is nothing of that type in it. What's more, I re-ran it on the file I was working on yesterday, which processed successfully back then, and suddenly I am getting this error on it as well. Any ideas?
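For context on the message itself: it is standard NumPy/Python behavior. Anything that requires an exact Python integer rejects a `numpy.float64`, which is why the failure surfaces inside hdbscan's compiled code rather than in your input documents. A minimal, illustrative reproduction (not from this thread):

```
import numpy as np

# numpy.float64 is rejected wherever Python demands an exact integer;
# this reproduces the exact message from the hdbscan traceback above.
try:
    range(np.float64(5))
except TypeError as e:
    print(e)  # 'numpy.float64' object cannot be interpreted as an integer
```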
This is a problem with hdbscan, not BERTopic, and can be worked around with this method: https://github.com/scikit-learn-contrib/hdbscan/issues/600#issuecomment-1638837464
```
!pip install git+https://github.com/scikit-learn-contrib/hdbscan.git
!pip install BERTopic
```
Did you solve it? Even after running `!pip install git+https://github.com/scikit-learn-contrib/hdbscan.git` and `!pip install BERTopic`, it is still not working for me.
You need to make sure to start from a completely fresh environment or uninstall BERTopic first.
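For example, in a Colab notebook that already has the packages installed, an uninstall-then-reinstall sequence like the following sketch (a suggestion, not from the thread) does that; restart the runtime afterwards so the fresh builds are actually imported:

```
!pip uninstall -y bertopic hdbscan
!pip install git+https://github.com/scikit-learn-contrib/hdbscan.git
!pip install bertopic
```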
I want to say THANKS!!! to you! That solved it!
This is a problem with hdbscan, not BERTopic, and can be worked around with this method: scikit-learn-contrib/hdbscan#600 (comment)
```
!pip install git+https://github.com/scikit-learn-contrib/hdbscan.git
!pip install BERTopic
```
Thanks too!
Mine still didn't work, even after running those two `!pip` installs and opening a new notebook in Colab.
I got the error mainly on the line below:
```
topic_model = BERTopic().fit(corpus, corpus_embeddings)
```
Error: TypeError: 'numpy.float64' object cannot be interpreted as an integer
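One thing worth checking in that situation (a suggestion, not from the thread) is whether the notebook is actually importing the rebuilt hdbscan; a stale kernel keeps the old binary loaded:

```
from importlib.metadata import version

# If these still report the old versions after reinstalling,
# restart the runtime/kernel before retrying the fit.
print(version("hdbscan"))
print(version("bertopic"))
```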
@ssaee79 The following is working for me in a fresh Google Colab:
First, you install BERTopic as follows:
```
!pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git
!pip install --upgrade BERTopic
```
Then, you restart the runtime to make sure that imports are refreshed.
Finally, the following code is working for me:
```
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Train our topic model using our pre-trained sentence-transformers embeddings
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```
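As a side note, BERTopic also accepts an explicitly configured clustering model via its `hdbscan_model` parameter, which makes it easier to see exactly which hdbscan settings are in play when debugging errors like this one. A sketch with illustrative values, not a recommendation from the thread:

```
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Plain Python ints for the size parameters; the values are illustrative only.
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=5,
                        metric="euclidean", prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)
```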
It works for me :) Thank you so much!
There is a new release of hdbscan on PyPI that will hopefully fix this now.
👌 I can confirm that, thanks to the new release of hdbscan v0.8.33, both the BERTopic installation and execution work fine.
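With that release out, installing hdbscan from git should no longer be necessary; pinning a floor version (my phrasing, not from the thread) is enough:

```
!pip install --upgrade "hdbscan>=0.8.33" bertopic
```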
Hey! I am still getting the same error. The same code works in Colab, but when I try it in Jupyter it does not, even though I have tried installing `git+https://github.com/scikit-learn-contrib/hdbscan.git`.
I am new to this, so do let me know if I might be doing something wrong. Thanks!
@Rishi-Prakash-TS Did you try a completely new environment? It often helps to start fresh and then install the packages.
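For a local Jupyter setup, a fresh environment might look like the following sketch (the environment name is arbitrary):

```
conda create -n bertopic-env python=3.10
conda activate bertopic-env
pip install bertopic ipykernel
python -m ipykernel install --user --name bertopic-env
```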
Hi @MaartenGr, the issue got resolved. It was mainly due to the error `Building wheels for collected packages: hdbscan`. I found the solution in your other replies. Thank you!
I am experiencing the same issue now. I have tried the solutions listed here, but none have worked so far. I am using a Jupyter notebook, for reference. Here's the full output:
```
2023-08-17 15:23:13,260 - BERTopic - Transformed documents to Embeddings
2023-08-17 15:23:35,922 - BERTopic - Reduced dimensionality
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~\anaconda3\Lib\site-packages\bertopic\_bertopic.py:3218, in BERTopic._cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
3217 try:
-> 3218 self.hdbscan_model.fit(umap_embeddings, y=y)
3219 except TypeError:
File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:1205, in HDBSCAN.fit(self, X, y)
1196 kwargs.update(self._metric_kwargs)
1198 (
1199 self.labels_,
1200 self.probabilities_,
1201 self.cluster_persistence_,
1202 self._condensed_tree,
1203 self._single_linkage_tree,
1204 self._min_spanning_tree,
-> 1205 ) = hdbscan(clean_data, **kwargs)
1207 if self.metric != "precomputed" and not self._all_finite:
1208 # remap indices to align with original data in the case of non-finite entries.
File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:884, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
868 (single_linkage_tree, result_min_span_tree) = memory.cache(
869 _hdbscan_boruvka_balltree
870 )(
(...)
880 **kwargs
881 )
883 return (
--> 884 _tree_to_labels(
885 X,
886 single_linkage_tree,
887 min_cluster_size,
888 cluster_selection_method,
889 allow_single_cluster,
890 match_reference_implementation,
891 cluster_selection_epsilon,
892 max_cluster_size,
893 )
894 + (result_min_span_tree,)
895 )
File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:80, in _tree_to_labels(X, single_linkage_tree, min_cluster_size, cluster_selection_method, allow_single_cluster, match_reference_implementation, cluster_selection_epsilon, max_cluster_size)
79 stability_dict = compute_stability(condensed_tree)
---> 80 labels, probabilities, stabilities = get_clusters(
81 condensed_tree,
82 stability_dict,
83 cluster_selection_method,
84 allow_single_cluster,
85 match_reference_implementation,
86 cluster_selection_epsilon,
87 max_cluster_size,
88 )
90 return (labels, probabilities, stabilities, condensed_tree, single_linkage_tree)
File hdbscan\_hdbscan_tree.pyx:659, in hdbscan._hdbscan_tree.get_clusters()
File hdbscan\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()
TypeError: 'numpy.float64' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
Cell In[4], line 7
5 docs=df.text.tolist()
6 docs
----> 7 topics, probabilities = model.fit_transform(docs)
File ~\anaconda3\Lib\site-packages\bertopic\_bertopic.py:389, in BERTopic.fit_transform(self, documents, embeddings, images, y)
386 umap_embeddings = self._reduce_dimensionality(embeddings, y)
388 # Cluster reduced embeddings
--> 389 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
391 # Sort and Map Topic IDs by their frequency
392 if not self.nr_topics:
File ~\anaconda3\Lib\site-packages\bertopic\_bertopic.py:3220, in BERTopic._cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
3218 self.hdbscan_model.fit(umap_embeddings, y=y)
3219 except TypeError:
-> 3220 self.hdbscan_model.fit(umap_embeddings)
3222 try:
3223 labels = self.hdbscan_model.labels_
File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:1205, in HDBSCAN.fit(self, X, y)
1195 kwargs.pop("prediction_data", None)
1196 kwargs.update(self._metric_kwargs)
1198 (
1199 self.labels_,
1200 self.probabilities_,
1201 self.cluster_persistence_,
1202 self._condensed_tree,
1203 self._single_linkage_tree,
1204 self._min_spanning_tree,
-> 1205 ) = hdbscan(clean_data, **kwargs)
1207 if self.metric != "precomputed" and not self._all_finite:
1208 # remap indices to align with original data in the case of non-finite entries.
1209 self._condensed_tree = remap_condensed_tree(
1210 self._condensed_tree, internal_to_raw, outliers
1211 )
File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:884, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
867 else:
868 (single_linkage_tree, result_min_span_tree) = memory.cache(
869 _hdbscan_boruvka_balltree
870 )(
(...)
880 **kwargs
881 )
883 return (
--> 884 _tree_to_labels(
885 X,
886 single_linkage_tree,
887 min_cluster_size,
888 cluster_selection_method,
889 allow_single_cluster,
890 match_reference_implementation,
891 cluster_selection_epsilon,
892 max_cluster_size,
893 )
894 + (result_min_span_tree,)
895 )
File ~\anaconda3\Lib\site-packages\hdbscan\hdbscan_.py:80, in _tree_to_labels(X, single_linkage_tree, min_cluster_size, cluster_selection_method, allow_single_cluster, match_reference_implementation, cluster_selection_epsilon, max_cluster_size)
78 condensed_tree = condense_tree(single_linkage_tree, min_cluster_size)
79 stability_dict = compute_stability(condensed_tree)
---> 80 labels, probabilities, stabilities = get_clusters(
81 condensed_tree,
82 stability_dict,
83 cluster_selection_method,
84 allow_single_cluster,
85 match_reference_implementation,
86 cluster_selection_epsilon,
87 max_cluster_size,
88 )
90 return (labels, probabilities, stabilities, condensed_tree, single_linkage_tree)
File hdbscan\_hdbscan_tree.pyx:659, in hdbscan._hdbscan_tree.get_clusters()
File hdbscan\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()
TypeError: 'numpy.float64' object cannot be interpreted as an integer
```
Hi, how did you resolve it? I've been having the same issue and have now created multiple environments, but no luck.
Hi @firoznamaji, I solved it by installing a C++ compiler. If you're on Windows, you might need the Microsoft Visual C++ Build Tools, which you can download from the official Microsoft website. This is a common requirement for building Python packages from source.
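If a prebuilt wheel exists for your Python version, an alternative is to tell pip to refuse source builds entirely, which sidesteps the compiler requirement (a general pip option, not something from this thread):

```
pip install --only-binary :all: hdbscan
```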
@bray2016 Check if you are getting the error `Building wheels for collected packages: hdbscan` while installing BERTopic. If so, try following my previous reply about installing the C++ build tools.
It's working now, but I'm not entirely sure what fixed it. I already had Microsoft Visual C++ installed, and I had been creating new environments, restarting my computer, etc. I assume one of the fixes here worked, so thank you! Sorry I can't point to one in particular.
This is a problem with hdbscan, not BERTopic, and can be worked around with this method: scikit-learn-contrib/hdbscan#600 (comment)
Thank you so much!