BERTopic
BERTopic on AzureML?
Hello, has anyone successfully got BERTopic running on AzureML?
Environment: Azure ML, Python 3.8 (the azureml_py38 conda environment)
Having installed BERTopic (pip install BERTopic), I then ran the following starter code (from the BERTopic GitHub):
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After running for around 4 minutes, this gives the following error:

UFuncTypeError: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types <class 'numpy.dtype[float32]'> -> None
Any support would be gratefully received!
Regards, James
This is the full error stack:
2022-08-14 14:23:59,573 - BERTopic - Transformed documents to Embeddings
2022-08-14 14:24:11,356 - BERTopic - The dimensionality reduction algorithm did not contain the `y` parameter and therefore the `y` parameter was not used
UFuncTypeError Traceback (most recent call last)
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/bertopic/_bertopic.py:1391, in BERTopic._reduce_dimensionality(self, embeddings, y)
   1390 try:
-> 1391     self.umap_model.fit(embeddings, y=y)
   1392 except TypeError:

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/umap/umap_.py:2516, in UMAP.fit(self, X, y)
   2511 if self.knn_dists is None:
   2512     (
   2513         self._knn_indices,
   2514         self._knn_dists,
   2515         self._knn_search_index,
-> 2516     ) = nearest_neighbors(
   2517         X[index],
   2518         self._n_neighbors,
   2519         nn_metric,
   2520         self._metric_kwds,
   2521         self.angular_rp_forest,
   2522         random_state,
   2523         self.low_memory,
   2524         use_pynndescent=True,
   2525         n_jobs=self.n_jobs,
   2526         verbose=self.verbose,
   2527     )
   2528 else:

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/umap/umap_.py:342, in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, n_jobs, verbose)
    328 knn_search_index = NNDescent(
    329     X,
    330     n_neighbors=n_neighbors,
   (...)
    340     compressed=False,
    341 )
--> 342 knn_indices, knn_dists = knn_search_index.neighbor_graph
    344 if verbose:

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pynndescent/pynndescent_.py:1532, in NNDescent.neighbor_graph(self)
   1529 if self._distance_correction is not None:
   1530     result = (
   1531         self._neighbor_graph[0].copy(),
-> 1532         self._distance_correction(self._neighbor_graph[1]),
   1533     )
   1534 else:

UFuncTypeError: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types <class 'numpy.dtype[float32]'> -> None
During handling of the above exception, another exception occurred:
UFuncTypeError Traceback (most recent call last)
Input In [3], in
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/bertopic/_bertopic.py:306, in BERTopic.fit_transform(self, documents, embeddings, y)
    304 if self.seed_topic_list is not None and self.embedding_model is not None:
    305     y, embeddings = self._guided_topic_modeling(embeddings)
--> 306 umap_embeddings = self._reduce_dimensionality(embeddings, y)
    308 # Cluster reduced embeddings
    309 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/bertopic/_bertopic.py:1395, in BERTopic._reduce_dimensionality(self, embeddings, y)
   1392 except TypeError:
   1393     logger.info("The dimensionality reduction algorithm did not contain the `y` parameter and"
   1394                 " therefore the `y` parameter was not used")
-> 1395     self.umap_model.fit(embeddings)
   1397 umap_embeddings = self.umap_model.transform(embeddings)
   1398 logger.info("Reduced dimensionality")

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/umap/umap_.py:2516, in UMAP.fit(self, X, y)
   2510 nn_metric = self._input_distance_func
   2511 if self.knn_dists is None:
   2512     (
   2513         self._knn_indices,
   2514         self._knn_dists,
   2515         self._knn_search_index,
-> 2516     ) = nearest_neighbors(
   2517         X[index],
   2518         self._n_neighbors,
   2519         nn_metric,
   2520         self._metric_kwds,
   2521         self.angular_rp_forest,
   2522         random_state,
   2523         self.low_memory,
   2524         use_pynndescent=True,
   2525         n_jobs=self.n_jobs,
   2526         verbose=self.verbose,
   2527     )
   2528 else:
   2529     self._knn_indices = self.knn_indices

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/umap/umap_.py:342, in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, n_jobs, verbose)
    326 n_iters = max(5, int(round(np.log2(X.shape[0]))))
    328 knn_search_index = NNDescent(
    329     X,
    330     n_neighbors=n_neighbors,
   (...)
    340     compressed=False,
    341 )
--> 342 knn_indices, knn_dists = knn_search_index.neighbor_graph
    344 if verbose:
    345     print(ts(), "Finished Nearest Neighbor Search")

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pynndescent/pynndescent_.py:1532, in NNDescent.neighbor_graph(self)
   1528     return None
   1529 if self._distance_correction is not None:
   1530     result = (
   1531         self._neighbor_graph[0].copy(),
-> 1532         self._distance_correction(self._neighbor_graph[1]),
   1533     )
   1534 else:
   1535     result = (self._neighbor_graph[0].copy(), self._neighbor_graph[1].copy())

UFuncTypeError: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types <class 'numpy.dtype[float32]'> -> None
There might be an issue with the packages that were already installed in your environment. It might be worthwhile to run pip install --upgrade bertopic instead, so that its dependencies are brought up to their most recent versions as well. Moreover, you can find some solutions to your problem here that you can try out.
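If a stale pre-installed package is the suspect, it may help to first record which versions are actually present in the environment, both before and after upgrading. The snippet below is only a minimal sketch: the package list is my assumption about which dependencies appear in the traceback (umap, pynndescent, numpy) plus numba, which pynndescent builds on, and the helper name installed_versions is made up for illustration.

```python
from importlib import metadata

# Assumed list of packages worth checking; adjust to your environment.
PACKAGES = ["bertopic", "umap-learn", "pynndescent", "numba", "numpy"]


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output alongside the traceback makes it easier for others to tell whether the combination of versions matches one already reported as problematic.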
I'm not running on AzureML, but I had the same error, and the following fixes worked:
https://github.com/lmcinnes/pynndescent/issues/163#issuecomment-1016694682
https://github.com/lmcinnes/pynndescent/issues/163#issuecomment-1025082538
Due to inactivity, I'll be closing this for now. If you have any questions or want to continue the discussion, I'll make sure to re-open the issue!