Guided topic modelling np.average function not behaving as expected?
Issue:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
The error is raised during model fitting when seed words are provided for guided topic modelling.
Description
A ValueError was encountered when attempting to fit a topic model using BERTopic with the following configuration:
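from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# doc is a corpus of about 3K posts (df is a pandas DataFrame loaded earlier, not shown)
doc = df[df['post_text_clean'] != '']['post_text_clean'].tolist()

# Pre-compute the document embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = sentence_model.encode(doc, show_progress_bar=True)

# Seed words for guided topic modelling, one list per seed topic
seed_topic_list = [
    ['surgery', 'bottom', 'top', 'hormone'],
    ['accept', 'acceptance', 'strength', 'needs'],
    ['connectedness', 'support', 'activism', 'mentor'],
    ['stopped', 'cancelled', 'pass', 'confusion'],
    ['peers', 'family', 'friends', 'group'],
    ['anxiety', 'depression', 'dissociation', 'anorexia'],
    ['dysphoria', 'familial', 'stress', 'health'],
    ['impulsive', 'introverted', 'sensitivity', 'shame'],
    ['violence', 'rejection', 'victimization', 'affirmation'],
    ['ideation', 'attempt', 'risk', 'prevention']
]

vectorizer_model = CountVectorizer(stop_words='english')
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

main_representation_model = KeyBERTInspired()
aspect_representation_model1 = PartOfSpeech("en_core_web_sm")
aspect_representation_model2 = [KeyBERTInspired(top_n_words=30),
                                MaximalMarginalRelevance(diversity=.5)]

# Assumption: the snippet as posted never defined representation_model;
# this dict (and its aspect key names) is a guess at what was intended.
representation_model = {
    "Main": main_representation_model,
    "Aspect1": aspect_representation_model1,
    "Aspect2": aspect_representation_model2,
}

topic_model = BERTopic(embedding_model=sentence_model,
                       calculate_probabilities=True,
                       vectorizer_model=vectorizer_model,
                       ctfidf_model=ctfidf_model,
                       representation_model=representation_model,
                       seed_topic_list=seed_topic_list
                       )

topic, probs = topic_model.fit_transform(doc, embedding)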
The error occurs when calling the fit_transform method on a BERTopic instance with a set of documents and their embeddings.
Probably the internal call to np.average is not behaving as expected?
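The failing line uses np.average to compute a weighted average of the document embeddings and the seed topic embedding. Judging by the traceback, the list passed to np.average holds two arrays of different shapes (the block of document embeddings assigned to one seed topic, and the single embedding of that seed topic), so NumPy cannot convert it into one array and raises the ValueError about an inhomogeneous shape. The intended behaviour is to update the document embeddings with a weighted influence from the corresponding seed topic embedding.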
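To illustrate the shape mismatch outside of BERTopic, here is a minimal sketch. The arrays doc_embeddings and seed_embedding are made-up stand-ins for embeddings[indices] and seed_topic_embeddings[seed_topic] from the traceback, and blend_with_seed is a hypothetical helper, not BERTopic's actual fix. It broadcasts the seed embedding to the shape of the document block so that np.average receives two arrays of equal shape.

```python
import numpy as np

# Made-up stand-ins for the arrays in the traceback (assumed shapes).
doc_embeddings = np.ones((5, 4))   # like embeddings[indices]: (n_docs, dim)
seed_embedding = np.zeros(4)       # like seed_topic_embeddings[seed_topic]: (dim,)

# Same call pattern as in the traceback. On recent NumPy versions the list
# cannot be converted to a single array, so a ValueError is expected here.
try:
    np.average([doc_embeddings, seed_embedding], weights=[3, 1])
except ValueError as err:
    print("np.average failed:", err)


def blend_with_seed(docs, seed, doc_weight=3.0, seed_weight=1.0):
    """Hypothetical helper: weighted average of each document row with the seed."""
    # Repeat the seed vector (as a view) so both inputs share the shape (n_docs, dim).
    seed_rows = np.broadcast_to(seed, docs.shape)
    # Average over the new leading axis of length 2, one weight per input array.
    return np.average([docs, seed_rows], axis=0, weights=[doc_weight, seed_weight])


blended = blend_with_seed(doc_embeddings, seed_embedding)
```

With the 3:1 weights used in the traceback, each row of blended is three quarters document embedding and one quarter seed embedding.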
Steps to Reproduce
- Installed numpy version: 1.25.0
- Initialize BERTopic model with guided modelling approach.
- Prepare a dataset of documents and their corresponding embeddings.
- Call the fit_transform method on the BERTopic model.
Error Traceback
ValueError Traceback (most recent call last)
Cell In[7], line 104
102 # Topic Model Fitting
103 print("Topic model fitting..")
--> 104 topic, probs = topic_model.fit_transform(doc, embedding)
106 # Save Model State Checkpoint
107 print("Saving model embeddings checkpoint..")
File c:\Users\georg\anaconda3\Lib\site-packages\bertopic\_bertopic.py:399, in BERTopic.fit_transform(self, documents, embeddings, images, y)
397 # Guided Topic Modeling
398 if self.seed_topic_list is not None and self.embedding_model is not None:
--> 399 y, embeddings = self._guided_topic_modeling(embeddings)
401 # Zero-shot Topic Modeling
402 if self._is_zeroshot():
File c:\Users\georg\anaconda3\Lib\site-packages\bertopic\_bertopic.py:3617, in BERTopic._guided_topic_modeling(self, embeddings)
3615 for seed_topic in range(len(seed_topic_list)):
3616 indices = [index for index, topic in enumerate(y) if topic == seed_topic]
-> 3617 embeddings[indices] = np.average([embeddings[indices], seed_topic_embeddings[seed_topic]], weights=[3, 1])
3618 logger.info("Guided - Completed \u2713")
File c:\Users\georg\anaconda3\Lib\site-packages\numpy\lib\function_base.py:511, in average(a, axis, weights, returned, keepdims)
398 @array_function_dispatch(_average_dispatcher)
399 def average(a, axis=None, weights=None, returned=False, *,
400 keepdims=np._NoValue):
401 """
402 Compute the weighted average along the specified axis.
403
(...)
509 [4.5]])
510 """
--> 511 a = np.asanyarray(a)
513 if keepdims is np._NoValue:
514 # Don't pass on the keepdims argument if one wasn't given.
515 keepdims_kw = {}
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
Thanks for sharing the extensive description of your issue. I believe this is a known issue, and the fix seems to be to lower the numpy version. Could you check the link I shared for specifics?
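As a quick way to check whether an installed numpy is affected, the sketch below compares the version string against 1.24. That is the release in which, to my knowledge, building an array from a ragged list became a hard ValueError instead of a deprecation warning; the linked issue may name a different cutoff. The helper ragged_lists_raise is a hypothetical name for illustration.

```python
import numpy as np


def ragged_lists_raise(version: str) -> bool:
    """Return True if this numpy version is assumed to reject ragged lists."""
    # Only major and minor matter here; suffixes such as "rc1" sit in later parts.
    major, minor = (int(part) for part in version.split(".")[:2])
    # Assumption: 1.24 is the first release that raises instead of warning.
    return (major, minor) >= (1, 24)


print(np.__version__, "affected:", ragged_lists_raise(np.__version__))
```

For the numpy 1.25.0 reported above, the check returns True, which is consistent with the error in the traceback.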