BERTopic
representation_model: 'NoneType' object is not iterable
Hey there!
First of all: thank you for developing BERTopic, it's neat! However, I am running into an issue with representation_model when trying to rename my cluster representations. Everything works fine as long as I use only an embedding_model, but as soon as I add a representation_model I consistently get the same error.
Here is some sample code, inspired by this documentation.
# Import the necessary libraries
from bertopic import BERTopic
import pandas as pd
from transformers import pipeline
from bertopic.representation import TextGeneration
# prompt = f"I have a topic described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about?"
# Create your representation model
generator = pipeline('text2text-generation', model='google/flan-t5-base')
representation_model = TextGeneration(generator)
# 4. Get some sample data
data = pd.read_excel('testdata.xlsx')
# 5. Initialize BERTopic with the representation model
topic_model = BERTopic(
    embedding_model='paraphrase-multilingual-mpnet-base-v2',
    representation_model=representation_model,  # if commented out, the code works
)
# 6. Fit BERTopic to the sample texts
topics, _ = topic_model.fit_transform(data['text'])
# 7. Get the topic information
topic_info = topic_model.get_topic_info()
# 8. Print the topic information
print(topic_info)
The error I get is:
TypeError Traceback (most recent call last)
Cell In[3], line 26
20 topic_model = BERTopic(
21 embedding_model= 'paraphrase-multilingual-mpnet-base-v2',
22 representation_model = representation_model
23 )
25 # 6. Fit BERTopic to the sample texts
---> 26 topics, _ = topic_model.fit_transform(data['Absatz'])
28 # 6. Get the topic information
29 topic_info = topic_model.get_topic_info()
File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:433, in BERTopic.fit_transform(self, documents, embeddings, images, y)
430 self._save_representative_docs(custom_documents)
431 else:
432 # Extract topics by calculating c-TF-IDF
--> 433 self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
435 # Reduce topics
436 if self.nr_topics:
File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3637, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
3635 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
3636 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
-> 3637 self.topic_representations_ = self._extract_words_per_topic(words, documents)
3638 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
3639 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
3640 for key, values in
3641 self.topic_representations_.items()}
File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3922, in BERTopic._extract_words_per_topic(self, words, documents, c_tf_idf, calculate_aspects)
3920 topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
3921 elif isinstance(self.representation_model, BaseRepresentation):
-> 3922 topics = self.representation_model.extract_topics(self, documents, c_tf_idf, topics)
3923 elif isinstance(self.representation_model, dict):
3924 if self.representation_model.get("Main"):
File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/representation/_textgeneration.py:147, in TextGeneration.extract_topics(self, topic_model, documents, c_tf_idf, topics)
143 updated_topics = {}
144 for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
145
146 # Prepare prompt
--> 147 truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
148 prompt = self._create_prompt(truncated_docs, topic, topics)
149 self.prompts_.append(prompt)
TypeError: 'NoneType' object is not iterable
I am running this on an M1 Mac, if that helps. I also tried copying all the code from the best practices guide and got the same error. Any help is appreciated.
Best regards! Alex Mühlhausen
In all honesty, I'm not sure what is happening here. I believe there is another open issue with the same problem, but it might just be related to the underlying T5 model. Also, have you tried passing the documents as a list of strings instead of a pandas Series?
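For anyone who wants to try that suggestion, here is a minimal sketch of converting a column into a plain list of strings before fitting. The helper name `to_document_list` is made up for illustration and is not part of BERTopic or pandas; it only assumes that iterating over the column yields the cell values, which is how a pandas Series behaves.

```python
import math
from typing import Iterable, List


def to_document_list(values: Iterable) -> List[str]:
    """Return the values as a plain list of strings, skipping missing entries.

    Hypothetical helper for illustration only; not a BERTopic or pandas API.
    """
    docs = []
    for value in values:
        # Skip None and float NaN, which is how empty spreadsheet cells
        # usually show up after loading.
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        docs.append(str(value))
    return docs


# Intended use with the code from the first post (not executed here):
# docs = to_document_list(data['text'])
# topics, _ = topic_model.fit_transform(docs)
```

This rules out the input type as the cause; it does not address the None value reported in the traceback.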
I'm facing the same issue, but only with the TextGeneration representation model. I can use other representation models without any problem. I did try passing the documents as a list of strings, but the error persists.
I have the same code running successfully on v0.15.0
Edit: I did some digging and found that the problem is in this line. It seems that whenever the default prompt is used, the top representative documents are None, so the list comprehension shown in the traceback tries to iterate over None.
A simple fix would be to have the else branch at line 141 assign an empty list as the default value instead. I opened a PR with this change.
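To make the failure mode concrete, here is a small self-contained sketch of the pattern described above. It is a simplified stand-in, not BERTopic's actual code: the function names, the `prompt_uses_documents` flag, and the `fixed` switch are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional


def map_topics_to_docs(
    topics: Iterable[int],
    prompt_uses_documents: bool,
    representative_docs: Dict[int, List[str]],
    fixed: bool = True,
) -> Dict[int, Optional[List[str]]]:
    """Hypothetical stand-in for the branch that builds the topic-to-docs mapping."""
    if prompt_uses_documents:
        return {topic: representative_docs.get(topic, []) for topic in topics}
    # Else branch: with fixed=False every topic maps to None (the reported
    # behaviour); with fixed=True every topic maps to its own empty list.
    return {topic: ([] if fixed else None) for topic in topics}


def truncate_docs(docs: Optional[List[str]], max_chars: int = 20) -> List[str]:
    """Mimics the failing list comprehension: iterating over None raises TypeError."""
    return [doc[:max_chars] for doc in docs]


if __name__ == "__main__":
    broken = map_topics_to_docs([0, 1], False, {}, fixed=False)
    try:
        truncate_docs(broken[0])
    except TypeError as err:
        print("without the fix:", err)

    patched = map_topics_to_docs([0, 1], False, {}, fixed=True)
    print("with the fix:", truncate_docs(patched[0]))
```

With an empty list as the default, the comprehension simply produces no truncated documents, and a prompt built only from keywords can proceed.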
Thanks for the PR. I just merged #1726 which should fix the issue. Could one of you test it out so I know it also works for others?
Thanks for the update! I tested it, and it runs without any errors on my end.