
representation_model: 'NoneType' object is not iterable

Open muehlhausen opened this issue 1 year ago • 4 comments

Hey there!

First of all: thank you for developing BERTopic, it's neat! However, I'm encountering an issue with representation_model when trying to rename my cluster representations. Everything works fine as long as I use only an embedding_model, but as soon as I add a representation_model I consistently get the same error.

Here is some sample code, inspired by this documentation.

# Import the necessary libraries
from bertopic import BERTopic
import pandas as pd
from transformers import pipeline
from bertopic.representation import TextGeneration

# prompt = f"I have a topic described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about?"

# Create your representation model
generator = pipeline('text2text-generation', model='google/flan-t5-base')
representation_model = TextGeneration(generator)

# 4. Get some sample data
data = pd.read_excel('testdata.xlsx')

# 5. Initialize BERTopic with the representation model
topic_model = BERTopic(
    embedding_model='paraphrase-multilingual-mpnet-base-v2',
    representation_model=representation_model  # if commented out, the code works
)

# 6. Fit BERTopic to the sample texts
topics, _ = topic_model.fit_transform(data['text'])

# 7. Get the topic information
topic_info = topic_model.get_topic_info()

# 8. Print the topic information
print(topic_info)

The error I get is:

TypeError                                 Traceback (most recent call last)
Cell In[3], line 26
     20 topic_model = BERTopic(
     21     embedding_model= 'paraphrase-multilingual-mpnet-base-v2',
     22     representation_model = representation_model
     23 )
     25 # 6. Fit BERTopic to the sample texts
---> 26 topics, _ = topic_model.fit_transform(data['Absatz'])
     28 # 6. Get the topic information
     29 topic_info = topic_model.get_topic_info()

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:433, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    430     self._save_representative_docs(custom_documents)
    431 else:
    432     # Extract topics by calculating c-TF-IDF
--> 433     self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
    435     # Reduce topics
    436     if self.nr_topics:

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3637, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3635 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
   3636 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
-> 3637 self.topic_representations_ = self._extract_words_per_topic(words, documents)
   3638 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3639 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
   3640                       for key, values in
   3641                       self.topic_representations_.items()}

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3922, in BERTopic._extract_words_per_topic(self, words, documents, c_tf_idf, calculate_aspects)
   3920         topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
   3921 elif isinstance(self.representation_model, BaseRepresentation):
-> 3922     topics = self.representation_model.extract_topics(self, documents, c_tf_idf, topics)
   3923 elif isinstance(self.representation_model, dict):
   3924     if self.representation_model.get("Main"):

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/representation/_textgeneration.py:147, in TextGeneration.extract_topics(self, topic_model, documents, c_tf_idf, topics)
    143 updated_topics = {}
    144 for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
    145 
    146     # Prepare prompt
--> 147     truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
    148     prompt = self._create_prompt(truncated_docs, topic, topics)
    149     self.prompts_.append(prompt)

TypeError: 'NoneType' object is not iterable

I'm running this on an M1 Mac, if that helps. Any help is appreciated. I also tried copying all the code from the best practices guide and got the same error.

Best regards! Alex Mühlhausen

muehlhausen avatar Jan 16 '24 14:01 muehlhausen

In all honesty, I'm not sure what is happening here. I believe there is another open issue with the same problem, but it might just be related to the underlying T5 model. Also, have you tried passing the documents as a list of strings instead of a pandas Series?
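For reference, converting the Series to a plain list is a one-line change (the DataFrame here is a hypothetical stand-in for the Excel file from the snippet above):

```python
import pandas as pd

# Hypothetical stand-in for the data loaded from testdata.xlsx
data = pd.DataFrame({"text": ["first document", "second document"]})

# Convert the pandas Series to a plain list of strings
docs = data["text"].tolist()

print(type(docs).__name__)  # list
# then, as in the original snippet:
# topics, _ = topic_model.fit_transform(docs)
```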

MaartenGr avatar Jan 16 '24 17:01 MaartenGr

I'm facing the same issue, but only with the TextGeneration representation model; I can generate other representation models without a problem. I did try passing the documents as a list of strings, but the error persists.

I have the same code running successfully on v0.15.0

Edit: I did some digging and found that the problem is in this line. It seems that whenever the default prompt is used, the top representative documents will be None.

A simple fix would be to have the else condition on line 141 assign an empty list as the default value. I opened a PR with this change.
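To illustrate the failure mode and the proposed default, here is a standalone sketch (not BERTopic's actual source): iterating over the representative documents fails when they are None, while an empty-list default makes the truncation step a harmless no-op.

```python
# Sketch of the pre-fix behavior: docs may be None with the default prompt
def truncate_docs(docs, max_len=100):
    return [doc[:max_len] for doc in docs]

try:
    truncate_docs(None)
except TypeError as err:
    print(err)  # 'NoneType' object is not iterable

# Proposed fix: default to an empty list instead of None,
# so the comprehension simply yields no truncated documents
def truncate_docs_fixed(docs, max_len=100):
    docs = docs if docs is not None else []
    return [doc[:max_len] for doc in docs]

print(truncate_docs_fixed(None))  # []
```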

leoschet avatar Jan 16 '24 21:01 leoschet

Thanks for the PR. I just merged #1726 which should fix the issue. Could one of you test it out so I know it also works for others?

MaartenGr avatar Jan 17 '24 05:01 MaartenGr

Thanks for the update! I tested it, and it runs without any errors on my end.

leoschet avatar Jan 17 '24 09:01 leoschet