`IndexError: list index out of range` when using zeroshot_topic_list in 0.16.1

Open andiwinata opened this issue 1 year ago • 19 comments

Hi, I recently re-ran a notebook that uses zeroshot_topic_list and got `IndexError: list index out of range`. I fixed this by downgrading to 0.16.0.

Full stacktrace:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[2], line 18
      9 vectorizer_model = CountVectorizer(stop_words="english")
     11 topic_model = BERTopic(
     12     min_topic_size=20,
     13     zeroshot_topic_list=zeroshot_topic_list,
     14     zeroshot_min_similarity=.25,
     15     vectorizer_model=vectorizer_model
     16 )
---> 18 topics, probs = topic_model.fit_transform(docs)
     19 topic_model.get_topic_info()

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    446 # Combine Zero-shot with outliers
    447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448     predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
    450 return predictions, self.probabilities_

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:3682, in BERTopic._combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
   3680 cluster_indices = list(documents.Old_ID.values)
   3681 cluster_names = list(merged_model.topic_labels_.values())[len(set(y)):]
-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
   3684 df = pd.DataFrame({
   3685     "Indices": zeroshot_indices + cluster_indices,
   3686     "Label": zeroshot_topics + cluster_topics}
   3687 ).sort_values("Indices")
   3688 reverse_topic_labels = dict((v, k) for k, v in merged_model.topic_labels_.items())

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:3682, in <listcomp>(.0)
   3680 cluster_indices = list(documents.Old_ID.values)
   3681 cluster_names = list(merged_model.topic_labels_.values())[len(set(y)):]
-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
   3684 df = pd.DataFrame({
   3685     "Indices": zeroshot_indices + cluster_indices,
   3686     "Label": zeroshot_topics + cluster_topics}
   3687 ).sort_values("Indices")
   3688 reverse_topic_labels = dict((v, k) for k, v in merged_model.topic_labels_.items())

andiwinata avatar Apr 26 '24 08:04 andiwinata

Hmmm, this is surprising. Could you share your full code? That will make it easier to understand what is happening here. Also, I'm not seeing the actual error in your log. Does that mean that the error indeed happens at this line?

-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]

MaartenGr avatar Apr 26 '24 08:04 MaartenGr

I have the same error.

Bougeant avatar Apr 26 '24 13:04 Bougeant

@Bougeant Could you also share your code and error log? That would help me understand what is happening here.

MaartenGr avatar Apr 26 '24 13:04 MaartenGr

Sure! Here goes:

pip install bertopic==0.16.1 datasets

import logging
import pandas as pd
import spacy
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from bertopic.representation import PartOfSpeech
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

spacy.cli.download("en_core_web_md")

data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({"text": data['data'], "target": data['target']})
df = df.drop_duplicates(subset=["text"]).reset_index(drop=True)
classes = {i: data["target_names"][i] for i in range(len(data["target_names"]))}
df["target"] = df["target"].map(classes)

model_params = {
    "embedding_model": SentenceTransformer("all-MiniLM-L6-v2"),
    "calculate_probabilities": True,
    "representation_model": PartOfSpeech(model="en_core_web_md", top_n_words=20, pos_patterns=[[{"POS": "NOUN"}]]),
    "min_topic_size": 100,
    "nr_topics": 20,
    "zeroshot_topic_list": ["baseball", "hockey", "space", "medecine", "encryption", "middle-east politics", "cars", "motorcycle", "electronics", "computers", "religion"],
    "zeroshot_min_similarity": 0.5
}

topic_model = BERTopic(**model_params)
embeddings = topic_model.embedding_model.encode(df["text"], show_progress_bar=True)
topic_model.fit(df["text"].to_list(), embeddings)

This is the error I get:

cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
--> IndexError: list index out of range

It seems that the error comes from the fact that cluster_names does not include the outlier cluster, so adding self._outliers pushes the last index out of range (we try to read index 14 of a 14-element list):

cluster_names = ['0_game_team_year_games', '1_health_patients_doctor_treatment', '2_car_bike_one_engine', '3_use_windows_one_system', '4_people_one_children_up', '5_people_arabs_one_peace', '6_health_mail_list_newsgroup', '7_space_launch_earth_orbit', '8_key_clipper_chip_encryption', '9_gay_people_sex_men', '10_post_people_one_flame', '11_one_will_people_christian', '12_fire_compound_children_people', '13_gun_guns_firearms_people']
topic = 13
self._outliers = 1
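
The arithmetic above can be reproduced without BERTopic. The following is a minimal sketch that uses only the values quoted in this comment; the `lookup_labels` helper and the sample `topics` list are hypothetical illustrations, not BERTopic code, and this is not the fix that was later merged.

```python
# Values copied from the comment above: 14 labels for topics 0..13, with no
# entry for the outlier topic -1.
cluster_names = [
    "0_game_team_year_games",
    "1_health_patients_doctor_treatment",
    "2_car_bike_one_engine",
    "3_use_windows_one_system",
    "4_people_one_children_up",
    "5_people_arabs_one_peace",
    "6_health_mail_list_newsgroup",
    "7_space_launch_earth_orbit",
    "8_key_clipper_chip_encryption",
    "9_gay_people_sex_men",
    "10_post_people_one_flame",
    "11_one_will_people_christian",
    "12_fire_compound_children_people",
    "13_gun_guns_firearms_people",
]
outliers = 1  # stands in for self._outliers


def lookup_labels(topics, names, offset):
    """Mimic the failing list comprehension: names[topic + offset]."""
    return [names[topic + offset] for topic in topics]


# Hypothetical topic assignments; 13 is the largest topic id reported above.
topics = [-1, 0, 13]

try:
    lookup_labels(topics, cluster_names, outliers)
    error = None
except IndexError as exc:
    error = str(exc)

print(len(cluster_names), max(topics) + outliers, error)
```

With these values the offset maps topic -1 to the label of topic 0, and topic 13 to index 14, which is one past the end of the 14-element list.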

Bougeant avatar Apr 26 '24 15:04 Bougeant

Hi,

I am having the same issue (zero shot topic modelling crashes at the exact same line).

The code:

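from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer

# `labels` (the zero-shot topic list) and `articles` (which has an "abstract"
# column) are defined elsewhere and not shown here.
representation_model = KeyBERTInspired()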
vectorizer_model = CountVectorizer(
    ngram_range=(1, 2), stop_words="english", min_df=30
)
embedding_model = "all-MiniLM-L6-v2"
topic_model = BERTopic(
    verbose=True,
    embedding_model=embedding_model,
    min_topic_size=50,
    calculate_probabilities=True,
    low_memory=True,
    representation_model=representation_model,
    zeroshot_topic_list=labels,
    zeroshot_min_similarity=0.5,
    language="english",
    n_gram_range=(1, 2),
)
topics, probs = topic_model.fit_transform(articles["abstract"].tolist())

I have printed out the following variables before the crash:

len(cluster_names): 78
np.max(documents.Topic.values): 77
np.min(documents.Topic.values): -1
self._outliers: 1
len(set(y)): 13 (which is also equal to len(labels), the number of input zero-shot labels)

In other words, the issue is the same as that reported by @Bougeant.
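
The printed values are enough to see which index fails. Below is a small sketch of that check; `out_of_range_indices` is a hypothetical helper written for this illustration, and it assumes that every topic id between the minimum and maximum occurs in the data.

```python
def out_of_range_indices(n_names, min_topic, max_topic, outliers):
    """Return the indices `topic + outliers` that a list of n_names cannot serve."""
    lowest = min_topic + outliers
    highest = max_topic + outliers
    # Python accepts indices from -n_names up to n_names - 1.
    return [i for i in range(lowest, highest + 1) if not -n_names <= i < n_names]


# Numbers taken from the comment above.
bad = out_of_range_indices(n_names=78, min_topic=-1, max_topic=77, outliers=1)
print(bad)  # prints [78]
```

Only the largest topic id overflows: 77 + 1 = 78, while the last valid index of a 78-element list is 77.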

lucasgautheron avatar Apr 28 '24 09:04 lucasgautheron

Sorry, a bit late, but this is my code:

from bertopic import BERTopic
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer

data = load_dataset("HuggingFaceH4/h4_10k_prompts_ranked_gen")
docs = data["train_gen"]["prompt"]

zeroshot_topic_list = ['searching knowledge', 'answer coding problem', 'summarizing', 'rephrasing', 'roleplay', 'translate', 'generate content']
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    min_topic_size=20,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.25,
    vectorizer_model=vectorizer_model
)

topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()

I'm running this in a Kaggle notebook, and I think I missed the last line of the error. This is the full screenshot:

[image: screenshot of the full error]

andiwinata avatar Apr 29 '24 05:04 andiwinata

accidentally closed the issue, sorry

andiwinata avatar Apr 29 '24 05:04 andiwinata

I've gotten around the problem with the following patch: https://github.com/MaartenGr/BERTopic/compare/master...lucasgautheron:BERTopic:patch-1

This is probably not the way you want to actually fix it, but I thought I should share

lucasgautheron avatar Apr 29 '24 09:04 lucasgautheron

Thank you all for sharing the code! In all honesty, I'm not entirely sure why it suddenly seems to ignore outliers as the topic label should exist...

Either way, I think I managed to create a fix, but it still has to pass all the tests. Also, seeing as the tests didn't cover this specific issue, could anyone facing this issue also test whether this fix works for them? I would feel a lot more confident that I have addressed this issue if it resolves it for more people than just on my machine.

Here's the PR: https://github.com/MaartenGr/BERTopic/pull/1957

MaartenGr avatar Apr 29 '24 13:04 MaartenGr

@lucasgautheron @andiwinata @Bougeant If you have the time, could you check whether https://github.com/MaartenGr/BERTopic/pull/1957 works?

MaartenGr avatar May 04 '24 08:05 MaartenGr

Hi! Any updates on that? This is a big blocker in my project right now.

mzhadigerov avatar May 06 '24 18:05 mzhadigerov

@mzhadigerov Have you tested the PR I linked in my comment above? If that works for you and also for others, then I can go ahead and create a new release. Until then, please check out the PR.

MaartenGr avatar May 06 '24 18:05 MaartenGr

@MaartenGr Thanks! It is working on my side. I cloned from the fix_1946 branch.

image

mzhadigerov avatar May 06 '24 18:05 mzhadigerov

@MaartenGr but my Representative_Docs of topic -1 are NaN for some reason, even though Count shows 424

mzhadigerov avatar May 06 '24 18:05 mzhadigerov

@mzhadigerov The representative documents are not merged since they are essentially random documents when it concerns topic -1. Topic -1 consists of outliers that do not fall into a single group so the resulting documents are not actually related to one another.

I think representative documents could be added there, but in all honesty, I'm not sure it is worth the effort.

MaartenGr avatar May 07 '24 13:05 MaartenGr

@MaartenGr Alright, if it is supposed to work like that, that's fine (I don't use the representative docs of topic -1 anyway).

I made the comment because the Representative_Docs of topic -1 are not NaN in v0.16.0.

mzhadigerov avatar May 07 '24 19:05 mzhadigerov

@mzhadigerov Thanks for sharing. It is currently low priority but I might bump it if it's important to many users.

MaartenGr avatar May 07 '24 20:05 MaartenGr

For everyone facing this issue in 0.16.1, I just pushed an official 0.16.2 release, which includes the PR I mentioned earlier. There are a bunch of open PRs with a lot of interesting stuff that I will look through in the upcoming weeks. For now, this issue should be resolved.
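
To check whether an environment has the affected release, something like the following sketch could be used. It relies only on the standard library's `importlib.metadata`; the `is_affected` helper is hypothetical, and it treats 0.16.1 as the only affected version because that is the only one reported in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def is_affected(version_string):
    """Return True for 0.16.1, the only release reported as affected here."""
    return version_string.split(".")[:3] == ["0", "16", "1"]


try:
    installed = version("bertopic")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("bertopic is not installed")
elif is_affected(installed):
    # Per this thread: 0.16.2 contains the fix, and 0.16.0 did not show the error.
    print(f"bertopic {installed} is affected; upgrade to 0.16.2 or downgrade to 0.16.0")
else:
    print(f"bertopic {installed} is not the release reported in this thread")
```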

MaartenGr avatar May 12 '24 09:05 MaartenGr

Thank you for the super quick patch; I could not try it yet, but it looks equivalent to my quickfix so I assume it works.

lucasgautheron avatar May 12 '24 09:05 lucasgautheron