
Possible "off by 1" bug in transform() when using reloaded model?

Open A-Posthuman opened this issue 1 year ago • 12 comments

Hi,

While following the best practice for running inference on additional data with an existing BERTopic model, I took the advice to save and reload the model.

The code for the initial model looks like:

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model, representation_model=representation_model, verbose=True, calculate_probabilities=False, 
                       n_gram_range=(1, 2), nr_topics=max_topics)
topics, probs = topic_model.fit_transform(docs, embeddings)

where max_topics is set to 150. This results in 150 topics being created, including the -1 outliers topic, so the regular topics are numbered from 0 to 148.

I then reduce outliers:

new_topics = topic_model.reduce_outliers(documents=docs, topics=topics, strategy="embeddings", embeddings=embeddings)
topic_model.update_topics(docs, topics=new_topics, n_gram_range=(1, 2), vectorizer_model=vectorizer_model, representation_model=representation_model)

After this there is no -1 topic anymore; we are left with topics numbered 0 to 148, which I've confirmed by printing them out:

topic_info = topic_model.get_topic_info()
log.debug(f"frequent topics:\n{topic_info.to_string()}")

So then I save and reload the model:

topic_model.save(path="saved_bertopic", serialization="safetensors", save_ctfidf=True, save_embedding_model=True)
topic_model = BERTopic.load("saved_bertopic", embedding_model=None)

I use save_embedding_model=True despite using my own custom embeddings, due to a bug. I then load in more docs and more embeddings, use transform(), and check the minimum and maximum topic numbers assigned:

more_topics, more_probs = topic_model.transform(documents=more_docs, embeddings=more_embeddings)
log.debug(f"min_topic_number found in more_topics: {np.min(more_topics)}")
log.debug(f"max_topic_number found in more_topics: {np.max(more_topics)}")

The log shows a minimum of 0 and a maximum of 149. Is topic number 149 incorrect? It causes my program to hit a bug later on when trying to access the topic name, because that index doesn't exist.

I also verified that if I do not save/reload the model, but instead run inference on more_docs with the original model, then the min/max topic numbers are 0 and 148.

So it appears that transform() on a saved/reloaded model somehow generates 1 extra topic? Please let me know if I'm using the software incorrectly, thank you.

A-Posthuman avatar Feb 15 '24 04:02 A-Posthuman

I have found that if I comment out the reduce_outliers() and update_topics() calls in my program before saving the model and reloading it, then the reloaded model works without problem when using transform() - the max topic number returned is the correct 148.

I likely haven't fully understood how to best use reduce_outliers() and update_topics(): do they need to be avoided prior to saving a model that I plan to reload in the future for more inference?

A-Posthuman avatar Feb 15 '24 04:02 A-Posthuman

What about the actual topics? Are they different in the two cases? Presumably, if an additional topic is returned, something is going to be different. I think it would be useful to work out whether the topics are offset, totally different, or something else - and what is the additional topic?

anirban-mu avatar Feb 15 '24 16:02 anirban-mu

I just checked with this code:

topic_info = topic_model.get_topic_info()
log.debug(f"topics:\n{topic_info.to_string()}")

When I revert my program to how it originally was, with the reduce_outliers() and update_topics() calls prior to saving/reloading the model, this debugging print shows the same topics both right before reloading the model and just after reloading: they number from 0 to 148, with no outlier topic and no topic 149. Yet the call to transform() assigns some documents to the non-existent topic 149.

When I put the reduce_outliers() and update_topics() calls after saving the model and after using transform() on the reloaded model, then I do get the expected -1 outliers topic logged, along with the same 148 topics both before saving and after reloading, and this time the transform() call works normally with no documents assigned to topic 149.

A-Posthuman avatar Feb 16 '24 02:02 A-Posthuman

"some documents being assigned to non-existent topic 149."

What topics were these documents assigned to prior? Same topics or distinct topics? Is there a pattern, such as 147 mapped to 148, 148 to 149, etc.?

I am looking into the transform code because I have a separate but perhaps related issue. But I haven’t made headway because my issue is hard to replicate.

anirban-mu avatar Feb 16 '24 02:02 anirban-mu

The documents assigned to 149 are from a batch of new documents my program loaded in to perform additional inference on after reloading the bertopic model, so they weren't assigned to anything prior.

I made an alternate version of my program with more logging to see the text of the documents that end up in topic 149, and based on their text they look like docs that should or could probably have been assigned to topic 148 if they hadn't been put in 149.

A-Posthuman avatar Feb 16 '24 03:02 A-Posthuman

Thanks. That is super helpful. One more question. Does it look like the new documents assigned to topic 148 would be better suited to 147?

Some of these issues seem to coalesce around self.outliers: whether its value is 0 or 1, and whether it is added or subtracted, may be the source of the issue. I would hazard a guess that perhaps self.outliers = 0 when the topic misalignment occurs in your case.
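
As a pure-Python illustration of this hedged guess (not BERTopic's actual code): if transform() subtracted self._outliers when converting matched embedding-row indices into topic ids, a stale value of 0 would shift every assignment up by one:

```python
def rows_to_topics(row_indices, outliers):
    # Hypothetical mapping: row 0 of the topic-embedding matrix is the
    # outlier topic when outliers == 1, so topic id = row index - outliers.
    return [i - outliers for i in row_indices]

rows = [0, 1, 2, 149]                      # rows matched in a 150-row matrix
print(rows_to_topics(rows, outliers=1))    # [-1, 0, 1, 148] -> as intended
print(rows_to_topics(rows, outliers=0))    # [0, 1, 2, 149]  -> off by one
```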

anirban-mu avatar Feb 16 '24 19:02 anirban-mu

Yes your hunch appears correct after I examined my logs further, the new documents all seem to be "off by 1" on their topic numbers. The 148 docs should actually be in 147. The docs assigned to 1 should be in 0, etc.

The docs assigned to 0... hard to say but they could be outliers, they seem a bit random looking.

The topic names and keywords do correctly match the right topic numbers. Topic 149 has a missing name, and its keywords are False.

A-Posthuman avatar Feb 17 '24 04:02 A-Posthuman

Ok great. self.outliers is accessible as a property of a trained model object. If my hunch is correct, the value will be different.

anirban-mu avatar Feb 17 '24 06:02 anirban-mu

Ok I checked self._outliers at 3 points in the process, here are the results:

  1. On the initial model, just after fit_transform(): 1
  2. After reducing outliers and update_topics(): 0
  3. After saving/reloading the model: 0

A-Posthuman avatar Feb 17 '24 15:02 A-Posthuman

@A-Posthuman Hmmm, I'm not quite sure what is happening here since self._outliers should be 0 after step 2, which it is in your case.

Could you try running the model without nr_topics=max_topics?

MaartenGr avatar Feb 18 '24 18:02 MaartenGr

Ok here is a run without using nr_topics:

  1. On the initial model, just after fit_transform(): 1
  2. After reducing outliers and update_topics(): 0
  3. After saving/reloading the model: 0

So, same results. The topic count increased to 330, but otherwise the behavior was the same, with an extra topic 331 showing up when using transform() on the reloaded model.

A-Posthuman avatar Feb 18 '24 18:02 A-Posthuman

Two questions:

  1. Is the following a pathway to an error if an AttributeError occurs?

https://github.com/MaartenGr/BERTopic/blob/99ee553e3ee00fa7189d3210bdc618a7c7a943c8/bertopic/_bertopic.py#L333
https://github.com/MaartenGr/BERTopic/blob/99ee553e3ee00fa7189d3210bdc618a7c7a943c8/bertopic/_bertopic.py#L411
https://github.com/MaartenGr/BERTopic/blob/99ee553e3ee00fa7189d3210bdc618a7c7a943c8/bertopic/_bertopic.py#L3456
https://github.com/MaartenGr/BERTopic/blob/99ee553e3ee00fa7189d3210bdc618a7c7a943c8/bertopic/_bertopic.py#L3485
https://github.com/MaartenGr/BERTopic/blob/99ee553e3ee00fa7189d3210bdc618a7c7a943c8/bertopic/_bertopic.py#L3492

  2. Should this be -3 or -2?

https://github.com/MaartenGr/BERTopic/blob/99ee553e3ee00fa7189d3210bdc618a7c7a943c8/bertopic/_bertopic.py#L4215
https://github.com/MaartenGr/BERTopic/blob/99ee553e3ee00fa7189d3210bdc618a7c7a943c8/bertopic/_bertopic.py#L4387

Sorry if I am off base. This is an incredibly useful library and I am grateful for your efforts in developing and maintaining it. Debugging through it though is a bit ... complex. :-)

anirban-mu avatar Feb 18 '24 20:02 anirban-mu

Is the following a pathway to an error if an AttributeError occurs?

I do not think so since the OP is not using semi-supervised topic modeling.

Should this be -3 or -2?

That should be -3 as you have the original topics that were created and topics that were re-ordered multiple times.

@A-Posthuman Let's take a step back, shall we? Could you share your full code? An end-to-end example would be great, as it shows more information about sub-models and even the packages that you use. Moreover, could you check whether some attributes of BERTopic actually contain the additional topic? For instance, do either topic_model.topic_representations_ or topic_model.topic_embeddings_ indicate the appearance of an additional topic? It might be that the embeddings are not properly updated.

MaartenGr avatar Feb 19 '24 05:02 MaartenGr

topic_model.topic_representations_ has these characteristics at the 3 points in the program where I was checking self._outliers:

  1. has a -1 key, and the last key is 148 (this run was limited to 150 topics with nr_topics)
  2. 0 is now the first key, and the last key is 148
  3. 0 is the first key, and the last key is 148

topic_model.topic_embeddings_ characteristics:

  1. shape is (150, 384)
  2. shape is (150, 384)
  3. shape is (150, 384)

As for the code, I provided a lot of the bertopic-specific parts already. The entire program is very long, so I can't provide the whole thing. Here are some other bertopic-related parts:

The docs are just text strings appended onto a list: docs.append(final_title). My embeddings are loaded from a local diskcache Index, embeddings.append(np.array(temp_dict["vector"])), and then that list is turned into an array: embeddings = np.array(embeddings).

other parameters:

umap_model = UMAP(n_neighbors=n_neighbors, n_components=5, min_dist=0.0, metric='cosine', low_memory=True, random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
representation_model = MaximalMarginalRelevance(diversity=0.2)

after that the bertopic model is instantiated, and then the various code bits I already described are done, starting with fit_transform(), etc.

Prior to doing the first reduce_outliers() call, I check whether I need to bother with it:

get_topic_result = topic_model.get_topic(-1)
if (get_topic_result is not False):

After saving the model, I create a topic name dict for use later in my program (this later causes the traceback, as there is no entry for topic 149):

topic_name_dict = topic_info.set_index('Topic')['Name'].to_dict()
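
A defensive lookup on that dict (a sketch with hypothetical contents) would sidestep the traceback when transform() emits a topic id with no entry:

```python
topic_name_dict = {0: "0_example", 148: "148_example"}   # hypothetical contents

def safe_topic_name(topic_id, names):
    # dict.get avoids a KeyError for an unexpected id such as 149
    return names.get(topic_id, f"unknown_topic_{topic_id}")

print(safe_topic_name(148, topic_name_dict))   # 148_example
print(safe_topic_name(149, topic_name_dict))   # unknown_topic_149
```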

Then, before continuing on to the loop in my code that does the inference on my additional documents, I delete the embeddings to save RAM:

del embeddings

Then the part of the program that does the extra inference prints some info again:

topic_info = topic_model.get_topic_info()
log.debug(f"reloaded frequent topics:\n{topic_info.to_string()}")
num_topics = 0
for i, row in topic_info.iterrows():
    if row["Topic"] != -1:
        num_topics += 1
log.debug(f"num_topics: {num_topics}")

The additional docs and embeddings are loaded into more_docs and more_embeddings, similar to earlier in the program.

transform() is used as described in my first message, then reduce_outliers() is done if necessary, and the model is reloaded for the next use. The additional data is extend()ed onto my original data, more_embeddings is cleared, and the loop continues.

I don't see any other bertopic-related stuff in my program.

A-Posthuman avatar Feb 19 '24 06:02 A-Posthuman

Thanks for your additional descriptions. It is a bit difficult to piece together these individual components. Would it be possible to create a minimal example that demonstrates this issue? That might speed up finding the underlying cause.

Lastly, what happens if you add the following before running .transform:

topic_model._outliers = 1

Because it seems that the topic embeddings that were generated in your case are somehow still the old embeddings.

MaartenGr avatar Feb 20 '24 06:02 MaartenGr

If I put this before running .transform: topic_model._outliers = 1

The result is that I do then have some docs assigned to topic -1, and the maximum topic number assigned to any doc is now the correct 148. However, this -1 outliers topic doesn't seem to be fully working in the model, because this code:

get_topic_result = topic_model.get_topic(-1)

seems to return False.

I'm not quite clear what you mean by the "old embeddings", as the original fit model that was saved has a min topic number 0 (outliers all removed) and max topic number of 148. Shouldn't the reloaded model have identical topic numbers?

A-Posthuman avatar Feb 21 '24 05:02 A-Posthuman

I'm not quite clear what you mean by the "old embeddings", as the original fit model that was saved has a min topic number 0 (outliers all removed) and max topic number of 148.

The topic embeddings. I meant that the topic embeddings seem to have a different shape than what should be expected. I would have expected them to have shape 149 instead of 150.
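
A numpy sketch (illustrative only, with a hypothetical nearest-embedding match) of how a stale 150-row topic-embedding matrix combined with self._outliers = 0 can produce topic id 149:

```python
import numpy as np

rng = np.random.default_rng(42)
n_topics, dim = 149, 384            # topics 0..148 after outlier reduction

# Stale matrix: still carries the old outlier row, so 150 rows instead of 149
stale_embeddings = rng.normal(size=(n_topics + 1, dim))

doc = stale_embeddings[-1]          # a doc whose embedding matches the last row
best_row = int(np.argmax(stale_embeddings @ doc))   # nearest-row index: 149

outliers = 0                        # value after reduce_outliers/update_topics
print(best_row - outliers)          # 149, one past the last real topic (148)
```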

Can you prepare a reproducible example? That would make debugging much easier!

MaartenGr avatar Feb 21 '24 13:02 MaartenGr

Ok, I have attached a program example that reproduces the issue. I cut out a large part of my program and tried to leave only the key bits, but there is still some stuff you don't need and will have to modify slightly, namely how the program loads in the docs and embeddings. In my case they come from a diskcache Index, so be sure to change that to load your own data.

My test data is around 42k docs, and I set a max_rows variable so that bertopic only works on 10k items at a time, both to limit memory use and to test the issue, which only seems to arise later in the program when it reloads the saved model and runs inference on further batches of 10k docs. If your test docs number more or fewer than this, you'll probably want to modify these numbers so that the program doesn't process all the docs in the first load section, but instead has to reload the model and load/inference more docs later on. The program will then hit an error near the end, at line 289, when it tries to access the non-existent topic number 149.

bertopic_all-test3.txt

A-Posthuman avatar Feb 23 '24 23:02 A-Posthuman

Alright, this took a while to figure out, but it seems I understand what happened here: the internal topic embeddings were not properly updated.

I created a fix in #1809. Could you try that and see if that works for you?

MaartenGr avatar Feb 26 '24 10:02 MaartenGr

@A-Posthuman FYI, I made a small update in that PR that should now be working.

MaartenGr avatar Feb 26 '24 14:02 MaartenGr

That is wonderful that you were able to track down the bug, thanks. As for testing it on my end, unfortunately my server and python venv containing this code moved into production just in the past couple of days, so I'm not able to test this on the production system. I also don't have the time at the moment to set up a completely separate dev instance and venv, at least probably not this week. I would say that if a test program (perhaps based on my test program) fails on the old bertopic but passes with your new patch in place, then I'd consider the bug fixed.

A-Posthuman avatar Feb 26 '24 15:02 A-Posthuman

That's alright! Also, I think in your case a fix would essentially be this after loading the model:

topic_model.topic_embeddings_ = topic_model.topic_embeddings_[1:]

It seemed that the outlier topic embedding was not properly removed, so this should fix it for you. You can keep your environment as is and simply add the above line before doing the inference.
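
A guarded version of that one-liner (a sketch, with a hypothetical helper name, checking the shape first so it is a no-op on a model that is already correct):

```python
import numpy as np

def drop_stale_outlier_row(topic_embeddings, n_topics, outliers):
    # Only trim when the matrix has exactly one row too many and the model
    # no longer tracks an outlier topic; otherwise return it unchanged.
    if outliers == 0 and topic_embeddings.shape[0] == n_topics + 1:
        return topic_embeddings[1:]
    return topic_embeddings

emb = np.zeros((150, 384))                            # stale: 150 rows
fixed = drop_stale_outlier_row(emb, n_topics=149, outliers=0)
print(fixed.shape)                                    # (149, 384)
```

Applying it a second time leaves the matrix alone, so it is safe to run unconditionally after loading.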

MaartenGr avatar Feb 28 '24 19:02 MaartenGr

Perfect, thanks for that tip; I will try it out at some point.

A-Posthuman avatar Feb 28 '24 20:02 A-Posthuman