
Good practices for saving models and performance on AWS Lambda?

Open phuclh opened this issue 3 years ago • 5 comments

First, thank you for creating this awesome package. It has great documentation for beginners like me. I have set up BERTopic on AWS Lambda. I am using all-mpnet-base-v2 as the embedding model, and I have documents that I crawled from a website.

After running this: topics, _ = topic_model.fit_transform(docs), should I save the topic_model to load it for the next time? I don't understand what the model will save. Will it save my docs or the topics? Moreover, what are the benefits of saving the model? Will it just make the next process run faster, or are there other benefits? I am asking because AWS Lambda limits storage to 10GB, so I don't think I can save a lot of models.

Another question: I am using Lambda with 10GB of memory (which is the maximum), but the process is very slow. It takes over 10 minutes for a 40k-document dataset (it only takes 3 minutes on Google Colab). I read the FAQ, and you said to increase performance by using a GPU, but AWS Lambda doesn't support GPUs. Is there any way to increase the performance on AWS Lambda?

Thank you so much!

phuclh avatar Aug 27 '22 03:08 phuclh

After running this: topics, _ = topic_model.fit_transform(docs), should I save the topic_model to load it for the next time?

Yes, you are fitting a model on your data, and that model then contains certain parameters and values for finding and predicting topics on unseen documents. So if you want to use your trained model on other documents, you should definitely save it and load it the next time.

I don't understand what the model will save. Will it save my docs or the topics?

Most notably, the model will save the trained sub-models, like UMAP, HDBSCAN, and c-TF-IDF. All of these are trained on your specific data, and we need them if we want to predict topics. The documents themselves will not be saved, as they are the data on which you train; that kind of data is typically not stored in a model, since models are generally not meant for keeping large amounts of data. The topics themselves will definitely be saved, as they are exactly what you want to learn from your documents.
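For example, a saved model can be loaded later and inspected without the original documents. A minimal sketch using BERTopic's get_topic_info and get_topic methods (the filename "my_model" here is illustrative):

from bertopic import BERTopic

# Load a previously trained and saved model; the training docs are not needed
topic_model = BERTopic.load("my_model")

# Overview of the discovered topics (id, size, and name)
print(topic_model.get_topic_info())

# Top words with their c-TF-IDF scores for topic 0
print(topic_model.get_topic(0))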

Moreover, what are the benefits of saving the model? Will it just make the next process run faster, or are there other benefits?

Whenever you perform a fit_transform on your data, you will have created a model that has learned from that data. Then, you can use transform to predict topics for unseen documents. This is not possible if you did not train, and thereby save, the model before loading it.

I am asking because AWS Lambda limits storage to 10GB, so I don't think I can save a lot of models.

Typically, you only need to train a single model to capture your entire dataset. It depends on the size of your data but there might not be a need to save multiple models.

Another question: I am using Lambda with 10GB of memory (which is the maximum), but the process is very slow. It takes over 10 minutes for a 40k-document dataset (it only takes 3 minutes on Google Colab). I read the FAQ, and you said to increase performance by using a GPU, but AWS Lambda doesn't support GPUs. Is there any way to increase the performance on AWS Lambda?

The embeddings are created using sentence-transformers, which is a transformer-based package that works best if you use a GPU. If you only have a CPU, then it might be worthwhile to select an embedding technique that is a bit faster on the CPU. Other than that, all-mpnet-base-v2 is quite accurate but slower than some other models out there, like all-MiniLM-L6-v2, so using the latter might be worthwhile in your case.
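As a rough sketch of what that could look like, you can pass the smaller model to BERTopic and also pre-compute the embeddings yourself with sentence-transformers, which makes the slowest step explicit (docs stands in for your own list of documents):

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# A smaller, faster embedding model that still performs well on CPU
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Pre-compute the embeddings; on CPU this is typically the slowest step
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Pass both the model and the pre-computed embeddings to BERTopic
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs, embeddings)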

MaartenGr avatar Aug 27 '22 07:08 MaartenGr

Thank you so much for the quick response. You're awesome.

  1. I am crawling data from multiple websites. Is it still good to save only one model for all websites, since each website has different topics? (I used to think I should save a model for each website.)

  2. So if I save the model:

topic_model.save("my_model")

Next time, I need to add more documents to the model like this, right?

topic_model = BERTopic.load("my_model")

topics, _ = topic_model.fit_transform(docs)

  3. If I save a model, I cannot change parameters like embedding_model, min_topic_size, and nr_topics. Is that correct?

  4. As you said above, the saved model will contain HDBSCAN, UMAP, etc. Therefore, those things won't be run in subsequent requests, right?

Thank you so much. That's all questions I have.

phuclh avatar Aug 27 '22 07:08 phuclh

I am crawling data from multiple websites. Is it still good to save only one model for all websites, since each website has different topics? (I used to think I should save a model for each website.)

A topic model tries to extract a number of topics from a set of documents. It actually performs worse if you give it documents that together only contain a single topic. In other words, it is not necessary to create a model for each website.

Next time, I need to add more documents to the model like this, right?

No. It follows the general principle in machine learning: you use .fit or .fit_transform to train your model on a set of documents. Then, when you want to use it on unseen documents, you use .transform.

The pipeline would then look like this:

# Train model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(my_documents)

# Save the model
topic_model.save("my_model")

# Load the model and predict new instances
topic_model = BERTopic.load("my_model")
topics, probs = topic_model.transform(documents_not_seen_before)

If I save a model, I cannot change parameters like embedding_model, min_topic_size, and nr_topics. Is that correct?

That is correct. These parameters are used when you fit a model, so changing them requires training a new one.
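To illustrate, these parameters are set when the model is constructed and take effect during fitting; a saved model keeps the values it was trained with (the values below are only examples):

from bertopic import BERTopic

# Parameters are fixed at construction time and baked in during fitting
topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",  # passed through to sentence-transformers
    min_topic_size=20,                   # minimum cluster size used by HDBSCAN
    nr_topics="auto"                     # automatically reduce topics after fitting
)
topics, probs = topic_model.fit_transform(docs)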

As you said above, the saved model will contain HDBSCAN, UMAP, etc. Therefore, those things won't be run in subsequent requests, right?

The saved model will indeed contain these sub-models, and they will still be run in subsequent requests, since those are the models that were trained on your data and they are needed to predict topics for unseen documents. They are only used for prediction at that point, though, not re-trained.

MaartenGr avatar Aug 28 '22 06:08 MaartenGr

Oh gotcha. So the good practice is crawling as much data as I can across multiple topics, then saving the model and using it whenever I have more unseen documents. Is that correct? Does the transform method still train the model when adding new unseen documents? And should I save the model after using the transform method?

How many documents, on average, should I use for training models? Or can I just use the popular 20 Newsgroups dataset which you used in this example: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

Thank you!

phuclh avatar Aug 28 '22 07:08 phuclh

Oh gotcha. So the good practice is crawling as much data as I can across multiple topics, then saving the model and using it whenever I have more unseen documents. Is that correct?

Yes! However, it is important that you use data that suits your use case. For example, if you work in a hospital, it might make sense to use personal health records to create a topic model. In other words, topic modeling techniques are typically used on data that you have for a specific use case or domain. At the moment, they are not really used for pre-training millions of topics.

Does the transform method still train the model when adding new unseen documents?

No, the model is trained using .fit or .fit_transform, and using .transform does not train the model. Do note that the .transform method also does not add the new unseen documents to the model; it merely predicts which topics those documents contain.
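A quick way to convince yourself of this: the set of topics in the model is identical before and after calling .transform (new_docs here is a stand-in for your unseen documents):

# Predicting does not change the model's topics
n_topics_before = len(topic_model.get_topic_info())
topics, probs = topic_model.transform(new_docs)
assert len(topic_model.get_topic_info()) == n_topics_before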

And should I save the model after using the transform method?

That is not necessary since the .transform method does not change the topic model in any way.

How many documents, on average, should I use for training models?

That depends on your use case. Typically, I would advise at least a couple of thousand documents.

Or can I just use the popular 20 Newsgroups dataset which you used in this example: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

I would not advise doing that, since the topics extracted from that dataset are unlikely to match your specific use case.

MaartenGr avatar Aug 28 '22 09:08 MaartenGr

Due to inactivity, I'll be closing this for now. If you have any questions or want to continue the discussion, I'll make sure to re-open the issue!

MaartenGr avatar Jan 09 '23 12:01 MaartenGr