
help sought to train a big data sentence model (up to 1.5 million sentences)

Open schetudiante opened this issue 3 years ago • 46 comments

Hey Maarten, Firstly, thank you for all the help you have given up to this point! 👍 👍 👍 I want to visualise the top topics using the same logic you so nicely showed here https://github.com/MaartenGr/BERTopic/issues/126#issuecomment-855606679 - thank you for that. ❤️
However, I am a bit curious how one could feed a big dataset of sentences to the model without blowing up the memory. Can you suggest something? For example, when we call topics, _ = topic_model.fit_transform(docs), how could one feed that many sentences to the model?

The intention, in the end, is to visualise the top topics, as you already showed in https://github.com/MaartenGr/BERTopic/issues/126#issuecomment-855606679, to get a nice visualisation.

Thanks Maarten for everything 🙏

schetudiante avatar Jun 21 '21 18:06 schetudiante

No problem, glad I could be of help!

There are several ways to perform computation with large datasets. First, you can set low_memory to True when instantiating BERTopic. This may prevent blowing up the memory in UMAP.

Second, setting calculate_probabilities to False when instantiating BERTopic prevents a huge document-topic probability matrix from being created. Moreover, HDBSCAN is quite slow when it tries to calculate probabilities on large datasets.

Third, you can set the minimum frequency of words in the CountVectorizer class to reduce the size of the resulting sparse c-TF-IDF matrix. You can do this as follows:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)

The min_df parameter is used to indicate the minimum frequency of words. Setting this value larger than 1 can significantly reduce memory.
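Putting those first three suggestions together, a minimal sketch could look like this (here, docs is assumed to be your list of sentences and min_df=10 is just an example value):

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Reduced vocabulary, no probability matrix, and UMAP's low-memory mode
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model, low_memory=True, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs)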

Lastly, and this is a bit on the nose, simply use a machine with more RAM available. Some machines are simply not meant to process such large datasets or memory-intensive algorithms, so using a larger machine, if one is available, could help.

Also, make sure you do not actually visualize all 1.5 million points in the visualization I shared with you. Simply take a weighted sample across all topics (e.g., 10%) and visualize those. Otherwise, matplotlib might have some issues plotting all those points.
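As a rough sketch of such a weighted sample (assuming topics comes from fit_transform and df is a pandas DataFrame with one row per document, e.g. the 2D coordinates used for plotting; the names here are illustrative):

import pandas as pd

df["topic"] = topics
# Take roughly 10% of the documents within each topic so that every topic
# remains represented while the total number of points stays manageable
sampled = df.groupby("topic", group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=42))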

Hopefully, this helps a bit!

MaartenGr avatar Jun 23 '21 05:06 MaartenGr

Hello! I plan to train the model on millions of documents, but it takes too long. I tried splitting the texts into several smaller lists and training the model incrementally, but transform does not let me train the model incrementally. Is there any other way?

TigerShuai avatar Jun 28 '21 03:06 TigerShuai

@TigerShuai Hopefully, Google Translate was accurate in translating your issue. It seems that you want to iteratively train a BERTopic model since you have too many documents that take too long to train.

Unfortunately, this is not supported and is unlikely to be supported in the future as the model performs best when you use all documents. Having said that, I would advise several things. First, make sure you use a strong GPU. This will speed up the training procedure quite a bit. Second, set low_memory=True if you are experiencing memory issues. Third, set calculate_probabilities=False as that is a very slow procedure. Finally, I would advise you to use verbose=True and see where the training slows down. If I know what takes so long, perhaps I can propose a solution!

MaartenGr avatar Jun 28 '21 07:06 MaartenGr

Thank you Maarten for your graciousness! I am just curious about two things:

  1. Could I save the trained model after training completes, and then 2. use the saved model (and embeddings) to visualise the topics like you say here:

Also, make sure you do not actually visualize all 1.5 million points in the visualization I shared with you. Simply take a weighted sample across all topics (e.g., 10%) and visualize those. Otherwise, matplotlib might have some issues plotting all those points.

If possible, can you give an example of how to save the (huge) model and then visualise it as you describe?

It would be an awesome thing (and I already feel like sending you a gift now :) You inspire us!

schetudiante avatar Jun 29 '21 11:06 schetudiante

You can save the model with:

from bertopic import BERTopic
topic_model = BERTopic().fit(docs)
topic_model.save("my_model")

Then, you can use the saved model to visualize the topics. You can find a bit more about saving and loading in the documentation here.
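Loading it back in later is then a matter of:

from bertopic import BERTopic
topic_model = BERTopic.load("my_model")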

The part about the weighted sample is something you will have to code yourself. Adjust this code to only select a subset of documents to visualize (e.g., 100,000 documents instead of all 1.5 million). To do that, add df = df.sample(n=100_000) directly after df["topic"] = topics so that only a sample of the documents is visualized.
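In other words, the adjustment to the linked code would roughly be (the 100,000 is just an example size):

df["topic"] = topics
df = df.sample(n=100_000)  # visualize only a random sample of the documents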

MaartenGr avatar Jun 30 '21 08:06 MaartenGr

Hey Maarten, thanks for your ever-graciousness, a big thank you for that once again! Can you give me a pointer on how to load the documents into the model? I have 1.5 million sentences (documents) in a text file (or I can put them in multiple text files), and I see that the model takes a single argument when loading the documents, as in: topics, _ = topic_model.fit_transform(docs)

What I mean is: previously, I would simply load the documents (up to, say, 32,000 of them) into a list and pass that list as the docs argument. I wonder whether the same can be done for all 1.5 million documents, or maybe you can suggest some other way.

still waiting on this btw,

It would be an awesome thing (and I already feel like sending you a gift now :) (https://github.com/MaartenGr/BERTopic/issues/151#issuecomment-870529679)

schetudiante avatar Jul 04 '21 11:07 schetudiante

Personally, I would simply put all those 1.5 million sentences (documents) in a list and then pass that list as the docs argument. If you have enough RAM available, this should be no issue. If, however, you run into memory issues, then I would advise you to look here for a few tips on how to run BERTopic on large data.

It would be an awesome thing (and I already feel like sending you a gift now :) (#151 (comment))

Don't worry about that! I'm just glad that I can help out.

MaartenGr avatar Jul 05 '21 13:07 MaartenGr

Is it possible to fit first and then transform the documents in small chunks (i.e., not use fit_transform, but call fit first and then call transform on smaller chunks of data)? @MaartenGr

If I have 1.5 million of sentences, for example, can I fit with all 1.5 million sentences and then transform 500k sentences at a time for 3 times?

ginward avatar Sep 25 '21 16:09 ginward

@ginward You can definitely fit the model once on a subset of the data and simply transform for all others. Typically, you can get away with a few hundred thousand documents. You really do not need to train on millions of sentences to improve the model as sufficient data is most likely already given.

Thus, you can fit on 200,000 sentences and simply predict the other 1.3 million sentences.

The only thing that you should take into account is selecting those 200,000 sentences. If you are looking for very specific topics that are likely to only appear a few thousand times, then there is a good chance that you will not capture those in the model. Thus, proper sampling here is key.
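A rough sketch of that fit-on-a-subset workflow, where the simple random sample is only a placeholder for a proper sampling strategy:

import random
from bertopic import BERTopic

# docs is assumed to be the full list of 1.5 million sentences
random.seed(42)
subset = random.sample(docs, 200_000)

topic_model = BERTopic(low_memory=True, calculate_probabilities=False, verbose=True)
topic_model.fit(subset)

# Predict topics for the full dataset with the already-fitted model
topics, _ = topic_model.transform(docs)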

MaartenGr avatar Sep 26 '21 06:09 MaartenGr

@MaartenGr I see. So should I call fit on the 200,000 sentences, and then call transform on the 1.3 million sentences?

ginward avatar Sep 26 '21 08:09 ginward

But if transform takes a lot of memory, can I transform smaller chunks (such as several chunks of 200,000, summing up to 1.3 million sentences) and then combine the results?

ginward avatar Sep 26 '21 09:09 ginward

Yes, you can fit on the 200,000 sentences and then call transform on the remaining 1.3 million sentences, as long as you are sure that the 200,000 sentences are a good representation of the remaining 1.3 million.

The fit stage can take a lot of memory, whereas the transform stage should require much less. It should be okay to transform them all at once. However, if you are still experiencing memory issues, there should be no problem in separating them into smaller chunks and combining the results.
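If you do want to chunk the transform step, a minimal sketch could look like this (remaining_docs and the number of chunks are assumptions):

import numpy as np

all_topics = []
# Process the remaining documents in smaller chunks and stitch the predictions together
for chunk in np.array_split(remaining_docs, 7):
    chunk_topics, _ = topic_model.transform(list(chunk))
    all_topics.extend(chunk_topics)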

MaartenGr avatar Sep 26 '21 09:09 MaartenGr

@MaartenGr I see. Is it possible to further reduce the memory usage by tuning the hyperparameters of UMAP (i.e., reducing the dimensionality of the document embeddings further) or HDBSCAN (fewer clusters)? And why did you use HDBSCAN rather than a k-nearest-neighbours algorithm?

ginward avatar Sep 26 '21 10:09 ginward

For example, if k-means consumes less memory, can we use k-means instead of HDBSCAN?

ginward avatar Sep 26 '21 10:09 ginward

It seems that the memory issues occur not in the sentence embedding stage or the UMAP stage, but in the HDBSCAN stage. I currently have about 10 million short sentences. I think it is in the final stage that the memory usage shoots up.

ginward avatar Sep 26 '21 10:09 ginward

There are a few tricks you can do with respect to UMAP and HDBSCAN, which are outlined here. In practice, there are a number of places where memory consumption may increase (UMAP, HDBSCAN, c-TF-IDF, etc.), and the link above shows where you can optimize them.

Swapping out HDBSCAN for k-Means will result in a significantly less accurate model. There are quite a few benefits to HDBSCAN over k-Means, including outlier detection, its hierarchical nature, its density-based approach, etc.

With 10 million sentences, I would advise not trying to optimize the algorithms but focusing on how you apply BERTopic. As mentioned above, fit on a subset and predict for all others.

MaartenGr avatar Sep 26 '21 11:09 MaartenGr

@MaartenGr What if I reduce the UMAP output dimensionality to 2 (in the source code it was originally set to five)? Would that relieve some of the burden on HDBSCAN?

ginward avatar Sep 26 '21 11:09 ginward

That would likewise reduce the quality of the model and is not something I would recommend. Since you have millions of data points, I would instead advise not training on the entire dataset to lower the memory requirements. That seems to be the most efficient way of handling this without the need to optimize/change/adapt the sub-algorithms.

MaartenGr avatar Sep 26 '21 11:09 MaartenGr

@MaartenGr Thanks. What is the maximum number of sentences that the model can handle from your experience?

ginward avatar Sep 26 '21 11:09 ginward

This is a difficult question to answer since it highly depends on your hardware specs. A free Google Colab session handles a couple of hundred thousand sentences without issues but runs into trouble as you approach a million. However, there are plenty of organizations (including where I currently work) that can handle a couple of million sentences without any problems.

Also, it depends on the length of the sentences, the number of words, vocabulary size, etc.

MaartenGr avatar Sep 26 '21 11:09 MaartenGr

@MaartenGr Is there also a way to separate the processes of sentence embedding, UMAP, and HDBSCAN by saving the intermediate models? If the memory blows up at the last stage (HDBSCAN), I would need to redo the sentence embedding and UMAP parts, and that would take another few hours.

ginward avatar Sep 26 '21 11:09 ginward

I currently have a Colab Pro+ subscription with 55GB RAM, and the model seems to work through the sentence embedding stage and the UMAP stage quite well, but dies at the very last stage, HDBSCAN.

I also have HPC access to a machine with 4 GPUs and 96GB RAM in total, but I can only use one GPU and it still blows up at the very last stage.

ginward avatar Sep 26 '21 11:09 ginward

I am not sure if a single GPU card can use all the 96GB RAM available in the machine, as the other 48GB is in the other three GPU cards. But nevertheless, the model still blows up at the last stage. @MaartenGr

ginward avatar Sep 26 '21 11:09 ginward

You can try to embed the sentences beforehand by following this piece of documentation. After that, you can simply save the embeddings and load them in when necessary. There is currently no similar option for saving the intermediate UMAP step.
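A minimal sketch of that pre-computation, following the linked documentation (the embedding model name and file path are just examples):

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
import numpy as np

# Embed once (ideally on a GPU machine) and save the result
sentence_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)
np.save("embeddings.npy", embeddings)

# Later, possibly on a CPU-only machine, reuse the saved embeddings
embeddings = np.load("embeddings.npy")
topic_model = BERTopic(low_memory=True, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs, embeddings)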

I currently have a Colab Pro+ subscription with 55GB memory, and the model seems to work through the sentence embedding stage and the UMAP stage quite well, but dies at the very last stage, HDBSCAN. I also have HPC access to a machine with 4 GPUs and 96GB of memory in total, but I can only use one GPU and it still blows up at the very last stage.

The "last stage" technically is not HDBSCAN but topic extraction with c-TF-IDF and MMR. Having said that, I cannot judge what exactly is happening here without knowing the code you are using. Could you share the code for training BERTopic? Also, if you have set verbose=True, what has it printed until you get the memory issues?

I also have HPC access to a machine with 4 GPUs and 96GB RAM in total, but I can only use one GPU and it still blows up at the very last stage.

Do you mean VRAM or RAM? HDBSCAN is not GPU-accelerated.

MaartenGr avatar Sep 26 '21 11:09 MaartenGr

@MaartenGr If only the sentence transformer part runs on the GPU, can I compute the embeddings first and then run the other parts on a machine with only CPU access? I have a machine with 128 GB of CPU RAM, which might just work.

ginward avatar Sep 26 '21 11:09 ginward

@MaartenGr It is 96GB RAM and 16GB VRAM. Apparently 96GB RAM is not enough to get the 10 million sentences done.

I am using a customised dataset, but the code is here:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=20)

# UMAP model reducing the embeddings to 3 dimensions
umap_model = UMAP(n_neighbors=15, n_components=3, min_dist=0.0, metric='cosine', low_memory=True)

# Setting HDBSCAN model (HDBSCAN itself takes no umap_model argument)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Note: umap_model and hdbscan_model are defined above but not passed to BERTopic here
topic_model = BERTopic(verbose=True, seed_topic_list=seed_topic_list, embedding_model="paraphrase-MiniLM-L3-v2", low_memory=True, calculate_probabilities=False, vectorizer_model=vectorizer_model)

I have set the min_df=20, which is a very large threshold.

ginward avatar Sep 26 '21 11:09 ginward

@MaartenGr Would setting ngram_range=(1, 1) help though? It might reduce the TF-IDF matrix size.

ginward avatar Sep 26 '21 11:09 ginward

Setting ngram_range=(1, 1) would help but reduces the ease of interpretation and the quality of the topic representations, since 2-grams often give interesting insights. For millions of sentences, min_df=20 isn't actually a very large threshold; I think it should pose no issue to set it to at least 100. If you have millions of sentences, the frequencies of words in your vocabulary tend to be quite large.

If only the sentence transformer part is done on GPU, can I train the embeddings first and then run the other parts on a machine with only CPU access? I have a machine with 128 GB CPU RAM, which might just work OK.

Yes, only the embedding part benefits from having a GPU.

MaartenGr avatar Sep 26 '21 12:09 MaartenGr

I think it crashed at the UMAP stage @MaartenGr :


---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_159881/2313223528.py in <module>
      1 #topics, probs = topic_model.fit_transform(docs)
      2 
----> 3 topic_model = topic_model.fit(docs, embeddings)

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit(self, documents, embeddings, y)
    210         ```
    211         """
--> 212         self.fit_transform(documents, embeddings, y)
    213         return self
    214 

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
    286         if self.seed_topic_list is not None and self.embedding_model is not None:
    287             y, embeddings = self._guided_topic_modeling(embeddings)
--> 288         umap_embeddings = self._reduce_dimensionality(embeddings, y)
    289 
    290         # Cluster UMAP embeddings with HDBSCAN

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/bertopic/_bertopic.py in _reduce_dimensionality(self, embeddings, y)
   1364                                    low_memory=self.low_memory).fit(embeddings, y=y)
   1365         else:
-> 1366             self.umap_model.fit(embeddings, y=y)
   1367         umap_embeddings = self.umap_model.transform(embeddings)
   1368         logger.info("Reduced dimensionality with UMAP")

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/umap_.py in fit(self, X, y)
   2551 
   2552         if self.transform_mode == "embedding":
-> 2553             self.embedding_, aux_data = self._fit_embed_data(
   2554                 self._raw_data[index], n_epochs, init, random_state,  # JH why raw data?
   2555             )

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/umap_.py in _fit_embed_data(self, X, n_epochs, init, random_state)
   2578         replaced by subclasses.
   2579         """
-> 2580         return simplicial_set_embedding(
   2581             X,
   2582             self.graph_,

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose)
   1052     elif isinstance(init, str) and init == "spectral":
   1053         # We add a little noise to avoid local minima for optimization to come
-> 1054         initialisation = spectral_layout(
   1055             data,
   1056             graph,

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/spectral.py in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    299 
    300     if n_components > 1:
--> 301         return multi_component_layout(
    302             data,
    303             graph,

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/umap/spectral.py in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
    236         num_lanczos_vectors = max(2 * k + 1, int(np.sqrt(component_graph.shape[0])))
    237         try:
--> 238             eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
    239                 L,
    240                 k,

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1682             raise ValueError("unrecognized mode '%s'" % mode)
   1683 
-> 1684     params = _SymmetricArpackParams(n, k, A.dtype.char, matvec, mode,
   1685                                     M_matvec, Minv_matvec, sigma,
   1686                                     ncv, v0, maxiter, which, tol)

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py in __init__(self, n, k, tp, matvec, mode, M_matvec, Minv_matvec, sigma, ncv, v0, maxiter, which, tol)
    510             raise ValueError("k must be less than ndim(A), k=%d" % k)
    511 
--> 512         _ArpackParams.__init__(self, n, k, tp, mode, sigma,
    513                                ncv, v0, maxiter, which, tol)
    514 

/rds/user/jw983/hpc-work/bertcpu_env/lib/python3.9/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py in __init__(self, n, k, tp, mode, sigma, ncv, v0, maxiter, which, tol)
    340         ncv = min(ncv, n)
    341 
--> 342         self.v = np.zeros((n, ncv), tp)  # holds Ritz vectors
    343         self.iparam = np.zeros(11, arpack_int)
    344 

MemoryError: Unable to allocate 294. GiB for an array with shape (11580087, 3402) and data type float64

ginward avatar Sep 26 '21 15:09 ginward

@MaartenGr What I don't understand is why the dimensionality of the UMAP matrix is (11580087, 3402). I understand that 11580087 is the number of documents, but shouldn't the size of the embedding be 384? How come it is 3402?

ginward avatar Sep 27 '21 02:09 ginward