Jinhua Wang

48 comments by Jinhua Wang

It seems that the memory issues occur not in the Sentence Embedding stage or the UMAP stage, but in the HDBSCAN stage. I currently have about 10 million short sentences. I...

@MaartenGr What if I reduce the UMAP output dimensionality to 2 (in the source code it is set to 5 by default)? Would that relieve some of the burden that HDBSCAN...
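For reference, this is roughly what I had in mind; a minimal sketch, assuming a custom UMAP model can be passed to BERTopic through its `umap_model` argument (the other parameters below simply mirror what I believe are the defaults):

```python
from umap import UMAP
from bertopic import BERTopic

# Only n_components is changed from the default of 5 to 2.
umap_model = UMAP(
    n_neighbors=15,
    n_components=2,
    min_dist=0.0,
    metric="cosine",
    low_memory=True,  # trade some speed for a smaller memory footprint
)

topic_model = BERTopic(umap_model=umap_model)
```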

@MaartenGr Thanks. What is the maximum number of sentences that the model can handle, in your experience?

@MaartenGr Is there also a way to separate the sentence embedding, UMAP, and HDBSCAN steps by saving the intermediate outputs? If the memory blows up at the last stage...
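Something along these lines is what I mean by separating the stages; a rough sketch, with the model name and file path as placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

docs = [...]  # placeholder for the ~10 million short sentences

# Stage 1: embed once and persist, so a failure later does not force a re-run.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, show_progress_bar=True)
np.save("embeddings.npy", embeddings)

# Stage 2: UMAP + HDBSCAN run inside fit_transform on the saved embeddings,
# so this part can be retried without repeating the embedding step.
embeddings = np.load("embeddings.npy")
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```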

> This is a difficult question to answer since it highly depends on your hardware specs. A free Google Colab session handles a couple of hundred thousand sentences without issues...

I am not sure if a single GPU card can use all the 96GB RAM available in the machine, as the other 48GB is in the other three GPU cards....

@MaartenGr If only the sentence-transformer part runs on the GPU, can I compute the embeddings first and then run the other parts on a machine with only CPU access?...
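In that case the GPU would only be needed for the first half; a sketch under the same placeholder assumptions as above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [...]  # same placeholder for the short sentences

# On the GPU machine: only the sentence-transformer uses CUDA.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
embeddings = embedder.encode(docs, batch_size=256, show_progress_bar=True)
np.save("embeddings.npy", embeddings)

# On the CPU-only machine: UMAP and HDBSCAN do not need a GPU, so loading
# embeddings.npy and calling fit_transform(docs, embeddings=...) as above
# should be enough.
```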

@MaartenGr It is 96GB of RAM and 16GB of VRAM. Apparently 96GB of RAM is not enough to process the 10 million sentences. I am using a customised dataset, but the code...

@MaartenGr Would setting `ngram_range=(1, 1)` help though? It might reduce the TF-IDF matrix size.
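For example, a minimal sketch, assuming the n-gram range can be controlled by passing a `CountVectorizer` through BERTopic's `vectorizer_model` argument:

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Restrict topic representations to unigrams to keep the count matrix small.
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```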

> You can try to embed the sentences beforehand by following [this](https://maartengr.github.io/BERTopic/tutorial/embeddings/embeddings.html#custom-embeddings) piece of documentation. After that, you can simply save the embeddings and load them in when necessary. There...