
Using a pre-trained doc2vec model

Open · SjoerdBraaksma opened this issue 2 years ago · 1 comment

Hello!

I have a text mining use case with one overarching document set consisting of many smaller subsets of documents. I want to train a topic model for each smaller subset, but these sometimes don't contain enough documents on their own. Besides, I would rather use the knowledge of the entire document set to build a topic model for each subset.

Training an SBERT encoder on this dataset works, but it does not provide a tangible improvement over the standard doc2vec option and takes a very long time.

So I was wondering: is there a way to train a doc2vec model on the entire document set and then use it in Top2Vec to build a topic model for each subset of documents, instead of building a doc2vec model from scratch each time? And maybe there are other options I am not aware of?

SjoerdBraaksma avatar Feb 17 '23 09:02 SjoerdBraaksma

You can train a doc2vec model on your whole dataset and then pass it as a callable to embedding_model.

ddangelov avatar Mar 14 '23 22:03 ddangelov
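
To make the suggestion concrete, here is a minimal sketch (an editor's illustration, not code from the thread). It assumes gensim is used for the doc2vec training and relies on Top2Vec accepting a callable that maps a list of documents to an array of embeddings; the names all_documents, subset, and embed are placeholders.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
from top2vec import Top2Vec

all_documents = ["..."]  # placeholder: the full overarching document set (strings)
subset = ["..."]         # placeholder: one of the smaller subsets

# Train doc2vec once, on the entire document set.
tagged = [TaggedDocument(simple_preprocess(doc), [i])
          for i, doc in enumerate(all_documents)]
doc2vec = Doc2Vec(tagged, vector_size=300, min_count=2, epochs=40)

# Callable that maps a list of documents to a numpy array of embeddings,
# inferred from the shared pre-trained doc2vec model.
def embed(documents):
    return np.vstack([doc2vec.infer_vector(simple_preprocess(doc))
                      for doc in documents])

# Build a topic model per subset, reusing the shared embeddings instead of
# training a new doc2vec model from scratch each time.
model = Top2Vec(subset, embedding_model=embed)
```

Each subset would then get its own Top2Vec call with the same embed callable, so the embedding knowledge from the whole corpus is shared while the topic clustering stays per-subset.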