
add_documents: The scale between training set and test set

Frederikmh90 opened this issue 3 years ago • 1 comment

Thanks for a wonderful package for both topic modeling and clustering tasks. A huge step up from earlier packages. I'm currently trying to apply a model trained on 1,000,000 social media posts to a larger dataset of approximately 5,000,000 posts using the add_documents() function. I trained the model with the 'universal-sentence-encoder-multilingual-large/3' embeddings, as my dataset is multilingual. Under 'add_documents' in the documentation, I've come across the following line:

"If adding a large quantity of documents relative to the current model size, or documents containing a largely new vocabulary, a new model should be trained for best results." https://top2vec.readthedocs.io/_/downloads/en/stable/pdf/

I'm curious about the background for this reservation, and whether it is possible to calculate when the ratio between the training set and the full dataset becomes too large?

Frederikmh90 avatar Apr 06 '22 09:04 Frederikmh90

If the subset of your data used for training is a random sample and it is greater than ~10% of the full dataset, you should have no problem. The issue arises when the topics are formed on documents that are not representative of the documents being added later on.

ddangelov avatar Apr 11 '22 15:04 ddangelov
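
The advice above can be sketched as follows. This is a minimal illustration using only Python's standard library; the Top2Vec calls are shown as hedged comments (the constructor and `add_documents` names follow the package's documented API, but the `train_fraction` helper and the 10% threshold check are illustrative, based on the maintainer's heuristic in this thread):

```python
import random

def split_for_training(documents, train_fraction=0.2, seed=42):
    """Draw a random sample to train on; the rest is added later.

    Per the maintainer's heuristic, train_fraction should be >= ~0.10
    of the full corpus so the topics formed on the sample remain
    representative of the documents added afterwards.
    """
    if train_fraction < 0.10:
        raise ValueError(
            "Training sample is below ~10% of the corpus; "
            "consider training a new model on the full data instead."
        )
    rng = random.Random(seed)
    train = rng.sample(documents, k=int(len(documents) * train_fraction))
    train_set = set(train)  # assumes documents are hashable and unique
    rest = [d for d in documents if d not in train_set]
    return train, rest

# Hypothetical usage with Top2Vec (not executed here):
# from top2vec import Top2Vec
# train, rest = split_for_training(all_posts, train_fraction=0.2)
# model = Top2Vec(
#     train,
#     embedding_model='universal-sentence-encoder-multilingual-large',
# )
# model.add_documents(rest)
```

In the questioner's case (1M of 5M posts, i.e. a 20% random sample), the heuristic is satisfied, provided the 1M training posts were drawn at random rather than, say, by date or by language.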