
add_documents: The scale between training set and test set

Frederikmh90 opened this issue 3 years ago • 1 comment

Thanks for a wonderful package for both topic modeling and clustering tasks. A huge step up from earlier packages. I'm currently trying to apply a model trained on 1,000,000 social media posts to a larger dataset of approximately 5,000,000 posts using the add_documents() function. I trained the model with the 'universal-sentence-encoder-multilingual-large/3' embeddings, as my dataset is multilingual. Under 'add_documents' in the documentation, I've come across the following line:

"If adding a large quantity of documents relative to the current model size, or documents containing a largely new vocabulary, a new model should be trained for best results." https://top2vec.readthedocs.io/_/downloads/en/stable/pdf/

I'm curious about the background for this reservation, and whether it is possible to calculate when the ratio between the training set and the full dataset becomes too large?

Frederikmh90 avatar Apr 06 '22 09:04 Frederikmh90

If the subset of your data used for training is a random sample and it is greater than ~10% of the full dataset, you should have no problem. The issue arises when the topics are formed on documents that are not representative of the documents being added later on.

ddangelov avatar Apr 11 '22 15:04 ddangelov
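
The advice above can be sketched as follows. This is a minimal illustration using only Python's standard library; the Top2Vec calls are shown as hedged comments (the constructor and `add_documents` names follow the package's documented API, but the `train_fraction` helper and the 10% threshold check are illustrative, based on the maintainer's heuristic in this thread):

```python
import random

def split_for_training(documents, train_fraction=0.2, seed=42):
    """Draw a random sample to train on; the rest is added later.

    Per the maintainer's heuristic, train_fraction should be >= ~0.10
    of the full corpus so the topics formed on the sample remain
    representative of the documents added afterwards.
    """
    if train_fraction < 0.10:
        raise ValueError(
            "Training sample is below ~10% of the corpus; "
            "consider training a new model on the full data instead."
        )
    rng = random.Random(seed)
    train = rng.sample(documents, k=int(len(documents) * train_fraction))
    train_set = set(train)  # assumes documents are hashable and unique
    rest = [d for d in documents if d not in train_set]
    return train, rest

# Hypothetical usage with Top2Vec (not executed here):
# from top2vec import Top2Vec
# train, rest = split_for_training(all_posts, train_fraction=0.2)
# model = Top2Vec(
#     train,
#     embedding_model='universal-sentence-encoder-multilingual-large',
# )
# model.add_documents(rest)
```

In the questioner's case (1M of 5M posts, i.e. a 20% random sample), the heuristic is satisfied, provided the 1M training posts were drawn at random rather than, say, by date or by language.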