
Ability to stream corpus data to LDAModel (or any other model)

Open jalustig opened this issue 2 years ago • 0 comments

Tomotopy currently loads all documents into memory before training, and only then trains on them.

However, I have a very large corpus (about 750,000 documents), and even if I want to train on only a portion of these documents, I am heavily RAM-limited. Loading just 20,000 documents already causes my script to take up 20GB of RAM.

Gensim has the ability to stream an iterable document corpus, which makes it more scalable in terms of RAM. Would it be possible to adjust Tomotopy so that it has a similar capability, allowing one to train on a larger dataset?
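
To illustrate what I mean by streaming, here is a minimal sketch of a re-iterable corpus that reads one document per line from disk, so only a single document is held in memory at a time. The names `StreamedCorpus` and `write_demo_corpus` are hypothetical and exist only for this example, and the one-document-per-line, whitespace-tokenized file layout is an assumption. Gensim itself expects bag-of-words vectors rather than raw token lists, and this is not an existing Tomotopy API, just the general pattern.

```python
import os
import tempfile


class StreamedCorpus:
    """Hypothetical re-iterable corpus: yields one tokenized document at a time.

    Assumes a plain-text file with one document per line and
    whitespace-separated tokens.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be
        # iterated repeatedly without keeping all documents in memory.
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens


def write_demo_corpus(lines):
    """Hypothetical helper: write a tiny demo corpus to a temp file."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines))
    return path


if __name__ == "__main__":
    path = write_demo_corpus(["topic models are fun", "", "streaming saves ram"])
    corpus = StreamedCorpus(path)
    for doc in corpus:
        print(doc)
    os.remove(path)
```

A model that accepted an object like this could make several passes over the data while only ever holding the current document's tokens in memory.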

jalustig · Feb 27 '22 16:02