tomotopy
Ability to stream corpus data to LDAModel (or any other model)
Tomotopy currently loads all documents into memory before training, and only then trains on them.
However, I have a very large corpus (about 750,000 documents), and even when I train on only a portion of it I am heavily RAM-limited. Loading just 20,000 documents is enough for my script to take up 20GB of RAM.
Gensim can stream a corpus from an iterable of documents, which makes it more scalable in terms of RAM. Would it be possible to give Tomotopy a similar capability, so that one could train on a larger dataset?
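To illustrate what I mean by a streamed corpus, here is a minimal sketch in plain Python. It is not Tomotopy's API or Gensim's. `StreamingCorpus` and `make_demo_file` are hypothetical names I made up, and the one-document-per-line file layout and whitespace tokenization are assumptions. The point is only that documents are read from disk one at a time on each pass, so memory use does not grow with corpus size.

```python
import os
import tempfile


class StreamingCorpus:
    """Hypothetical re-iterable corpus: one document per line, read lazily."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated
        # repeatedly (training usually needs more than one pass).
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                tokens = line.lower().split()  # assumed: whitespace tokens
                if tokens:  # skip blank lines
                    yield tokens


def make_demo_file():
    """Write a tiny demo corpus to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write("Topic models scale poorly in RAM\n")
        fh.write("\n")
        fh.write("Streaming avoids loading everything\n")
    return path
```

A model that accepted such an iterable could consume it with a plain `for doc in StreamingCorpus(path): ...` loop, holding only one document in memory at a time.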