Kashgari
How to train on limited RAM when I have >10 million documents
Environment
Ubuntu 16.04, Python 3.7
Question
I have a p2.xlarge machine with 60 GB of RAM and an 11 GB GPU. When I take ~0.8 million documents for the NER labelling task, >55 GB of RAM is consumed ... but I am able to train.
Now I want to train on all >10 million documents ... how can I do that with the limited memory available? I am going to try 0.8 million documents for 4 epochs, then save the model, load it again with the next 0.8 million documents for another 4 epochs ... and so on. Will it help?
I tried the above method for 2-3 sets, but accuracy does not improve. Is there any option for lazy loading or something else ... let me know. Thanks.
Additionally, what is the difference between fit and fit_without_generator? Will either of them help me train the way I explained above?
This is a really cool use case. I am happy to help out. Actually, we can train on tons of data with limited RAM, but we need to make some changes.
Let's start with fit and fit_without_generator. fit is equivalent to Keras's fit_generator, a lazy-loading function that can handle lots of data with limited RAM. fit_without_generator is equivalent to Keras's fit; it is slightly faster than fit_generator, but costs more RAM.
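The lazy-loading idea behind a generator-backed fit can be sketched in plain Python (this is illustrative, not Kashgari's internal code): batches are produced on demand, so only one batch of samples lives in memory at a time, no matter how large the corpus is.

```python
def batch_generator(samples, batch_size):
    """Yield successive batches from any iterable without
    materialising the whole dataset in memory."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def read_corpus():
    # Stand-in for streaming documents from a large file on disk.
    for i in range(10):
        yield f"document {i}"

# Only `batch_size` documents are ever held in RAM at once.
batches = list(batch_generator(read_corpus(), batch_size=4))
```

With 10 documents and a batch size of 4, this yields three batches of sizes 4, 4, and 2; a generator-backed fit consumes them one at a time instead of loading everything up front.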
But why does it still need 55 GB of RAM with 0.8 million documents? It is because you need to load all of the original data into RAM so that we can build the token and label dicts and build the model structure. So if we can optimize this part, you can handle all of your data easily.
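One way to optimize that part is to build the token and label dicts in a single streaming pass, so nothing beyond the counters themselves stays in RAM. This is a sketch of the idea, not Kashgari's implementation; the reserved `<PAD>`/`<UNK>` entries are an assumed convention.

```python
from collections import Counter

def build_dicts(sentence_stream):
    """Build token and label dicts from an iterable of
    (tokens, labels) pairs without keeping the corpus in memory."""
    token_counts, label_set = Counter(), set()
    for tokens, labels in sentence_stream:
        token_counts.update(tokens)
        label_set.update(labels)
    # Reserve ids for padding and out-of-vocabulary tokens (assumed
    # convention), then index tokens by descending frequency.
    token2idx = {"<PAD>": 0, "<UNK>": 1}
    for token, _count in token_counts.most_common():
        token2idx[token] = len(token2idx)
    label2idx = {label: i for i, label in enumerate(sorted(label_set))}
    return token2idx, label2idx

corpus = [(["John", "lives", "here"], ["B-PER", "O", "O"]),
          (["Mary", "lives", "here"], ["B-PER", "O", "O"])]
token2idx, label2idx = build_dicts(iter(corpus))
```

Because the stream is consumed lazily, the same function works whether `corpus` is a small list or a generator reading 10 million documents off disk.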
Let me try something and come back to you a little bit later. Then you can try it out.
Let's keep it simple first. Which embedding are you using for this task?
Cool, thanks for the reply. I am using BERT embeddings and a BiGRU model for the NER labelling task.
I have used ImageDataGenerator from Keras before, which reads only a few images into memory rather than everything. I wanted to check whether something like that is possible here. Also, I am a noob in TensorFlow and Keras, so I am not sure how to solve this use case.
Yea, we need to implement something similar to ImageDataGenerator. I will try to do that tomorrow and come back to you.
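An ImageDataGenerator-style reader for NER data could look roughly like this: a generator that streams (tokens, labels) batches from a JSON-lines file on disk and loops forever, the way Keras's fit_generator expects. The file format and function name here are assumptions for illustration.

```python
import itertools
import json
import os
import tempfile

def ner_sample_generator(path, batch_size):
    """Yield (token_batch, label_batch) tuples from a JSON-lines file.
    Loops over the file indefinitely; only one batch of lines is
    ever held in memory at a time."""
    while True:
        with open(path) as f:
            while True:
                chunk = list(itertools.islice(f, batch_size))
                if not chunk:
                    break  # end of file; restart the outer loop
                samples = [json.loads(line) for line in chunk]
                yield ([s["tokens"] for s in samples],
                       [s["labels"] for s in samples])

# Usage: write a tiny corpus to disk, then draw one batch.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl",
                                 delete=False) as tmp:
    for i in range(5):
        tmp.write(json.dumps({"tokens": ["doc", str(i)],
                              "labels": ["O", "O"]}) + "\n")

gen = ner_sample_generator(tmp.name, batch_size=2)
tokens, labels = next(gen)
os.unlink(tmp.name)
```

Such a generator could then be handed to a generator-backed fit, so the full 10-million-document corpus never needs to fit in RAM.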
Thanks, that would be a very cool solution. By the way, when I load a model and try to train it again on some other dataset, why does it not work? When we save a model, we save all of its state ... right?
Hi @BrikerMan ... were you able to do something? Let me know. I will also try something out in the meantime. Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Sorry, I have been very busy for the last several weeks. I will come back to you ASAP.
@allhelllooz could you prepare the token dict and label dict by yourself?
I should be able to do that. Can you send me the format for the token dict and label dict?
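The thread never pins down the exact format, but a common convention (assumed here, not confirmed as Kashgari's) is a plain string-to-integer mapping, with low ids reserved for special tokens, serialisable as JSON.

```python
import json

# Hypothetical token dict: token -> integer id, with reserved entries.
token2idx = {
    "<PAD>": 0,   # padding
    "<UNK>": 1,   # out-of-vocabulary token
    "John": 2,
    "lives": 3,
}

# Hypothetical label dict: NER tag -> integer id.
label2idx = {"O": 0, "B-PER": 1, "I-PER": 2}

# Both round-trip cleanly through JSON files.
restored = json.loads(json.dumps(token2idx))
```

If the library's actual format differs, the same information (a bijection between strings and ids) should still be easy to convert.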
@allhelllooz Sorry for the long delay. I have started the TF2 version, Kashgari v2, which is very RAM-friendly. I tested a classification task with a 10 GB corpus, and it used only 1 GB of RAM. Please try it out.