Kashgari
How to train on limited RAM when I have >10 million documents
Environment
Ubuntu 16.04, Python 3.7
Question
I have a p2.xlarge machine with 60 GB of RAM and an 11 GB GPU. When I take ~0.8 million documents for the NER labelling task, >55 GB of RAM is consumed ... but I am able to train.
Now I want to train on all >10 million documents ... how can I do that with the limited memory available? I am going to try 0.8 million documents for 4 epochs, then save the model, load it again with the next 0.8 million documents for another 4 epochs ... and so on. Will it help?
I tried the above method for 2-3 sets, but accuracy does not improve. Is there any option for lazy loading or something else ... let me know. Thanks.
Additionally, what is the difference between fit and fit_without_generator? Will either of them help me train the way I explained above?
This is a really cool use case. I am happy to help out. Actually, we can train on tons of data with limited RAM, but we need to make some changes.
Let's start with fit and fit_without_generator. fit is equivalent to Keras's fit_generator, a lazy-loading function that can handle lots of data with limited RAM. fit_without_generator is equivalent to Keras's fit; it is slightly faster than fit_generator, but costs more RAM.
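The lazy-loading idea behind a generator-backed fit can be sketched in plain Python (this is illustrative, not Kashgari's internal code): batches are produced on demand, so only one batch of samples lives in memory at a time, no matter how large the corpus is.

```python
def batch_generator(samples, batch_size):
    """Yield successive batches from any iterable without
    materialising the whole dataset in memory."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def read_corpus():
    # Stand-in for streaming documents from a large file on disk.
    for i in range(10):
        yield f"document {i}"

# Only `batch_size` documents are ever held in RAM at once.
batches = list(batch_generator(read_corpus(), batch_size=4))
```

With 10 documents and a batch size of 4, this yields three batches of sizes 4, 4, and 2; a generator-backed fit consumes them one at a time instead of loading everything up front.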
But why does it still need 55 GB of RAM with 0.8 million documents? It is because you need to load all of the original data into RAM so that we can build the token and label dicts and build the model structure. So if we can optimize this part, you can handle all of your data easily.
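One way to optimize that part is to build the token and label dicts in a single streaming pass, so nothing beyond the counters themselves stays in RAM. This is a sketch of the idea, not Kashgari's implementation; the reserved `<PAD>`/`<UNK>` entries are an assumed convention.

```python
from collections import Counter

def build_dicts(sentence_stream):
    """Build token and label dicts from an iterable of
    (tokens, labels) pairs without keeping the corpus in memory."""
    token_counts, label_set = Counter(), set()
    for tokens, labels in sentence_stream:
        token_counts.update(tokens)
        label_set.update(labels)
    # Reserve ids for padding and out-of-vocabulary tokens (assumed
    # convention), then index tokens by descending frequency.
    token2idx = {"<PAD>": 0, "<UNK>": 1}
    for token, _count in token_counts.most_common():
        token2idx[token] = len(token2idx)
    label2idx = {label: i for i, label in enumerate(sorted(label_set))}
    return token2idx, label2idx

corpus = [(["John", "lives", "here"], ["B-PER", "O", "O"]),
          (["Mary", "lives", "here"], ["B-PER", "O", "O"])]
token2idx, label2idx = build_dicts(iter(corpus))
```

Because the stream is consumed lazily, the same function works whether `corpus` is a small list or a generator reading 10 million documents off disk.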
Let me try something and come back to you a little bit later. Then you can try it out.
Let's keep it simple first. Which embedding are you using for this task?
Cool, thanks for the reply. I am using BERT embeddings and a BiGRU model for the NER labelling task.
I have used ImageDataGenerator from Keras before, which reads only a few images into memory rather than everything. I wanted to check whether something like that is possible here. Also, I am a noob in TensorFlow and Keras, so I am not sure how to solve this use case.
Yea, we need to implement something similar to ImageDataGenerator. I will try to do that tomorrow and come back to you.
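An ImageDataGenerator-style reader for NER data could look roughly like this: a generator that streams (tokens, labels) batches from a JSON-lines file on disk and loops forever, the way Keras's fit_generator expects. The file format and function name here are assumptions for illustration.

```python
import itertools
import json
import os
import tempfile

def ner_sample_generator(path, batch_size):
    """Yield (token_batch, label_batch) tuples from a JSON-lines file.
    Loops over the file indefinitely; only one batch of lines is
    ever held in memory at a time."""
    while True:
        with open(path) as f:
            while True:
                chunk = list(itertools.islice(f, batch_size))
                if not chunk:
                    break  # end of file; restart the outer loop
                samples = [json.loads(line) for line in chunk]
                yield ([s["tokens"] for s in samples],
                       [s["labels"] for s in samples])

# Usage: write a tiny corpus to disk, then draw one batch.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl",
                                 delete=False) as tmp:
    for i in range(5):
        tmp.write(json.dumps({"tokens": ["doc", str(i)],
                              "labels": ["O", "O"]}) + "\n")

gen = ner_sample_generator(tmp.name, batch_size=2)
tokens, labels = next(gen)
os.unlink(tmp.name)
```

Such a generator could then be handed to a generator-backed fit, so the full 10-million-document corpus never needs to fit in RAM.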
Thanks, that would be a very cool solution. By the way, when I load a model and try to train it again on some other dataset, why does it not work? When we save a model, we save all of its state ... right?
Hi @BrikerMan ... were you able to do something? Let me know. I will also try something out in the meantime. Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Sorry, I have been very busy for the last several weeks. I will come back to you ASAP.
@allhelllooz could you prepare the token dict and label dict by yourself?
I should be able to do that. Can you send me the format for the token dict and label dict?
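The thread never pins down the exact format, but a common convention (assumed here, not confirmed as Kashgari's) is a plain string-to-integer mapping, with low ids reserved for special tokens, serialisable as JSON.

```python
import json

# Hypothetical token dict: token -> integer id, with reserved entries.
token2idx = {
    "<PAD>": 0,   # padding
    "<UNK>": 1,   # out-of-vocabulary token
    "John": 2,
    "lives": 3,
}

# Hypothetical label dict: NER tag -> integer id.
label2idx = {"O": 0, "B-PER": 1, "I-PER": 2}

# Both round-trip cleanly through JSON files.
restored = json.loads(json.dumps(token2idx))
```

If the library's actual format differs, the same information (a bijection between strings and ids) should still be easy to convert.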
@allhelllooz Sorry for the long delay. I have started the TF2 version, Kashgari v2, which is very RAM-friendly. I tested a classification task with a 10 GB corpus, and it used only 1 GB of RAM. Please try it out.