Micro-Expression-with-Deep-Learning

Out of memory error with GTX 1080 Ti when training

Open mgmverburg opened this issue 6 years ago • 4 comments

I am running into an issue when trying to run the code in your repo.

I have downloaded the full source code and changed root_db_path to the directory containing my "CASME2_Optical" folder structure. Then I run:

python main.py --dB 'CASME2_Optical' --batch_size=1 --spatial_epochs=1 --temporal_epochs=1 --train_id='test20' --spatial_size=224 --flag='st'

This is exactly the same command as the example in your README, except that I reduced spatial_epochs and temporal_epochs (here to 1) to cut down the running time.

However, I always run into an OOM error at some point during the process. I have tried reducing the LSTM hidden units in "temporal_module" in models.py from 3000 to 300, which lets training get through a few more subjects before the OOM occurs. For example, with the LSTM reduced to 300 it was about to start training subject 10 when it ran out of memory, but with the default 3000 it doesn't even get past subject 3.
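
For clarity, this is the kind of change I mean; a minimal sketch assuming temporal_module builds a Keras Sequential model (the function signature, layer arguments, and defaults here are my illustration, not the repo's actual code):

```python
# Hypothetical sketch of temporal_module in models.py; only the
# lstm_units value is the point, everything else is illustrative.
from keras.models import Sequential
from keras.layers import LSTM, Dense

def temporal_module(data_dim, timesteps=10, classes=5, lstm_units=3000):
    # Cutting lstm_units from 3000 to 300 shrinks the LSTM's recurrent
    # weight matrices roughly 100x (they scale with units^2), which
    # delays the OOM but does not remove the per-subject memory growth.
    model = Sequential()
    model.add(LSTM(lstm_units, input_shape=(timesteps, data_dim)))
    model.add(Dense(classes, activation='softmax'))
    return model
```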

So I am wondering what hardware you might be using to run this, or what else I could be doing wrong?

Here is the full output from my command prompt when running with the defaults (i.e. LSTM still at 3000) and the above command: full_stack_trace.txt

Any help is appreciated!

mgmverburg · Sep 10 '18

I think I have found the issue. The models recreated on each subject iteration keep accumulating in memory, and simply deleting the variable at the end of the loop and running garbage collection does not actually free it. When I instantiate the model outside the for-loop instead, the OOM error does not occur (but then training probably doesn't work as intended). So, right before the garbage collection, I call clear_session from Keras, which works, though it means some other things, such as the Adam optimizer, also need to be initialized inside the for-loop. I'm not sure whether this is a correct solution or not.
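
Concretely, the pattern I ended up with looks roughly like this (a minimal runnable sketch with dummy data; build_model and every hyperparameter here are placeholders, not the actual train.py code):

```python
# Minimal self-contained sketch of the workaround described above,
# using dummy data; build_model and all hyperparameters are
# placeholders, not the repo's actual train.py code.
import gc
import numpy as np
from keras import backend as K
from keras import optimizers
from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_model():
    # Stand-in for the repo's temporal model (hypothetical).
    model = Sequential()
    model.add(LSTM(300, input_shape=(10, 512)))
    model.add(Dense(5, activation='softmax'))
    return model

for subject in range(26):  # leave-one-subject-out over CASME II subjects
    # Re-create the optimizer each iteration: K.clear_session() below
    # destroys the graph the previous optimizer was attached to.
    adam = optimizers.Adam(lr=0.00001, decay=0.000001)
    model = build_model()
    model.compile(optimizer=adam, loss='categorical_crossentropy')

    x = np.random.rand(4, 10, 512)              # dummy input sequences
    y = np.eye(5)[np.random.randint(0, 5, 4)]   # dummy one-hot labels
    model.fit(x, y, epochs=1, verbose=0)

    del model
    K.clear_session()  # drop the whole TF graph so old models release GPU memory
    gc.collect()       # then let Python reclaim the wrapper objects
```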

mgmverburg · Sep 10 '18

@mgmverburg Hi, could you show how you modified the code?

13293824182 · Sep 19 '18

@13293824182 @mgmverburg I've solved the problem! I added "K.clear_session()" before "gc.collect()" (around line 520 in train.py) and moved "adam = optimizers.Adam(lr = 0.00001, decay = 0.000001)" (around line 190 in train.py) into the for loop, and then it works! Hope it helps!

SunBoWei95 · Nov 27 '18

@ILoveXuXin Thanks for the effort and kind sharing. Hope it helps the rest. :)

IcedDoggie · Nov 28 '18