
Getting OOM error when using evaluate script.

Open deep-unlearn opened this issue 6 years ago • 7 comments

Hello,

I am getting an Out of Memory (OOM) error when using the evaluate script. I have included all the details in my Stack Overflow question here.

Any help will be appreciated.

deep-unlearn avatar Jun 25 '18 15:06 deep-unlearn

Hi @deep-unlearn , thank you for your interest in the repo.

I'm not sure what caused the out-of-memory error. Did you try running the evaluate script with the GPU disabled? In theory, the evaluate script should run without a GPU, though it takes longer. Also, I noticed you are using Python 2.7, which I never tested. I only tested the repo with Python 3, and this difference might cause the out-of-memory error.
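As a sketch of what "GPU disabled" means in practice (assuming the standard TensorFlow environment-variable approach, not anything specific to this repo):

```python
import os

# Hide all GPUs from TensorFlow. This must be set BEFORE
# `import tensorflow` is executed anywhere in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf  # subsequent ops will now be placed on the CPU
```

Equivalently, the variable can be set on the command line, e.g. `CUDA_VISIBLE_DEVICES="" python evaluate.py`.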

I hope this helps.

rishizek avatar Jun 26 '18 15:06 rishizek

Hello, indeed I tried with the GPU disabled (CPU mode); however, the inference result is wrong. I guess this is because training was done on a GPU, so the model cannot work correctly on a CPU. Probably training and inference have to be done in the same mode. Training on CPU is not an option!

I have also tried Python 3.6, but the same error occurs. Do you think it may be due to a memory leak somehow? When I use a single instance (1 label and 1 image), the code works fine and the result is correct. Obviously the problem is somehow related to larger input data.

I can help you fix/improve the code, but I do not know where to start. Any suggestions?

deep-unlearn avatar Jun 26 '18 15:06 deep-unlearn

Hi @deep-unlearn ,

> Hello, indeed I tried with the GPU disabled (CPU mode); however, the inference result is wrong. I guess this is because training was done on a GPU, so the model cannot work correctly on a CPU. Probably training and inference have to be done in the same mode. Training on CPU is not an option!

That's strange, because I can run the inference script correctly without a GPU, even though the model was trained on a GPU. Training and inference do not have to be done in the same mode. I'm curious what kind of error occurred when you ran inference on the CPU.

Also which OS and TensorFlow version did you use to run the code?

rishizek avatar Jun 26 '18 16:06 rishizek

Hello,

OK, interesting to know that inference can run on the CPU as well (not for me, though). In CPU mode I'm not getting any error from the system, but the outcome is wrong: all classes are predicted as class zero. When I try the same script on a single image on the GPU (which does not produce an OOM error), the outcome is perfect.

My system runs Ubuntu 17.10. I have multiple versions of TensorFlow through virtualenvs.

Tested on TensorFlow 1.8 and 1.6 (Python 3); both cases give the same error. I have CUDA 9.0 and cuDNN 6 installed.

Which cuDNN version do you have installed? By the way, I am testing the system with output_stride=8, which is more computationally intensive.
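For intuition on why output_stride=8 is heavier (a back-of-the-envelope sketch, not code from the repo): halving the output stride doubles both spatial dimensions of the encoder feature map, so activation memory grows roughly 4x.

```python
def encoder_output_pixels(height, width, output_stride):
    # Spatial size of the encoder feature map before bilinear upsampling
    return (height // output_stride) * (width // output_stride)

# A 512x512 input at output_stride 8 vs 16:
ratio = encoder_output_pixels(512, 512, 8) / encoder_output_pixels(512, 512, 16)
print(ratio)  # 4.0: stride 8 holds 4x the activations of stride 16
```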

deep-unlearn avatar Jun 27 '18 10:06 deep-unlearn

Hi @deep-unlearn ,

I tested TensorFlow 1.5, 1.6, 1.7, and 1.8 with Ubuntu 16.04. Regarding CUDA and cuDNN, I can confirm that the model works with CUDA 9.0 and 9.1, and cuDNN 7 and 7.1. Maybe the older cuDNN 6 is causing the problem. I usually test the model with output_stride=16, but output_stride=8 should work, though it is more computationally intensive.

rishizek avatar Jun 27 '18 14:06 rishizek

Hello @rishizek

Thank you for your detailed help. I eventually found the problem: OOM occurs when I feed a large image (~5000x5000 pixels) to the model. I will try to catch the error and tile the image so I can re-ingest it in smaller patches.
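A minimal sketch of the tiling idea (assuming a hypothetical `predict_fn` that maps an image array to a per-pixel label map; this is not code from the repo):

```python
import numpy as np

def predict_in_tiles(image, predict_fn, tile_size=512):
    """Run predict_fn over non-overlapping tiles of a large image
    and stitch the per-tile label maps back together.

    image: (H, W, C) array; predict_fn: (h, w, C) -> (h, w) int labels.
    """
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.int64)
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            labels[y:y + tile.shape[0], x:x + tile.shape[1]] = predict_fn(tile)
    return labels
```

One caveat with this approach: because the network sees less context at tile borders, predictions there can show seam artifacts; a common remedy is to predict overlapping tiles and keep only the central region of each.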

I will keep you informed; this may be helpful for you or other users.

deep-unlearn avatar Jun 28 '18 12:06 deep-unlearn

Hi @deep-unlearn ,

I see. That makes sense. Thank you for letting me know that!

rishizek avatar Jun 28 '18 13:06 rishizek