tensorflow-deeplab-v3-plus
Getting OOM error when using evaluate script.
Hello,
I am getting an out-of-memory (OOM) error when using the evaluate script. I have included all the details in my Stack Overflow question here.
Any help would be appreciated.
Hi @deep-unlearn, thank you for your interest in the repo.
I'm not sure what caused the out-of-memory error. Did you try running the evaluate script with the GPU disabled? In theory, the evaluate script should run without a GPU, though it takes longer. Also, I noticed you are using Python 2.7, which I never tested with. I only tested the repo with Python 3, and this difference might cause the out-of-memory error.
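For reference, one way to force CPU execution is to hide the GPUs via the CUDA_VISIBLE_DEVICES environment variable before TensorFlow is imported. A minimal sketch (the session demo is just an illustration, not the repo's evaluate script):

```python
import os

# Hide all GPUs from TensorFlow; this must be set before `import tensorflow`.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

# Any session created now places all ops on the CPU.
with tf.Session() as sess:
    print(sess.run(tf.constant("running on CPU")))
```

Equivalently, prefixing the command line with CUDA_VISIBLE_DEVICES="" when launching the evaluate script achieves the same without editing any code.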
I hope this helps.
Hello. Indeed, I tried with the GPU disabled (CPU mode); however, the inference result is wrong. I guess this is because training was done on the GPU, so the model cannot work correctly on the CPU. Training and inference probably have to be done in the same mode. Training on CPU is not an option!
I have also tried Python 3.6, but the same error occurs. Do you think it may be due to a memory leak somehow? When I use a single instance (1 label and 1 image) the code works fine and the result is correct. Evidently the problem has something to do with larger input data.
I can help you fix/improve the code, but I do not know where to start. Any suggestions?
Hi @deep-unlearn,
> Hello. Indeed, I tried with the GPU disabled (CPU mode); however, the inference result is wrong. I guess this is because training was done on the GPU, so the model cannot work correctly on the CPU. Training and inference probably have to be done in the same mode. Training on CPU is not an option!
That's strange, because I can run the inference script correctly without a GPU, even though the model was trained on a GPU. Training and inference do not have to be done in the same mode. I'm curious what kind of error occurred when you ran inference in CPU mode.
Also, which OS and TensorFlow version did you use to run the code?
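If the OOM turns out to be GPU memory pressure rather than a bug, one thing worth trying is to stop TensorFlow from grabbing the whole GPU up front. A minimal sketch, assuming the TF 1.x session API (if the script builds a `tf.estimator`, the same `ConfigProto` can be passed via `tf.estimator.RunConfig(session_config=config)`):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Alternatively, cap the fraction of GPU memory this process may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    pass  # build and run the evaluation graph here
```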
Hello,
OK, it is interesting to know that inference can run on the CPU as well (not for me, though). In CPU mode I'm not getting any error from the system, but the outcome is wrong: all classes are predicted as class zero. When I try the same script on a single image on the GPU (this does not produce an OOM error), the outcome is perfect.
My system runs Ubuntu 17.10, and I have multiple versions of TensorFlow through virtualenvs.
Tested on TensorFlow 1.8 and 1.6 (Python 3); in both cases the same error occurs. I have CUDA 9.0 and cuDNN 6 installed.
Which cuDNN version do you have installed? By the way, I am testing the system with output_stride=8, which is more computationally intensive.
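For reference, a quick way to confirm which TensorFlow build and devices a script actually sees is the sketch below (TF 1.x API; note it does not report the cuDNN version directly):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())

# Lists the CPU/GPU devices visible to TensorFlow, with their memory limits.
for device in device_lib.list_local_devices():
    print(device.device_type, device.name, device.memory_limit)
```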
Hi @deep-unlearn,
I tested TensorFlow 1.5, 1.6, 1.7, and 1.8 with Ubuntu 16.04. Regarding CUDA and cuDNN, I can confirm that the model works with CUDA 9.0 and 9.1, and with cuDNN 7 and 7.1. Maybe the older cuDNN 6 is causing the problem. I usually test the model with output_stride=16, but output_stride=8 should work, though it is computationally intensive.
Hello @rishizek
Thank you for your detailed help. I found the problem eventually: the OOM occurs when I provide a large image to the model (~5000x5000 pixels). I will try to catch the error and tile the image so I can feed it back in smaller patches.
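A rough sketch of that tiling idea (`predict_fn` is a placeholder for whatever runs the model on a single crop, not the repo's API):

```python
import numpy as np

def predict_in_tiles(image, predict_fn, tile=512):
    """Run `predict_fn` on fixed-size tiles and stitch the label maps back together.

    `image` is an (H, W, C) array; `predict_fn` takes a crop and returns
    an (h, w) integer label map of the same spatial size.
    """
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.int32)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            labels[y:y + tile, x:x + tile] = predict_fn(patch)
    return labels
```

In practice the tiles should overlap a little and the overlapping predictions be blended, since objects cut at tile borders otherwise produce visible seams.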
I'll keep you informed on this; it may be helpful for you or other users.
Hi @deep-unlearn,
I see. That makes sense. Thank you for letting me know that!