
16g gpu memory still got run out of memory error

Open pgr2015 opened this issue 6 years ago • 8 comments

Hi, I tried to run a command like this:

python -m keras_segmentation train \
 --checkpoints_path="path_to_checkpoints" \
 --train_images="dataset1/images_prepped_train/" \
 --train_annotations="dataset1/annotations_prepped_train/" \
 --val_images="dataset1/images_prepped_test/" \
 --val_annotations="dataset1/annotations_prepped_test/" \
 --n_classes=50 \
 --input_height=320 \
 --input_width=640 \
 --model_name="fcn_8_resnet50"

I tried several GPUs (a GTX 960M with 2 GB of memory, a Quadro P2000 with 4 GB, and a Tesla P100 with 16 GB), but all of them failed with an out-of-memory error.

I also tried reducing the image size and the batch size, but it did not help.
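(For reference, the batch size can also be passed through the library's Python API instead of the CLI. The sketch below assumes fcn_8_resnet50 lives in keras_segmentation.models.fcn and that train() accepts a batch_size argument, as in the repo versions I have seen; treat it as a sketch, not a confirmed fix.)

    # Sketch: train through the Python API with an explicitly small batch size,
    # to rule the batch size out as the cause of the OOM.
    from keras_segmentation.models.fcn import fcn_8_resnet50

    model = fcn_8_resnet50(n_classes=50, input_height=320, input_width=640)
    model.train(
        train_images="dataset1/images_prepped_train/",
        train_annotations="dataset1/annotations_prepped_train/",
        checkpoints_path="path_to_checkpoints",
        epochs=5,
        batch_size=1,  # smallest possible batch
    )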

Could anyone give me a hint about this issue?

pgr2015 avatar Jul 24 '19 10:07 pgr2015

I uninstalled tensorflow-gpu and reinstalled the plain (CPU-only) tensorflow package, and everything works properly.

I think some GPU settings may not be correct.
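(One GPU setting worth checking with TF 1.x is whether TensorFlow is allowed to grow its GPU allocation on demand instead of grabbing all the memory up front. A minimal sketch, assuming standalone Keras with the TF 1.x backend; TF 2.x uses tf.config.experimental.set_memory_growth instead:)

    # Minimal sketch: let TF 1.x allocate GPU memory on demand rather than all at once.
    # Run this before building the model.
    import tensorflow as tf
    import keras.backend as K

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # grow the allocation as needed
    K.set_session(tf.Session(config=config))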

pgr2015 avatar Jul 24 '19 12:07 pgr2015

I hit this problem on another project. I tried the same code with tensorflow-gpu 2.0 (beta) and with 1.14: with the 2.0 beta I got memory problems, while with 1.14 everything worked. I have no idea why, since the code was exactly the same. Giving up on the GPU speed-up is not the best idea, imo.

Mikelainz avatar Jul 24 '19 12:07 Mikelainz

@pgr2015 which tensorflow version are you using? In case you haven't already, please make sure all the GPU memory is free (check with nvidia-smi) before running the program.
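(In addition to nvidia-smi, a quick Python-side check that TensorFlow actually sees the GPU and how much memory it can use; this assumes the TF 1.x GPU package is installed:)

    # Sketch: list the devices TensorFlow sees and their memory limits (TF 1.x).
    from tensorflow.python.client import device_lib

    for dev in device_lib.list_local_devices():
        # memory_limit is in bytes; no GPU entry here means TF fell back to CPU
        print(dev.name, dev.device_type, dev.memory_limit)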

divamgupta avatar Jul 24 '19 18:07 divamgupta

@divamgupta Hi, sorry for the late reply. I am using tensorflow 1.11.0, and the GPU memory is free.

pgr2015 avatar Jul 26 '19 08:07 pgr2015

> @pgr2015 which tensorflow version are you using? In case you haven't already, please make sure all the GPU memory is free (check with nvidia-smi) before running the program.

Hi, which tensorflow version should be used? I am using tf 1.8.0 and the GPU memory is free, but the training process is slow.

genghuan2005 avatar Aug 20 '19 10:08 genghuan2005

Yeah, without the GPU it becomes too slow (I use 1.14, since 1.11.0 doesn't work), so I wonder which TF version you use, @divamgupta. Thanks a lot.

SmaleZ avatar May 11 '20 19:05 SmaleZ

I also tried segnet_resnet50 and pspnet_resnet50, and they work well, but FCN_8_resnet and FCN_32_resnet both fail. The error message is as follows:

2020-05-11 22:30:04.095153: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_ops.cc:880 : Resource exhausted: OOM when allocating tensor with shape[4096,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "test_fcn.py", line 19, in <module>
    checkpoints_path = "./weight/fcn_8_resnet50" , epochs=2
  File "/home/zzhang/zhege/image-segmentation-keras/keras_segmentation/train.py", line 157, in train
    epochs=epochs, callbacks=callbacks)
  File "/home/zzhang/anaconda3/envs/keras_segmentation_1/lib/python3.6/site-packages/Keras-2.3.1-py3.6.egg/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/zzhang/anaconda3/envs/keras_segmentation_1/lib/python3.6/site-packages/Keras-2.3.1-py3.6.egg/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/zzhang/anaconda3/envs/keras_segmentation_1/lib/python3.6/site-packages/Keras-2.3.1-py3.6.egg/keras/engine/training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "/home/zzhang/anaconda3/envs/keras_segmentation_1/lib/python3.6/site-packages/Keras-2.3.1-py3.6.egg/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/zzhang/anaconda3/envs/keras_segmentation_1/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "/home/zzhang/anaconda3/envs/keras_segmentation_1/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[4096,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node conv2d_1/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  [[Mean/_3405]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[4096,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node conv2d_1/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
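(Side note: the tensor in that message, shape [4096, 2048, 7, 7], appears to be the weights of a 7x7 convolution with 2048 input channels and 4096 filters in the FCN head, so it is enormous on its own. A back-of-the-envelope sketch, assuming float32:)

    # Illustrative arithmetic only: size of the tensor named in the OOM message,
    # assuming 4-byte float32. Training also keeps gradients and optimizer state
    # for the same weights, so the real footprint is a multiple of this.
    n_elements = 4096 * 2048 * 7 * 7          # ~411 million elements
    size_gib = n_elements * 4 / 1024 ** 3     # bytes -> GiB
    print("%.2f GiB for a single copy of this tensor" % size_gib)  # ~1.53 GiB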

SmaleZ avatar May 11 '20 20:05 SmaleZ

Same issue even with pspnet_50

HoseinHashemi avatar Mar 21 '24 04:03 HoseinHashemi