Resource exhausted: OOM when allocating tensor with shape[32,960,10,10]

Open sainisanjay opened this issue 4 years ago • 12 comments

I got this error during object detection training with the ssd_mobilenet_v2_quantized_300x300_coco model. I am running the command below to start training:

python ../../models/research/object_detection/model_main.py --pipeline_config_path=./ssd_mobilenet_v2_quantized_300x300_coco.config --model_dir=./training/ --num_train_steps=2000000 --sample_1_of_n_eval_examples=1 --alsologtostderr

Training was going fine until step 47900, after which I got this error:

File "/home/saini/.virtualenvs/cv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[32,960,10,10] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node gradients/AddN_162-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Loss/Cast_232/_16919]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[32,960,10,10] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node gradients/AddN_162-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
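
The hint in the log refers to TF1's RunOptions. model_main.py drives training through an Estimator, where this would have to be wired in via a session hook, so the sketch below is only meant to show what the hint means in a plain TF 1.x session, using a tiny stand-in graph:

import tensorflow as tf  # TF 1.x API

# Ask the allocator to report which tensors hold memory if an OOM occurs,
# as suggested by the hint in the error message above.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Tiny stand-in graph; the only point is where options= is passed.
x = tf.random.normal([32, 960, 10, 10])
y = tf.reduce_sum(x)

with tf.Session() as sess:
    print(sess.run(y, options=run_options))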

OS: Ubuntu 18.04
TensorFlow: 1.14.0 (GPU)
CUDA: 10.0
cuDNN: 7.6
Batch size: 32

Following are the changes I have made to the default TF Object Detection API:

model_lib.py
tf.estimator.EvalSpec(
            name=eval_spec_name,
            input_fn=eval_input_fn,
            steps=None,
            throttle_secs = 172800,
            exporters=exporter))
eval.proto
optional uint32 eval_interval_secs = 3 [default = 172800];  // original default = 600
model_main.py
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_steps=5000)

sainisanjay avatar May 11 '20 04:05 sainisanjay

Note: the error occurred after 47900 steps. My question is: why does the error appear after 47900 steps and not at the initial steps?

sainisanjay avatar May 11 '20 05:05 sainisanjay

OOM means out of memory. Maybe at the initial steps the memory is mostly free, but as the steps go on it uses more memory. I think that's why it happens.

VismayTandel avatar May 11 '20 05:05 VismayTandel

@VismayTandel Yes, you are right: OOM means out of memory. But the image size and batch size are the same throughout training, so how can later steps need more memory? Furthermore, I am not running any other program that uses GPU memory. That's why it is quite strange to me that the GPU runs out of memory in the middle of training.

sainisanjay avatar May 11 '20 05:05 sainisanjay

I changed the batch size to 16, and then everything worked fine.
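
For the Object Detection API, the training batch size comes from train_config.batch_size in the pipeline .config. A minimal sketch of changing it programmatically with the API's config_util helpers, assuming the config file name from the original command (the output directory is a placeholder, adjust as needed):

from object_detection.utils import config_util

# Load the pipeline config, shrink the training batch size, and write it back
# as pipeline.config in the given directory.
configs = config_util.get_configs_from_pipeline_file(
    'ssd_mobilenet_v2_quantized_300x300_coco.config')
configs['train_config'].batch_size = 16
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, './')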

alvianihza avatar Jun 25 '20 08:06 alvianihza

Same problem here; I tried a smaller batch_size and/or learning_rate but it still doesn't work. It gives me the same error.

They should provide a small example dataset with a config file for each official model so that we could at least tell it's not going anywhere before waiting 3 to 4 hours.

chudur-budur avatar Mar 29 '21 02:03 chudur-budur

I am getting the same error:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_generator_v1.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing)
    591         shuffle=shuffle,
    592         initial_epoch=initial_epoch,
--> 593         steps_name='steps_per_epoch')
    594 
    595   def evaluate(self,

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_generator_v1.py in model_iteration(model, data, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch, mode, batch_size, steps_name, **kwargs)
    257 
    258       is_deferred = not model._is_compiled
--> 259       batch_outs = batch_function(*batch_data)
    260       if not isinstance(batch_outs, list):
    261         batch_outs = [batch_outs]

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_v1.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
   1086       self._update_sample_weight_modes(sample_weights=sample_weights)
   1087       self._make_train_function()
-> 1088       outputs = self.train_function(ins)  # pylint: disable=not-callable
   1089 
   1090     if reset_metrics:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3955 
   3956     fetched = self._callable_fn(*array_vals,
-> 3957                                 run_metadata=self.run_metadata)
   3958     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3959     output_structure = nest.pack_sequence_as(

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1480         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1481                                                self._handle, args,
-> 1482                                                run_metadata_ptr)
   1483         if run_metadata:
   1484           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: OOM when allocating tensor with shape[32768,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training_46/Adam/Adam/update_dense_22/kernel/ResourceApplyAdam}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Does anyone have any idea? How can later steps of training consume more memory?
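
For scale, the tensor named in that error is easy to size up. This is a rough back-of-envelope for float32 weights and Adam's two slot variables, not a full accounting of activations:

# Size of the dense_22 kernel from the error above: shape [32768, 512], float32.
params = 32768 * 512                  # 16,777,216 elements
mb_per_copy = params * 4 / 1024 ** 2  # 4 bytes each, so 64 MB per copy

# Adam keeps two slot variables (m and v) per parameter, so the weights,
# the gradient, m and v together need roughly four copies of this layer alone.
print(f"{mb_per_copy:.0f} MB per copy, ~{4 * mb_per_copy:.0f} MB including Adam state")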

ya332 avatar Apr 08 '21 04:04 ya332

Thanks, I changed the batch size to 16 and my problem was solved.

jafarMajidpour avatar Jun 11 '21 15:06 jafarMajidpour

I have the same error messages (I am running TensorFlow 2.4.1):

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,64,224,896] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node model_1/batch_normalization_8/FusedBatchNormV3 (defined at /opt/conda/lib/python3.7/site-packages/mlrun/frameworks/keras/mlrun_interface.py:123) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. [Op:__inference_train_function_4513]

xsqian avatar Oct 21 '21 04:10 xsqian

I had the same issue. I reduced the batch size and tried again, and it was the same. I tried setting allow_growth, still the same. After rebooting the system the problem went away, but killing the training partway through and starting it again made the issue come back.
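
For anyone else trying allow_growth, the usual settings look roughly like this (TF1 ConfigProto versus the TF2 per-device flag, which needs TF 2.1 or later); note that it only stops TensorFlow from reserving all GPU memory up front, it does not raise the limit:

import tensorflow as tf

if tf.__version__.startswith('1.'):
    # TF1: grow GPU memory on demand instead of reserving it all at startup.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)
else:
    # TF2 equivalent, applied before any op touches the GPU.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)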

KaviyaSubramanian706 avatar Dec 08 '21 07:12 KaviyaSubramanian706

I was trying to train a Vision Transformer on the CIFAR-100 dataset.

GPU: GTX 1650 w/ 4GB vRAM

Initially, I had the batch_size set to 256, which was totally insane for such a GPU, and I was getting the same OOM error.

I reduced it to batch_size = 16, and training works perfectly fine.

So, always choose a smaller batch_size if you are training on laptops or mid-range GPUs.
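
One way to follow that advice without guessing is to start at the batch size you want and halve it until training fits. A minimal Keras sketch with a stand-in model and random data (the retry loop, not the model, is the point):

import tensorflow as tf

# Stand-in model and data; in practice these would be the real ViT and CIFAR-100.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
x = tf.random.normal([1024, 32])
y = tf.random.uniform([1024], maxval=10, dtype=tf.int32)

batch_size = 256
while batch_size >= 8:
    try:
        model.fit(x, y, batch_size=batch_size, epochs=1)
        break
    except tf.errors.ResourceExhaustedError:
        # This batch does not fit on the GPU; halve and retry.
        batch_size //= 2

As the earlier comment about rebooting suggests, an OOM can leave the process in a bad state, so in practice the retry is often cleanest as a fresh run with the smaller batch size.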

Hope this helps!

pranavdurai10 avatar Jun 15 '22 10:06 pranavdurai10

I was trying to train an autoencoder and encountered the same error even with smaller batches, but prefetching the data made the error go away. 😄
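
For anyone wondering what prefetching means here: with the tf.data API it is a one-line addition at the end of the input pipeline. A minimal sketch with a stand-in dataset (tf.data.AUTOTUNE needs TF 2.4 or later; older versions use tf.data.experimental.AUTOTUNE):

import tensorflow as tf

# Stand-in dataset; in practice this would be the real input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(tf.random.normal([256, 28, 28, 1]))
dataset = (dataset
           .batch(16)
           # Overlap preparing the next batch with training on the current one.
           .prefetch(tf.data.AUTOTUNE))

for batch in dataset.take(1):
    print(batch.shape)  # (16, 28, 28, 1)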

Pun-it avatar Jan 14 '24 12:01 Pun-it

@sainisanjay Have you found any solution for this issue? I ran a DNN program with both TensorFlow and PyTorch; it works fine with PyTorch, but TensorFlow throws an OOM error after a few episodes! The program works fine in TensorFlow if the batch size is reduced to 16, but that feels like a bummer.

febinmathew avatar Mar 22 '24 18:03 febinmathew