Resource exhausted: OOM when allocating tensor with shape[32,960,10,10]
I got this error during object detection training with the ssd_mobilenet_v2_quantized_300x300_coco model.
I am running the command below to start the training:
python ../../models/research/object_detection/model_main.py --pipeline_config_path=./ssd_mobilenet_v2_quantized_300x300_coco.config --model_dir=./training/ --num_train_steps=2000000 --sample_1_of_n_eval_examples=1 --alsologtostderr
Training was going fine until step 47900; after that I got this error:
File "/home/saini/.virtualenvs/cv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[32,960,10,10] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node gradients/AddN_162-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Loss/Cast_232/_16919]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[32,960,10,10] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node gradients/AddN_162-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
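The hint in the log refers to tf.RunOptions. For what it's worth, here is a minimal TF 1.x sketch of how report_tensor_allocations_upon_oom could be wired in through a SessionRunHook; the hook class name is made up here, and with model_main.py you would still have to attach it to the Estimator's train call yourself:
import tensorflow as tf

class ReportOOMHook(tf.train.SessionRunHook):
    """Hypothetical hook: asks TF to report allocated tensors if an OOM occurs."""
    def before_run(self, run_context):
        opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)
        return tf.train.SessionRunArgs(fetches=None, options=opts)

# Example usage (assuming you call the Estimator directly):
# estimator.train(input_fn=train_input_fn, hooks=[ReportOOMHook()])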
OS: Ubuntu 18.04
TensorFlow: 1.14.0 (GPU)
CUDA: 10.0
cuDNN: 7.6
Batch size: 32
Following are the changes I have made to the default TF Object Detection API:
model_lib.py
tf.estimator.EvalSpec(
    name=eval_spec_name,
    input_fn=eval_input_fn,
    steps=None,
    throttle_secs=172800,
    exporters=exporter))
eval.proto
optional uint32 eval_interval_secs = 3 [default = 172800];  // default was 600
model_main.py
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_steps=5000)
Note: the error occurred after 47900 steps. My question is why the error appears only after 47900 steps and not at the initial steps.
OOM means out of memory. Maybe at the initial steps memory is mostly free, but as the step count grows, more memory is used. I think that's why it happens.
@VismayTandel Yes, you are right: OOM means out of memory. But the image size and batch size are the same throughout training, so how can later steps need more memory? Furthermore, I am not running any other program that uses GPU memory. That's why it seems quite strange to me that the GPU runs out of memory partway through training.
I changed the batch size to 16, and then everything worked fine.
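For the Object Detection API this is just a change in the pipeline config; a minimal sketch of the relevant block (only batch_size is the point here, everything else stays as in the model zoo config):
train_config: {
  batch_size: 16
  # ... optimizer, fine_tune_checkpoint, etc. unchanged
}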
Same problem here. I tried a smaller batch_size and/or learning_rate, but it still doesn't work; it gives me the same error.
They should ship a small example dataset and config file with each official model, so that we would at least be able to tell the training isn't going anywhere before waiting 3~4 hours.
I am getting the same error:
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_generator_v1.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing)
591 shuffle=shuffle,
592 initial_epoch=initial_epoch,
--> 593 steps_name='steps_per_epoch')
594
595 def evaluate(self,
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_generator_v1.py in model_iteration(model, data, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch, mode, batch_size, steps_name, **kwargs)
257
258 is_deferred = not model._is_compiled
--> 259 batch_outs = batch_function(*batch_data)
260 if not isinstance(batch_outs, list):
261 batch_outs = [batch_outs]
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_v1.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
1086 self._update_sample_weight_modes(sample_weights=sample_weights)
1087 self._make_train_function()
-> 1088 outputs = self.train_function(ins) # pylint: disable=not-callable
1089
1090 if reset_metrics:
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
3955
3956 fetched = self._callable_fn(*array_vals,
-> 3957 run_metadata=self.run_metadata)
3958 self._call_fetch_callbacks(fetched[-len(self._fetches):])
3959 output_structure = nest.pack_sequence_as(
/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
1480 ret = tf_session.TF_SessionRunCallable(self._session._session,
1481 self._handle, args,
-> 1482 run_metadata_ptr)
1483 if run_metadata:
1484 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
ResourceExhaustedError: OOM when allocating tensor with shape[32768,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node training_46/Adam/Adam/update_dense_22/kernel/ResourceApplyAdam}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Does anyone have any idea? How can later steps of the training consume more memory?
Thanks, I changed the batch size to 16 and my problem is solved.
I have the same error message (I am running TensorFlow 2.4.1):
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,64,224,896] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model_1/batch_normalization_8/FusedBatchNormV3 (defined at /opt/conda/lib/python3.7/site-packages/mlrun/frameworks/keras/mlrun_interface.py:123) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_train_function_4513]
I had the same issue. I reduced the batch size and tried again, same error. I tried setting allow_growth, still the same. Rebooting the system made the problem go away, but killing the training partway through and starting it again made the issue come back.
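In case it helps, this is the usual way GPU memory growth is enabled; a minimal sketch covering both the TF 2.x path and the TF 1.x allow_growth path, not a guaranteed fix for this issue:
import tensorflow as tf

# TF 2.x: allocate GPU memory on demand instead of grabbing it all up front.
# Must run before any GPUs are initialized.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# TF 1.x equivalent: pass a ConfigProto with allow_growth to the Session.
# config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
# sess = tf.Session(config=config)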
I was trying to train a Vision Transformer on CIFAR-100 dataset.
GPU: GTX 1650 w/ 4GB vRAM
Initially, I had the batch_size set to 256, which was totally insane for such a GPU, and I was getting the same OOM error.
I tweaked it to batch_size = 16, and training works perfectly fine.
So, always choose a smaller batch_size if you are training on laptops or mid-range GPUs.
Hope this helps!
I was trying to train an autoencoder and encountered the same error even with smaller batches, but prefetching the data made the error go away. 😄
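Roughly what I mean by prefetching; a minimal tf.data sketch with dummy data standing in for the real training set:
import tensorflow as tf

# Dummy data standing in for the real training set.
features = tf.random.normal([1024, 64])
train_dataset = tf.data.Dataset.from_tensor_slices((features, features))

train_dataset = (
    train_dataset
    .batch(16)
    .prefetch(tf.data.experimental.AUTOTUNE)  # overlap input preparation with GPU compute
)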
@sainisanjay Have you found any solution for this issue? I ran a DNN program with both TensorFlow and PyTorch and it works fine with PyTorch but TensorFlow throws an OOM error after a few episodes! The program works fine in TensorFlow if the batch size is reduced to 16, but it feels like a bummer.