
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,64,300,300] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

sunrise666 opened this issue on Jul 28 '18 · 8 comments

2018-07-28 10:35:04.192687: W c:\users\user\source\repos\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[16,64,300,300] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[16,64,300,300] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[Node: ssd_300_vgg/conv1/conv1_1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](fifo_queue_Dequeue/_497, ssd_300_vgg/conv1/conv1_1/weights/read/_495)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: gradients/ssd_300_vgg/block10/conv1x1/BiasAdd_grad/BiasAddGrad/_749 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1185_gradients/ssd_300_vgg/block10/conv1x1/BiasAdd_grad/BiasAddGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'ssd_300_vgg/conv1/conv1_1/Conv2D', defined at:
  File "D:/SSD-Tensorflow-master-2/train_ssd_network.py", line 411, in tf.app.run()
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run _sys.exit(main(argv))
  File "D:/SSD-Tensorflow-master-2/train_ssd_network.py", line 311, in main clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
  File "D:\SSD-Tensorflow-master-2\deployment\model_deploy.py", line 196, in create_clones outputs = model_fn(*args, **kwargs)
  File "D:/SSD-Tensorflow-master-2/train_ssd_network.py", line 294, in clone_fn ssd_net.net(b_image, is_training=True)
  File "D:\SSD-Tensorflow-master-2\nets\ssd_vgg_300.py", line 156, in net scope=scope)
  File "D:\SSD-Tensorflow-master-2\nets\ssd_vgg_300.py", line 453, in ssd_net net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 2467, in repeat outputs = layer(outputs, *args, **kwargs)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 183, in func_with_args return func(*args, **current_args)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1049, in convolution outputs = layer.apply(inputs)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\layers\base.py", line 828, in apply return self.call(inputs, *args, **kwargs)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\layers\base.py", line 717, in call outputs = self.call(inputs, *args, **kwargs)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\layers\convolutional.py", line 168, in call outputs = self._convolution_op(inputs, self.kernel)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 868, in call return self.conv_op(inp, filter)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 520, in call return self.call(inp, filter)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 204, in call name=self.name)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 1042, in conv2d data_format=data_format, dilations=dilations, name=name)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper op_def=op_def)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op op_def=op_def)
  File "D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in init self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,64,300,300] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[Node: ssd_300_vgg/conv1/conv1_1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](fifo_queue_Dequeue/_497, ssd_300_vgg/conv1/conv1_1/weights/read/_495)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: gradients/ssd_300_vgg/block10/conv1x1/BiasAdd_grad/BiasAddGrad/_749 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1185_gradients/ssd_300_vgg/block10/conv1x1/BiasAdd_grad/BiasAddGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

sunrise666 · Jul 28 '18 02:07
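
Regarding the hint in the log above: report_tensor_allocations_upon_oom is a field of tf.RunOptions in TensorFlow 1.x that makes the runtime print the tensors currently allocated on the device when an OOM occurs, which helps pin down what is consuming memory. Below is a minimal sketch of how it is passed to session.run; the trivial train_op is only a stand-in, since in this repo the real training step runs inside slim.learning.train, so the options have to be threaded through to wherever session.run is actually invoked.

```python
import tensorflow as tf  # TensorFlow 1.x

# Stand-in op; the real train_op in SSD-Tensorflow is built by
# train_ssd_network.py / model_deploy and executed inside slim.learning.train.
counter = tf.Variable(0.0)
train_op = tf.assign_add(counter, 1.0)

# The hint from the error message: ask TensorFlow to report which tensors
# are allocated on the device whenever an OOM is raised during this run call.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, options=run_options)
```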

How can I solve this problem?

sunrise666 · Jul 28 '18 02:07

Oh, I am also facing this problem. Can anyone help me?

XuDuoBiao · Sep 18 '18 13:09

I have solved this problem. We just need a smaller batch_size, such as 16.

XuDuoBiao · Sep 18 '18 13:09

Your GPU memory is not enough; you can switch to a smaller batch_size.

rainley123 · Feb 17 '19 07:02
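
For reference, the batch size in this repo is just a command-line flag of train_ssd_network.py, so applying the advice above only means changing the launch command. A rough sketch, with illustrative paths and dataset values that you should adapt to your own setup:

```bash
# Lower --batch_size until training fits in GPU memory; the OOM above
# occurred with a batch of 16, so 8 or 4 is a reasonable next try.
python train_ssd_network.py \
    --train_dir=./logs/ssd_300_vgg \
    --dataset_dir=./tfrecords \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --batch_size=8
```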

I have solved this problem. We just need a smaller batch_size, such as 16.

Seni çok seviyorum! That means "I love you".

bsbilal · Jun 14 '20 22:06

I am getting this error while training with batch size = 1 on Kaggle. Can anyone please help?

sourabhyadav999 · Mar 29 '21 17:03

I am getting this error while training with batch size = 1 on Kaggle. Can anyone please help?

Me too. I hope someone can help us.

Stellaxu19 · May 11 '21 00:05

I am getting this error while training with batch size = 1 on Kaggle. Can anyone please help?

Me too. I hope someone can help us.

I solved it by reducing the complexity of the ANN architecture a little bit.

sourabhyadav999 · May 11 '21 05:05