yolov1 OOM in local layer

Open donglinjy opened this issue 8 years ago • 11 comments

When training with yolov1, an OOM occurs in the local layer because it tries to allocate a tensor of shape [49,3,3,1024,256], which seems very large. Is this working as designed?

flow --model cfg/v1.1/myyolov1.cfg --train --dataset ~/data/JPEGImages --annotation ~/dataAnnotations --gpu 1

W tensorflow/core/common_runtime/bfc_allocator.cc:274] *******************************x
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 441.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[49,3,3,1024,256]
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 2.12G (2279421952 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

...which was originally created as op u'strided_slice_6', defined at:
  File "/usr/bin/flow", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
[elided 1 identical lines from previous traceback]
  File "darkflow/darkflow/cli.py", line 22, in cliHandler
    tfnet = TFNet(FLAGS)
  File "darkflow/darkflow/net/build.py", line 75, in __init__
    self.build_forward()
  File "darkflow/darkflow/net/build.py", line 115, in build_forward
    state = op_create(*args)
  File "darkflow/darkflow/net/ops/__init__.py", line 27, in op_create
    return op_types[layer_type](*args)
  File "darkflow/darkflow/net/ops/baseop.py", line 42, in __init__
    self.forward()
  File "darkflow/darkflow/net/ops/convolution.py", line 48, in forward
    kij = k[i * self.lay.w_out + j]
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 722, in _SliceHelperVar
    return _SliceHelper(var._AsTensor(), slice_spec, var)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 495, in _SliceHelper
    name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 653, in strided_slice
    shrink_axis_mask=shrink_axis_mask)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3688, in strided_slice
    shrink_axis_mask=shrink_axis_mask, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)

...ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[49,3,3,1024,256] [[Node: gradients/strided_slice_6_grad/StridedSliceGrad = StridedSliceGrad[Index=DT_INT32, T=DT_FLOAT, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/strided_slice_6_grad/Shape, strided_slice_6/stack, strided_slice_6/stack_1, strided_slice_6/stack_2, gradients/Conv2D_3_grad/tuple/control_dependency_1)]]
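As a rough check on the numbers in the log above, the size of a tensor of shape [49,3,3,1024,256] can be worked out directly. The sketch below assumes 4 bytes per element (the node is reported as T=DT_FLOAT). The 49-copy figure is only an assumption drawn from the traceback, where the kernel is sliced once per output position (`k[i * self.lay.w_out + j]`) and the failing op is the gradient of one such slice; if every slice gradient is materialized at full kernel shape before being summed, the total would be far larger than a single copy. The helper name `tensor_mib` is hypothetical.

```python
from math import prod

def tensor_mib(shape, bytes_per_elem=4):
    """Size of a dense tensor in MiB (float32 assumed by default)."""
    return prod(shape) * bytes_per_elem / 2**20

kernel_shape = [49, 3, 3, 1024, 256]  # shape reported in the OOM message
one_copy = tensor_mib(kernel_shape)
print(f"one copy: {one_copy:.2f} MiB")

# Assumption: one full-shape gradient per sliced position (49 = 7 * 7).
positions = kernel_shape[0]
all_copies_gib = one_copy * positions / 1024
print(f"{positions} copies: {all_copies_gib:.1f} GiB")
```

The single-copy figure matches the 441.00MiB allocation in the log. None of the five dimensions is a batch dimension, so this size does not shrink with a smaller batch.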

donglinjy • Sep 06 '17 09:09

@thtrieu, please help. Thanks a lot in advance.

donglinjy • Sep 06 '17 09:09

If I remove the --gpu flag and run on the CPU, there is no OOM error. So is this because yolov1 is too big to run on a GPU? BTW, I am using a Tesla K80.

donglinjy • Sep 06 '17 10:09

Maybe you should try not using your whole GPU memory, with --gpu 0.9 or something like that. Another option is to adjust your batch size with --batch 16, although it should run even if it couldn't allocate enough memory. So I think the first suggestion should solve the problem. See flow --h for more options.

anschwei • Sep 06 '17 18:09

@Alesthy, thanks for your reply, but I actually tried --gpu 0.9 and even --batch 1, and it still gets the OOM error. Has anyone here actually run darkflow yolov1 on a GPU before? As far as I can tell, [49,3,3,1024,256] is a very big tensor, and it doesn't even include the batch size dimension.

class local_layer(Layer):
    def setup(self, ksize, c, n, stride, pad, w_, h_, activation):
        self.pad = pad * int(ksize / 2)
        self.activation = activation
        self.stride = stride
        self.ksize = ksize
        self.h_out = h_
        self.w_out = w_
        self.dnshape = [h_ * w_, n, c, ksize, ksize]
        self.wshape = dict({
            'biases': [h_ * w_ * n],
            'kernels': [h_ * w_, ksize, ksize, c, n]
        })

    def finalize(self, _):
        weights = self.w['kernels']
        if weights is None: return
        weights = weights.reshape(self.dnshape)
        weights = weights.transpose([0,3,4,2,1])
        self.w['kernels'] = weights
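To see where the shape [49,3,3,1024,256] comes from, the shape logic of `setup` and `finalize` can be replayed on its own. This is a minimal sketch using values taken from this thread (a 7x7 output grid, 3x3 kernel, 1024 input channels, 256 output channels); the helper `local_kernel_shape` is hypothetical and not part of darkflow.

```python
def local_kernel_shape(h_, w_, ksize, c, n):
    """Hypothetical helper: shape of 'kernels' after finalize()."""
    dnshape = [h_ * w_, n, c, ksize, ksize]  # layout used for the reshape in finalize()
    perm = [0, 3, 4, 2, 1]                   # axis order passed to transpose()
    # After a transpose, output axis i takes the size of input axis perm[i].
    return [dnshape[axis] for axis in perm]

# Values from this thread: 7x7 grid, 3x3 kernel, 1024 -> 256 channels.
shape = local_kernel_shape(h_=7, w_=7, ksize=3, c=1024, n=256)
print(shape)
```

Because a local layer keeps a separate 3x3x1024x256 kernel for each of the 49 output positions instead of sharing one, its weight tensor is 49 times the size of the equivalent convolution kernel.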

donglinjy • Sep 07 '17 01:09

Any update on this? I'm also getting this error.

rij12 • Mar 23 '18 23:03

I have the same problem: using yolov2, the OOM error always occurs, even when I set the batch size to 1 and --gpu 0.01. I am using a GeForce GTX 750 Ti, CUDA 9.0 and cuDNN 7.0. Did you already solve this problem, @PigApple?

andikira • Apr 09 '18 04:04

Has anybody solved this error?

linydf • Apr 10 '18 03:04

Have you tried tiny-yolo? It works for me.

andikira • Apr 10 '18 16:04

@thtrieu, please help. Thanks a lot in advance.

LuvRC • Dec 23 '19 15:12

I also hit this error. I tested yolov1 (in folder cfg/v1.1) on a 2080 Ti (11 GB) and a Tesla V100 (16 GB). Unfortunately, the 'out of memory' error is always reported, even with batch size = 1. (I also tried --gpu 0.9 and --gpu 0.5.)

@andikira I have tried tiny-yolo, and it trains successfully on the 2080 Ti. (yolo-voc can also be trained successfully on this GPU.)

@thtrieu, please help. Thanks a lot in advance.

lyq998 • Oct 15 '21 02:10

Here is some log information:

Parsing ./cfg/extraction.conv.cfg
Parsing cfg/yolov1.cfg
Loading ./bin/extraction.conv.weights ...
Successfully identified 89721616 bytes
Finished in 0.008398056030273438s
Model has a VOC model name, loading VOC labels.

Building net ...
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:105: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 448, 448, 3)
 Load  |  Yep!  | conv 7x7p3_2 +bnorm leaky        | (?, 224, 224, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 112, 112, 64)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 112, 112, 192)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 56, 56, 192)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 56, 56, 128)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 56, 56, 256)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 56, 56, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 56, 56, 512)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 1024)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 14, 14, 1024)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 14, 14, 512)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 14, 14, 1024)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 14, 14, 512)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 14, 14, 1024)
 Init  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 14, 14, 1024)
 Init  |  Yep!  | conv 3x3p1_2 +bnorm leaky        | (?, 7, 7, 1024)
 Init  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 7, 7, 1024)
 Init  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 7, 7, 1024)
 Init  |  Yep!  | loca 3x3p1_1 leaky               | (?, 7, 7, 256)
 Load  |  Yep!  | flat                             | (?, 12544)
 Init  |  Yep!  | full 12544 x 1715 linear         | (?, 1715)
-------+--------+----------------------------------+---------------
GPU mode with 1.0 usage
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:132: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

cfg/yolov1.cfg loss hyper-parameters:
    side = 7
    box = 3
    classes = 20
    scales = [1.0, 1.0, 0.5, 5.0]
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/yolo/train.py:67: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Building cfg/yolov1.cfg loss
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/yolo/train.py:92: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
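The loss hyper-parameters above line up with the last rows of the layer table. Assuming the usual YOLOv1 output layout (the log does not spell it out), each of the side x side grid cells predicts `classes` scores plus five values (x, y, w, h, confidence) per box. The sketch below checks that arithmetic; the function name is hypothetical.

```python
def yolo_v1_output_size(side, box, classes):
    """Length of the prediction vector: side*side cells, each holding
    class scores plus (x, y, w, h, confidence) for every box."""
    return side * side * (classes + box * 5)

# Values from the log: side = 7, box = 3, classes = 20.
outputs = yolo_v1_output_size(side=7, box=3, classes=20)
print(outputs)

# Flattened output of the local layer, (?, 7, 7, 256) in the table.
flat = 7 * 7 * 256
print(flat)
```

These two values correspond to the `full 12544 x 1715` row, so the fully connected layer is sized as expected and the oversized tensor comes from the local layer before it.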

Building cfg/yolov1.cfg train op
WARNING:tensorflow:From /usr/local/python3.7.5/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /usr/local/python3.7.5/lib/python3.7/site-packages/tensorflow_core/python/training/rmsprop.py:119: calling Ones.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:145: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:145: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

............

2021-10-14 16:31:04.891447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-10-14 16:31:04.891459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-10-14 16:31:04.891560: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-14 16:31:04.892322: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-14 16:31:04.893037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 16130 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:0d.0, compute capability: 7.0)
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:146: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

2021-10-14 16:31:17.111913: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 15.75G (16914055168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:149: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:149: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Finished in 30.13685703277588s

Enter training ...

cfg/yolov1.cfg parsing ./VOCdevkit/ANN/
Parsing for ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
[====================>]100% 2009_002584.xml
Statistics:

............

Dataset of 22136 instance(s)
Training statistics:
    Learning rate : 0.001
    Batch size    : 1
    Epoch number  : 135
    Backup every  : 20000
2021-10-14 16:31:28.460462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-14 16:31:28.800983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-14 16:31:30.756172: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.58G (1691406336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-10-14 16:31:30.756922: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.58G (1691406336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

lyq998 • Oct 15 '21 03:10