darkflow
yolov1 OOM in local layer
When training with yolov1, an OOM occurs in the local layer because TensorFlow tries to allocate a tensor of shape [49,3,3,1024,256], which seems very large. Is this working as designed?
flow --model cfg/v1.1/myyolov1.cfg --train --dataset ~/data/JPEGImages --annotation ~/dataAnnotations --gpu 1
W tensorflow/core/common_runtime/bfc_allocator.cc:274] *******************************x
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 441.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[49,3,3,1024,256]
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 2.12G (2279421952 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
...which was originally created as op u'strided_slice_6', defined at:
File "/usr/bin/flow", line 6, in
...ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[49,3,3,1024,256]
[[Node: gradients/strided_slice_6_grad/StridedSliceGrad = StridedSliceGrad[Index=DT_INT32, T=DT_FLOAT, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/strided_slice_6_grad/Shape, strided_slice_6/stack, strided_slice_6/stack_1, strided_slice_6/stack_2, gradients/Conv2D_3_grad/tuple/control_dependency_1)]]
@thtrieu, please help. Thanks a lot in advance.
If I remove the --gpu flag and run on the CPU, there is no OOM error. So is this because yolov1 is too big to run on a GPU? BTW, I am using a Tesla K80.
Maybe you should try not using your whole GPU memory, with --gpu 0.9 or something like that. Another option is to adjust your batch size with --batch 16, although it should run even if it couldn't allocate enough memory. So I think the first suggestion should solve the problem. Look in flow --h for more options.
@Alesthy, thanks for your reply, but I actually tried --gpu 0.9 and even --batch 1, and it still gets the OOM error. Has anyone here actually run darkflow yolov1 on a GPU before? As far as I know, [49,3,3,1024,256] is truly a very big tensor, and this doesn't even include the batch size dimension.
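For reference, the size of that tensor can be checked with a quick calculation. This is a minimal sketch assuming float32 storage (the node in the traceback has T=DT_FLOAT); the helper name tensor_mib is made up for illustration.

```python
from functools import reduce
from operator import mul


def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor; 4 bytes per element assumes float32."""
    n_elements = reduce(mul, shape, 1)
    return n_elements * bytes_per_element / 2**20


size = tensor_mib([49, 3, 3, 1024, 256])
print(size)  # 441.0
```

This matches the 441.00MiB in the allocator warning. None of the five dimensions depends on the batch size, which is consistent with --batch 1 making no difference.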
class local_layer(Layer):
    def setup(self, ksize, c, n, stride, pad, w_, h_, activation):
        self.pad = pad * int(ksize / 2)
        self.activation = activation
        self.stride = stride
        self.ksize = ksize
        self.h_out = h_
        self.w_out = w_
        self.dnshape = [h_ * w_, n, c, ksize, ksize]
        self.wshape = dict({
            'biases': [h_ * w_ * n],
            'kernels': [h_ * w_, ksize, ksize, c, n]
        })

    def finalize(self, _):
        weights = self.w['kernels']
        if weights is None: return
        weights = weights.reshape(self.dnshape)
        weights = weights.transpose([0,3,4,2,1])
        self.w['kernels'] = weights
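To see what finalize does to the weight layout, here is a small NumPy sketch. The dimensions are made up to keep the array tiny (h_ = w_ = 2, n = 5, c = 4, ksize = 3); for the yolov1 local layer in this thread they would be 7, 7, 256, 1024 and 3. It reshapes a flat buffer into dnshape and applies the same transpose, giving the 'kernels' layout [h_ * w_, ksize, ksize, c, n].

```python
import numpy as np

# Illustrative sizes only, not the real yolov1 values.
h_, w_, n, c, ksize = 2, 2, 5, 4, 3

dnshape = [h_ * w_, n, c, ksize, ksize]            # layout before finalize
flat = np.arange(np.prod(dnshape), dtype=np.float32)

before = flat.reshape(dnshape)
weights = before.transpose([0, 3, 4, 2, 1])        # same axes order as finalize

print(weights.shape)  # (4, 3, 3, 4, 5), i.e. [h_ * w_, ksize, ksize, c, n]
```

So the first axis indexes the grid location, and each location carries its own full ksize x ksize x c x n kernel. With the real values that is 49 separate 3x3x1024x256 kernels.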
Any update on this? I'm also getting this error.
I have the same problem: using yolov2, the OOM error always occurs, even when I set the batch size to 1 and --gpu 0.01. I am using a GeForce GTX 750 Ti, CUDA 9.0 and cuDNN 7.0. Did you already solve this problem, @PigApple?
Has anybody solved this error?
Have you tried tiny-yolo? It works for me.
@thtrieu, please help. Thanks a lot in advance.
I also hit this error. I tested yolov1 (in folder cfg/v1.1) on a 2080 Ti (11 GB) and a Tesla V100 (16 GB). Unfortunately, the 'out of memory' error is always reported, even with batch size = 1. (I also tried --gpu 0.9 and --gpu 0.5.)
@andikira I have tried tiny-yolo, and it trains successfully on the 2080 Ti. (yolo-voc can also be trained on this GPU successfully.)
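One possible explanation for why even 16 GB is not enough, offered as a rough estimate rather than a confirmed diagnosis: the failing node in the first traceback is a StridedSliceGrad whose output has the full kernel shape [49,3,3,1024,256]. If the local layer slices one kernel per grid location (49 slices, which the node name strided_slice_6 hints at but which is an assumption here), and all of those full-shape gradients are alive at the same time before being summed, the total would be:

```python
# float32 gradient with the full kernel shape: 441 MiB each
MIB_PER_GRADIENT = 49 * 3 * 3 * 1024 * 256 * 4 / 2**20

# Assumption: one slice (and one full-shape gradient) per cell of the 7x7 grid.
NUM_SLICES = 49

total_gib = NUM_SLICES * MIB_PER_GRADIENT / 1024
print(round(total_gib, 1))  # 21.1
```

That is above both the 11 GB and the 16 GB card, and it does not shrink with --batch, which would fit what is reported here. Whether TensorFlow really keeps all 49 gradients resident simultaneously is not something these logs confirm.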
@thtrieu, please help. Thanks a lot in advance.
Here is some log information:
Parsing ./cfg/extraction.conv.cfg
Parsing cfg/yolov1.cfg
Loading ./bin/extraction.conv.weights ...
Successfully identified 89721616 bytes
Finished in 0.008398056030273438s
Model has a VOC model name, loading VOC labels.
Building net ...
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:105: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 448, 448, 3)
 Load  |  Yep!  | conv 7x7p3_2 +bnorm leaky        | (?, 224, 224, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 112, 112, 64)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 112, 112, 192)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 56, 56, 192)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 56, 56, 128)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 56, 56, 256)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 56, 56, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 56, 56, 512)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 256)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 28, 28, 512)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 28, 28, 1024)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 14, 14, 1024)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 14, 14, 512)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 14, 14, 1024)
 Load  |  Yep!  | conv 1x1p0_1 +bnorm leaky        | (?, 14, 14, 512)
 Load  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 14, 14, 1024)
 Init  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 14, 14, 1024)
 Init  |  Yep!  | conv 3x3p1_2 +bnorm leaky        | (?, 7, 7, 1024)
 Init  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 7, 7, 1024)
 Init  |  Yep!  | conv 3x3p1_1 +bnorm leaky        | (?, 7, 7, 1024)
 Init  |  Yep!  | loca 3x3p1_1 leaky               | (?, 7, 7, 256)
 Load  |  Yep!  | flat                             | (?, 12544)
 Init  |  Yep!  | full 12544 x 1715 linear         | (?, 1715)
-------+--------+----------------------------------+---------------
GPU mode with 1.0 usage
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:132: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.
cfg/yolov1.cfg loss hyper-parameters:
side = 7
box = 3
classes = 20
scales = [1.0, 1.0, 0.5, 5.0]
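These hyper-parameters tie back to the last two rows of the layer table. The following is a small sanity check, assuming the usual YOLOv1 output layout: side x side grid cells, each predicting `classes` class scores plus `box` boxes of 5 values (x, y, w, h, confidence). The variable names are only for illustration.

```python
side, box, classes = 7, 3, 20

# Width of the final fully connected layer under the assumed layout.
outputs = side * side * (classes + box * 5)
print(outputs)  # 1715

# Width of the flattened local layer output, (?, 7, 7, 256) in the table.
flat_width = 7 * 7 * 256
print(flat_width)  # 12544
```

Both numbers agree with the "full 12544 x 1715 linear" row, so the network itself is built as configured and the failure is a memory issue, not a shape mismatch.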
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/yolo/train.py:67: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Building cfg/yolov1.cfg loss
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/yolo/train.py:92: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
Building cfg/yolov1.cfg train op
WARNING:tensorflow:From /usr/local/python3.7.5/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /usr/local/python3.7.5/lib/python3.7/site-packages/tensorflow_core/python/training/rmsprop.py:119: calling Ones.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:145: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:145: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
............
2021-10-14 16:31:04.891447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-10-14 16:31:04.891459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-10-14 16:31:04.891560: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-14 16:31:04.892322: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-14 16:31:04.893037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 16130 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:0d.0, compute capability: 7.0)
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:146: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
2021-10-14 16:31:17.111913: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 15.75G (16914055168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:149: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
WARNING:tensorflow:From /root/darkflow-master/darkflow/net/build.py:149: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
Finished in 30.13685703277588s
Enter training ...
cfg/yolov1.cfg parsing ./VOCdevkit/ANN/
Parsing for ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
[====================>]100% 2009_002584.xml
Statistics:
............
Dataset of 22136 instance(s)
Training statistics:
    Learning rate : 0.001
    Batch size : 1
    Epoch number : 135
    Backup every : 20000
2021-10-14 16:31:28.460462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-14 16:31:28.800983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-14 16:31:30.756172: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.58G (1691406336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-10-14 16:31:30.756922: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.58G (1691406336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory