tf-faster-rcnn

GPU memory issue when training

Open snsie opened this issue 7 years ago • 6 comments

I have run the test script successfully, but I am hitting memory errors when training. The log is pasted below. I have tried lowering the batch size, but that didn't fix the error. I am using a GTX 1070 GPU, and I have run smallcorgi's faster-rcnn repository in the past without memory issues. Has anyone else encountered this error?

time python3 ./tools/trainval_net.py --weight data/imagenet_weights/vgg16.ckpt --imdb voc_2007_trainval --imdbval voc_2007_test --iters 70000 --cfg experiments/cfgs/vgg16.yml --net vgg16 --set ANCHOR_SCALES '[8,16,32]' ANCHOR_RATIOS '[0.5,1,2]' TRAIN.STEPSIZE '[50000]'

Called with args:
Namespace(cfg_file='experiments/cfgs/vgg16.yml', imdb_name='voc_2007_trainval', imdbval_name='voc_2007_test', max_iters=70000, net='vgg16', set_cfgs=['ANCHOR_SCALES', '[8,16,32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'TRAIN.STEPSIZE', '[50000]'], tag=None, weight='data/imagenet_weights/vgg16.ckpt')
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [8, 16, 32],
 'DATA_DIR': '/home/scott/chridemo/tf-faster-rcnn/data',
 'EXP_DIR': 'vgg16',
 'MATLAB': 'matlab',
 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0, 'FIXED_LAYERS': 5, 'REGU_DEPTH': False, 'WEIGHT_DECAY': 4e-05},
 'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
 'POOLING_MODE': 'crop',
 'POOLING_SIZE': 7,
 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/scott/chridemo/tf-faster-rcnn',
 'RPN_CHANNELS': 512,
 'TEST': {'BBOX_REG': True, 'HAS_RPN': True, 'MAX_SIZE': 1000, 'MODE': 'nms', 'NMS': 0.3, 'PROPOSAL_METHOD': 'gt', 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 300, 'RPN_PRE_NMS_TOP_N': 6000, 'RPN_TOP_N': 5000, 'SCALES': [600], 'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False, 'BATCH_SIZE': 32, 'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0], 'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2], 'BBOX_NORMALIZE_TARGETS': True, 'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True, 'BBOX_REG': True, 'BBOX_THRESH': 0.5, 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'BIAS_DECAY': False, 'DISPLAY': 20, 'DOUBLE_BIAS': True, 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'GAMMA': 0.1, 'HAS_RPN': True, 'IMS_PER_BATCH': 1, 'LEARNING_RATE': 0.001, 'MAX_SIZE': 1000, 'MOMENTUM': 0.9, 'PROPOSAL_METHOD': 'gt', 'RPN_BATCHSIZE': 32, 'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'RPN_CLOBBER_POSITIVES': False, 'RPN_FG_FRACTION': 0.5, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POSITIVE_WEIGHT': -1.0, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'SCALES': [600], 'SNAPSHOT_ITERS': 5000, 'SNAPSHOT_KEPT': 3, 'SNAPSHOT_PREFIX': 'vgg16_faster_rcnn', 'STEPSIZE': [50000], 'SUMMARY_INTERVAL': 180, 'TRUNCATED': False, 'USE_ALL_GT': True, 'USE_FLIPPED': True, 'USE_GT': False, 'WEIGHT_DECAY': 0.0001},
 'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /home/scott/chridemo/tf-faster-rcnn/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
10022 roidb entries
Output will be saved to /home/scott/chridemo/tf-faster-rcnn/output/vgg16/voc_2007_trainval/default
TensorFlow summaries will be saved to /home/scott/chridemo/tf-faster-rcnn/tensorboard/vgg16/voc_2007_trainval/default
Loaded dataset voc_2007_test for training
Set proposal method: gt
Preparing training data...
voc_2007_test gt roidb loaded from /home/scott/chridemo/tf-faster-rcnn/data/cache/voc_2007_test_gt_roidb.pkl
done
4952 validation roidb entries
Filtered 0 roidb entries: 10022 -> 10022
Filtered 0 roidb entries: 4952 -> 4952
2018-01-17 13:33:18.699896: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-17 13:33:18.877826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:03:00.0
totalMemory: 7.92GiB freeMemory: 7.52GiB
2018-01-17 13:33:18.877854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:03:00.0, compute capability: 6.1)
Solving...
/home/scott/.local/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Loading initial model weights from data/imagenet_weights/vgg16.ckpt
Variables restored: vgg_16/conv1/conv1_1/biases:0
Variables restored: vgg_16/conv1/conv1_2/weights:0
Variables restored: vgg_16/conv1/conv1_2/biases:0
Variables restored: vgg_16/conv2/conv2_1/weights:0
Variables restored: vgg_16/conv2/conv2_1/biases:0
Variables restored: vgg_16/conv2/conv2_2/weights:0
Variables restored: vgg_16/conv2/conv2_2/biases:0
Variables restored: vgg_16/conv3/conv3_1/weights:0
Variables restored: vgg_16/conv3/conv3_1/biases:0
Variables restored: vgg_16/conv3/conv3_2/weights:0
Variables restored: vgg_16/conv3/conv3_2/biases:0
Variables restored: vgg_16/conv3/conv3_3/weights:0
Variables restored: vgg_16/conv3/conv3_3/biases:0
Variables restored: vgg_16/conv4/conv4_1/weights:0
Variables restored: vgg_16/conv4/conv4_1/biases:0
Variables restored: vgg_16/conv4/conv4_2/weights:0
Variables restored: vgg_16/conv4/conv4_2/biases:0
Variables restored: vgg_16/conv4/conv4_3/weights:0
Variables restored: vgg_16/conv4/conv4_3/biases:0
Variables restored: vgg_16/conv5/conv5_1/weights:0
Variables restored: vgg_16/conv5/conv5_1/biases:0
Variables restored: vgg_16/conv5/conv5_2/weights:0
Variables restored: vgg_16/conv5/conv5_2/biases:0
Variables restored: vgg_16/conv5/conv5_3/weights:0
Variables restored: vgg_16/conv5/conv5_3/biases:0
Variables restored: vgg_16/fc6/biases:0
Variables restored: vgg_16/fc7/biases:0
Loaded.
Fix VGG16 layers..
Fixed.
2018-01-17 13:33:24.312766: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:29.614554: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.49GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:30.811321: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.52GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:34.048083: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:35.281011: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.90GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:35.424096: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.75GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:36.886421: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.49GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:37.259091: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.49GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
iter: 20 / 70000, total loss: 3.258544
rpn_loss_cls: 0.397910
rpn_loss_box: 0.524691
loss_cls: 1.501672
loss_box: 0.702400
lr: 0.001000
speed: 0.785s / iter
2018-01-17 13:33:38.927469: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.34GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:40.305694: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.46GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
iter: 40 / 70000, total loss: 1.690050
rpn_loss_cls: 0.795447
rpn_loss_box: 0.613905
loss_cls: 0.148638
loss_box: 0.000000
lr: 0.001000
speed: 0.657s / iter
out of memory
invalid argument
an illegal memory access was encountered
an illegal memory access was encountered
2018-01-17 13:33:53.393398: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:638] failed to record completion event; therefore, failed to create inter-stream dependency
2018-01-17 13:33:53.393413: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:638] failed to record completion event; therefore, failed to create inter-stream dependency
2018-01-17 13:33:53.393406: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:638] failed to record completion event; therefore, failed to create inter-stream dependency
2018-01-17 13:33:53.393399: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:638] failed to record completion event; therefore, failed to create inter-stream dependency
2018-01-17 13:33:53.393450: I tensorflow/stream_executor/stream.cc:4637] stream 0x605b5f0 did not memcpy host-to-device; source: 0x7f5930025080
2018-01-17 13:33:53.393435: I tensorflow/stream_executor/stream.cc:4637] stream 0x605b5f0 did not memcpy host-to-device; source: 0x7f593822f100
2018-01-17 13:33:53.393429: I tensorflow/stream_executor/stream.cc:4637] stream 0x605b5f0 did not memcpy host-to-device; source: 0x7f5714793cd0
2018-01-17 13:33:53.393452: I tensorflow/stream_executor/stream.cc:4637] stream 0x605b5f0 did not memcpy host-to-device; source: 0x7f5930027a90
2018-01-17 13:33:53.393470: E tensorflow/stream_executor/stream.cc:306] Error recording event in stream: error recording CUDA event on stream 0x605b690: CUDA_ERROR_ILLEGAL_ADDRESS; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2018-01-17 13:33:53.393506: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-17 13:33:53.393515: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
Command terminated by signal 6
29.28user 14.32system 0:40.50elapsed 107%CPU (0avgtext+0avgdata 4591788maxresident)k
104inputs+3208outputs (0major+897254minor)pagefaults 0swaps
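For reference: the "W ... bfc_allocator" lines above are TensorFlow's allocator failing a large single allocation (it retries, so they are only warnings), while the fatal error is the later CUDA_ERROR_ILLEGAL_ADDRESS. Below is a minimal TF 1.x sketch of the session-level GPU memory options; the names and the 0.9 fraction are illustrative, this is not the repo's own session setup (its training script may already enable allow_growth), and these options alone are not guaranteed to fix the crash.

```python
# Minimal TF 1.x sketch (not taken from tf-faster-rcnn itself): session
# options that control how the BFC allocator reserves GPU memory.
import tensorflow as tf

gpu_options = tf.GPUOptions(
    allow_growth=True,                    # start small, grow allocations on demand
    per_process_gpu_memory_fraction=0.9)  # optionally cap total usage (illustrative value)
tfconfig = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)

with tf.Session(config=tfconfig) as sess:
    # build / restore the model and run training steps here
    pass
```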

snsie avatar Jan 17 '18 18:01 snsie

I think you may need a GPU with more memory. GPU memory usage can change during training because each image comes in a different size.
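To make the varying-size point concrete, here is a small sketch of the shortest-side/longest-side resize rule that py-faster-rcnn-style data layers apply (TRAIN.SCALES = [600] and TRAIN.MAX_SIZE = 1000 in the config printed above); resized_shape is a hypothetical helper name, not a function from this repo. Images with different aspect ratios produce different network input sizes, so the activation memory needed per iteration varies even with IMS_PER_BATCH = 1.

```python
# Sketch of the shortest-side / longest-side resize rule assumed above.
def resized_shape(height, width, target_size=600, max_size=1000):
    scale = float(target_size) / min(height, width)   # scale shortest side to 600
    if round(scale * max(height, width)) > max_size:  # but cap the longest side at 1000
        scale = float(max_size) / max(height, width)
    return int(round(height * scale)), int(round(width * scale))

print(resized_shape(375, 500))   # typical VOC photo -> (600, 800)
print(resized_shape(333, 1000))  # very wide image   -> (333, 1000)
```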

endernewton avatar Feb 08 '18 16:02 endernewton

@endernewton I got this error with the same GPU, but some time ago everything worked fine. And I am talking about detection on unlabeled COCO images, not about training. I don't know the reason for this issue...

SAVeselovskiy avatar Feb 12 '18 20:02 SAVeselovskiy

I have a GPU memory error too, and I have had no idea how to fix it for a long time. My GPU is a 1060. I have also run the test script successfully, and I have tried lowering the batch size, but that didn't fix the error either...

suixin567 avatar Jan 19 '19 07:01 suixin567

@endernewton @ScottSiegel

suixin567 avatar Jan 19 '19 07:01 suixin567

Change BATCH_SIZE: and RPN_BATCHSIZE: in the experiment config file under experiments/cfgs/ (e.g. res101.yml).
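A hedged illustration of that change, applied through the config-override mechanism instead of editing the YAML by hand; it assumes the config helpers in lib/model/config.py follow upstream py-faster-rcnn (cfg, cfg_from_file, cfg_from_list) and that lib/ is on the path (the tools/ scripts do this via their _init_paths import). The values 64 and 128 are examples, not recommendations.

```python
# Sketch only: lower the RoI / anchor sampling sizes programmatically,
# mirroring how tools/trainval_net.py applies its --set arguments.
from model.config import cfg, cfg_from_file, cfg_from_list

cfg_from_file('experiments/cfgs/res101.yml')      # base experiment config
cfg_from_list(['TRAIN.BATCH_SIZE', '64',          # RoIs sampled per image
               'TRAIN.RPN_BATCHSIZE', '128'])     # anchors sampled for the RPN loss
print(cfg.TRAIN.BATCH_SIZE, cfg.TRAIN.RPN_BATCHSIZE)
```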

H-Wenfeng avatar Aug 02 '19 15:08 H-Wenfeng

Change BATCH_SIZE: and RPN_BATCHSIZE: in the experiment config file under experiments/cfgs/ (e.g. res101.yml).

Sir/madam, does this really work?

devendraswamy avatar Feb 19 '20 05:02 devendraswamy