tf-faster-rcnn
GPU memory issue when training
I have run the test script successfully, but I am hitting memory errors when training. The log is pasted below. I have tried lowering the batch size, but that didn't fix the error. I am using a GTX 1070 GPU, and I have run smallcorgi's faster-rcnn repository in the past without memory issues. Has anyone else encountered this error?
- time python3 ./tools/trainval_net.py --weight data/imagenet_weights/vgg16.ckpt --imdb voc_2007_trainval --imdbval voc_2007_test --iters 70000 --cfg experiments/cfgs/vgg16.yml --net vgg16 --set ANCHOR_SCALES '[8,16,32]' ANCHOR_RATIOS '[0.5,1,2]' TRAIN.STEPSIZE '[50000]'
Called with args:
Namespace(cfg_file='experiments/cfgs/vgg16.yml', imdb_name='voc_2007_trainval', imdbval_name='voc_2007_test', max_iters=70000, net='vgg16', set_cfgs=['ANCHOR_SCALES', '[8,16,32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'TRAIN.STEPSIZE', '[50000]'], tag=None, weight='data/imagenet_weights/vgg16.ckpt')
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8, 16, 32],
'DATA_DIR': '/home/scott/chridemo/tf-faster-rcnn/data',
'EXP_DIR': 'vgg16',
'MATLAB': 'matlab',
'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
'FIXED_LAYERS': 5,
'REGU_DEPTH': False,
'WEIGHT_DECAY': 4e-05},
'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'crop',
'POOLING_SIZE': 7,
'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
'RNG_SEED': 3,
'ROOT_DIR': '/home/scott/chridemo/tf-faster-rcnn',
'RPN_CHANNELS': 512,
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'gt',
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 32,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'DISPLAY': 20,
'DOUBLE_BIAS': True,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.001,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 32,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'vgg16_faster_rcnn',
'STEPSIZE': [50000],
'SUMMARY_INTERVAL': 180,
'TRUNCATED': False,
'USE_ALL_GT': True,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0001},
'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /home/scott/chridemo/tf-faster-rcnn/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
10022 roidb entries
Output will be saved to /home/scott/chridemo/tf-faster-rcnn/output/vgg16/voc_2007_trainval/default
TensorFlow summaries will be saved to /home/scott/chridemo/tf-faster-rcnn/tensorboard/vgg16/voc_2007_trainval/default
Loaded dataset voc_2007_test for training
Set proposal method: gt
Preparing training data...
voc_2007_test gt roidb loaded from /home/scott/chridemo/tf-faster-rcnn/data/cache/voc_2007_test_gt_roidb.pkl
done
4952 validation roidb entries
Filtered 0 roidb entries: 10022 -> 10022
Filtered 0 roidb entries: 4952 -> 4952
2018-01-17 13:33:18.699896: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-17 13:33:18.877826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:03:00.0
totalMemory: 7.92GiB freeMemory: 7.52GiB
2018-01-17 13:33:18.877854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:03:00.0, compute capability: 6.1)
Solving...
/home/scott/.local/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Loading initial model weights from data/imagenet_weights/vgg16.ckpt
Variables restored: vgg_16/conv1/conv1_1/biases:0
Variables restored: vgg_16/conv1/conv1_2/weights:0
Variables restored: vgg_16/conv1/conv1_2/biases:0
Variables restored: vgg_16/conv2/conv2_1/weights:0
Variables restored: vgg_16/conv2/conv2_1/biases:0
Variables restored: vgg_16/conv2/conv2_2/weights:0
Variables restored: vgg_16/conv2/conv2_2/biases:0
Variables restored: vgg_16/conv3/conv3_1/weights:0
Variables restored: vgg_16/conv3/conv3_1/biases:0
Variables restored: vgg_16/conv3/conv3_2/weights:0
Variables restored: vgg_16/conv3/conv3_2/biases:0
Variables restored: vgg_16/conv3/conv3_3/weights:0
Variables restored: vgg_16/conv3/conv3_3/biases:0
Variables restored: vgg_16/conv4/conv4_1/weights:0
Variables restored: vgg_16/conv4/conv4_1/biases:0
Variables restored: vgg_16/conv4/conv4_2/weights:0
Variables restored: vgg_16/conv4/conv4_2/biases:0
Variables restored: vgg_16/conv4/conv4_3/weights:0
Variables restored: vgg_16/conv4/conv4_3/biases:0
Variables restored: vgg_16/conv5/conv5_1/weights:0
Variables restored: vgg_16/conv5/conv5_1/biases:0
Variables restored: vgg_16/conv5/conv5_2/weights:0
Variables restored: vgg_16/conv5/conv5_2/biases:0
Variables restored: vgg_16/conv5/conv5_3/weights:0
Variables restored: vgg_16/conv5/conv5_3/biases:0
Variables restored: vgg_16/fc6/biases:0
Variables restored: vgg_16/fc7/biases:0
Loaded.
Fix VGG16 layers..
Fixed.
2018-01-17 13:33:24.312766: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:29.614554: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.49GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:30.811321: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.52GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:34.048083: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:35.281011: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.90GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:35.424096: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.75GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:36.886421: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.49GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:37.259091: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.49GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
iter: 20 / 70000, total loss: 3.258544
rpn_loss_cls: 0.397910
rpn_loss_box: 0.524691
loss_cls: 1.501672
loss_box: 0.702400
lr: 0.001000
speed: 0.785s / iter
2018-01-17 13:33:38.927469: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.34GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-01-17 13:33:40.305694: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.46GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
iter: 40 / 70000, total loss: 1.690050
rpn_loss_cls: 0.795447
rpn_loss_box: 0.613905
loss_cls: 0.148638
loss_box: 0.000000
lr: 0.001000
speed: 0.657s / iter
out of memory
invalid argument
an illegal memory access was encountered
an illegal memory access was encountered
2018-01-17 13:33:53.393398: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:638] failed to record completion event; therefore, failed to create inter-stream dependency
2018-01-17 13:33:53.393413: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:638] failed to record completion event; therefore, failed to create inter-stream dependency
2018-01-17 13:33:53.393406: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:638] failed to record completion event; therefore, failed to create inter-stream dependency
2018-01-17 13:33:53.393399: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:638] failed to record completion event; therefore, failed to create inter-stream dependency
2018-01-17 13:33:53.393450: I tensorflow/stream_executor/stream.cc:4637] stream 0x605b5f0 did not memcpy host-to-device; source: 0x7f5930025080
2018-01-17 13:33:53.393435: I tensorflow/stream_executor/stream.cc:4637] stream 0x605b5f0 did not memcpy host-to-device; source: 0x7f593822f100
2018-01-17 13:33:53.393429: I tensorflow/stream_executor/stream.cc:4637] stream 0x605b5f0 did not memcpy host-to-device; source: 0x7f5714793cd0
2018-01-17 13:33:53.393452: I tensorflow/stream_executor/stream.cc:4637] stream 0x605b5f0 did not memcpy host-to-device; source: 0x7f5930027a90
2018-01-17 13:33:53.393470: E tensorflow/stream_executor/stream.cc:306] Error recording event in stream: error recording CUDA event on stream 0x605b690: CUDA_ERROR_ILLEGAL_ADDRESS; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2018-01-17 13:33:53.393506: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-17 13:33:53.393515: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
Command terminated by signal 6
29.28user 14.32system 0:40.50elapsed 107%CPU (0avgtext+0avgdata 4591788maxresident)k
104inputs+3208outputs (0major+897254minor)pagefaults 0swaps
I think you may want to get a GPU with more memory? GPU usage can change while you train, because each image comes in a different size.
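As a related note (not from this thread), a minimal sketch of how one might keep TensorFlow 1.x from reserving the whole card up front, assuming you can edit the spot where the training session is created; `allow_growth` and `per_process_gpu_memory_fraction` are standard `tf.ConfigProto` GPU options:

```python
import tensorflow as tf

# Sketch only: build the session with explicit GPU memory options instead of
# letting the BFC allocator claim nearly all free memory at startup.
tfconfig = tf.ConfigProto(allow_soft_placement=True)
# Allocate GPU memory on demand as tensors are created.
tfconfig.gpu_options.allow_growth = True
# Or hard-cap the fraction of GPU memory this process may use:
# tfconfig.gpu_options.per_process_gpu_memory_fraction = 0.8

sess = tf.Session(config=tfconfig)
```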
@endernewton I got this error with the same GPU, but some time ago everything worked great. And I'm talking about detection on unlabeled COCO images, not about training. I don't know the reason for this issue...
I have a GPU memory error too, and I have had no idea how to fix it for a long time. My GPU is a 1060. I have run the test script successfully as well, and I have tried lowering the batch size, but that didn't fix the error either...
@endernewton @ScottSiegel
Change BATCH_SIZE: and RPN_BATCHSIZE: in experiments/res101.yml.
Sir/madam, does it really work?
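For what it's worth, here is a hedged sketch of what the batch-size suggestion above amounts to, assuming the repository's lib/model/config.py exposes cfg_from_file and cfg_from_list the way tools/trainval_net.py uses them; the values are illustrative, not recommendations:

```python
from model.config import cfg, cfg_from_file, cfg_from_list

# Load the experiment config, then lower the region-sampling batch sizes,
# which reduces how many ROIs/anchors are processed per image.
cfg_from_file('experiments/cfgs/vgg16.yml')
cfg_from_list(['TRAIN.BATCH_SIZE', '16', 'TRAIN.RPN_BATCHSIZE', '16'])

print(cfg.TRAIN.BATCH_SIZE, cfg.TRAIN.RPN_BATCHSIZE)
```

The same override can also be appended to the --set list on the training command shown at the top of the thread, since trainval_net.py passes those arguments through cfg_from_list.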