
Multi-GPU training hangs

Open stephanie-fu opened this issue 2 years ago • 7 comments

Command run: bash ./tools/dist_train.sh configs/carafe/mask_rcnn_r50_fpn_carafe_1x_coco.py 8

No error is reported, but training hangs after the output 2022-05-12 22:06:13,674 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration. and GPU utilization stays around ~90% (continuing to fluctuate) until the job is killed.

PyTorch version: 1.7.1; CUDA version: 11.6

I haven't been able to track down this exact problem in previous issues, but it seems like training getting stuck is a known issue in general?
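One way to narrow this down is to check whether a plain NCCL all-reduce also hangs on the same machine, independent of mmdetection. The sketch below is a hypothetical standalone script (not part of the repo), assuming PyTorch 1.7.1 and the same torch.distributed.launch mechanism that dist_train.sh uses; if this also stalls, the problem is likely in NCCL / the driver / the GPU topology rather than in the config.

```python
# nccl_smoke_test.py -- hypothetical helper script, not part of mmdetection.
# Checks whether a plain NCCL all-reduce completes on this machine.
# Launch it the same way dist_train.sh launches training, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=8 nccl_smoke_test.py
import argparse

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch (PyTorch 1.7) passes --local_rank to each process.
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set by the launcher.
    dist.init_process_group(backend='nccl')

    # Each rank contributes its rank; after all_reduce every rank should print
    # the same sum (28 for 8 GPUs). If this never prints, NCCL itself is stuck.
    tensor = torch.full((1,), float(dist.get_rank()), device='cuda')
    dist.all_reduce(tensor)
    print('rank %d sees sum %.0f' % (dist.get_rank(), tensor.item()))


if __name__ == '__main__':
    main()
```

Running this (and the original training command) with NCCL_DEBUG=INFO set in the environment usually prints more detail about which transport NCCL selects during initialization, which can help pinpoint where it stalls.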

stephanie-fu avatar May 13 '22 02:05 stephanie-fu

Same issue, but only with multi-GPU training.

Chop1 avatar May 19 '22 13:05 Chop1

Same issue while using tools/dist_train.sh configs/yolox/yolox_s_8x8_300e_coco.py 8 for multi-GPU training. No errors, but it stops outputting logs after 2022-07-05 13:49:18,605 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration. Have you solved it?

jayphone17 avatar Jul 05 '22 06:07 jayphone17
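Not a confirmed fix for this particular hang, but one thing commonly suggested when DDP training stalls right after the reducer-bucket message is enabling find_unused_parameters in the config; the mmdetection training code (in the 2.x series) reads this key when it wraps the model in MMDistributedDataParallel, so ranks that produce no gradient for some parameters do not block the reducer. A minimal sketch, using a hypothetical local config file name:

```python
# configs/carafe/my_mask_rcnn_carafe_ddp_debug.py -- hypothetical local config,
# not part of the repo. Inherits the original config and only adds the DDP flag.
_base_ = './mask_rcnn_r50_fpn_carafe_1x_coco.py'

# Passed through to MMDistributedDataParallel(find_unused_parameters=True);
# something to try for hangs like this, not a guaranteed fix.
find_unused_parameters = True
```

Then run bash ./tools/dist_train.sh configs/carafe/my_mask_rcnn_carafe_ddp_debug.py 8 as before and see whether the hang persists.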

Same issue here.

tuan97ta avatar Aug 20 '22 02:08 tuan97ta

Same issue, and there are no error reports before or after killing ./tools/dist_train.sh.

dwluo avatar Oct 24 '22 07:10 dwluo

same issue

ccccwb avatar Oct 28 '22 03:10 ccccwb

same issue

JamesZWalker avatar Dec 20 '22 07:12 JamesZWalker

same issue

yuhua666 avatar Dec 20 '22 08:12 yuhua666

+1

Lee-inso avatar Mar 06 '23 09:03 Lee-inso

Hi all, has anyone solved this?

OrangeSodahub avatar May 22 '23 14:05 OrangeSodahub