
Multi-GPU training hangs

Open stephanie-fu opened this issue 2 years ago • 7 comments

Command run: bash ./tools/dist_train.sh configs/carafe/mask_rcnn_r50_fpn_carafe_1x_coco.py 8

No error is reported, but training hangs after the output 2022-05-12 22:06:13,674 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration. and GPU utilization stays around ~90% (continuing to fluctuate) until the job is killed.

PyTorch version: 1.7.1; CUDA version: 11.6

I haven't been able to track down this exact problem in previous issues, but it seems like training getting stuck is a known issue in general?
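One way to narrow this down is to check whether a plain NCCL all-reduce also hangs on the same machine, independent of mmdetection. The sketch below is a hypothetical standalone script (not part of the repo), assuming PyTorch 1.7.1 and the same torch.distributed.launch mechanism that dist_train.sh uses; if this also stalls, the problem is likely in NCCL / the driver / the GPU topology rather than in the config.

```python
# nccl_smoke_test.py -- hypothetical helper script, not part of mmdetection.
# Checks whether a plain NCCL all-reduce completes on this machine.
# Launch it the same way dist_train.sh launches training, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=8 nccl_smoke_test.py
import argparse

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch (PyTorch 1.7) passes --local_rank to each process.
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set by the launcher.
    dist.init_process_group(backend='nccl')

    # Each rank contributes its rank; after all_reduce every rank should print
    # the same sum (28 for 8 GPUs). If this never prints, NCCL itself is stuck.
    tensor = torch.full((1,), float(dist.get_rank()), device='cuda')
    dist.all_reduce(tensor)
    print('rank %d sees sum %.0f' % (dist.get_rank(), tensor.item()))


if __name__ == '__main__':
    main()
```

Running this (and the original training command) with NCCL_DEBUG=INFO set in the environment usually prints more detail about which transport NCCL selects during initialization, which can help pinpoint where it stalls.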

stephanie-fu avatar May 13 '22 02:05 stephanie-fu

Same issue, but only with multi-GPU training.

Chop1 avatar May 19 '22 13:05 Chop1

Same issue while using tools/dist_train.sh configs/yolox/yolox_s_8x8_300e_coco.py 8 for multi-GPU training. No errors, but it stops outputting logs after 2022-07-05 13:49:18,605 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration. Have you solved it?

jayphone17 avatar Jul 05 '22 06:07 jayphone17
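Not a confirmed fix for this particular hang, but one thing commonly suggested when DDP training stalls right after the reducer-bucket message is enabling find_unused_parameters in the config; the mmdetection training code (in the 2.x series) reads this key when it wraps the model in MMDistributedDataParallel, so ranks that produce no gradient for some parameters do not block the reducer. A minimal sketch, using a hypothetical local config file name:

```python
# configs/carafe/my_mask_rcnn_carafe_ddp_debug.py -- hypothetical local config,
# not part of the repo. Inherits the original config and only adds the DDP flag.
_base_ = './mask_rcnn_r50_fpn_carafe_1x_coco.py'

# Passed through to MMDistributedDataParallel(find_unused_parameters=True);
# something to try for hangs like this, not a guaranteed fix.
find_unused_parameters = True
```

Then run bash ./tools/dist_train.sh configs/carafe/my_mask_rcnn_carafe_ddp_debug.py 8 as before and see whether the hang persists.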

Same issue here.

tuan97ta avatar Aug 20 '22 02:08 tuan97ta

Same issue, and there are no error reports before or after killing ./tools/dist_train.sh.

dwluo avatar Oct 24 '22 07:10 dwluo

same issue

ccccwb avatar Oct 28 '22 03:10 ccccwb

same issue

JamesZWalker avatar Dec 20 '22 07:12 JamesZWalker

same issue

yuhua666 avatar Dec 20 '22 08:12 yuhua666

+1

Lee-inso avatar Mar 06 '23 09:03 Lee-inso

Hi all, has anyone solved this?

OrangeSodahub avatar May 22 '23 14:05 OrangeSodahub