mmdetection
Multi-GPU training hangs
Command run:
bash ./tools/dist_train.sh configs/carafe/mask_rcnn_r50_fpn_carafe_1x_coco.py 8
No error is raised, but training hangs after the last line of output:
2022-05-12 22:06:13,674 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
GPU utilization stays around 90% (still fluctuating) until the processes are killed.
PyTorch version: 1.7.1 CUDA version: 11.6
I haven't been able to find this exact problem in previous issues, but training getting stuck in general seems to be a known issue?
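Not a fix, but one way to narrow the problem down: the sketch below (a standalone script; the file name and port are arbitrary, and it is not part of mmdetection) runs a single NCCL all_reduce across all visible GPUs without touching the training code. If this also hangs, the stall is in NCCL / driver / interconnect rather than in mmdetection; running it (or dist_train.sh) with NCCL_DEBUG=INFO set should then show where communication gets stuck.

```python
# ddp_check.py -- hypothetical standalone sanity check, not part of mmdetection.
# Verifies that a basic NCCL all_reduce completes across all local GPUs.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Rendezvous settings for the default env:// init method (port is arbitrary).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each rank contributes its rank id; after all_reduce every rank should
    # hold the sum 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)], device=f"cuda:{rank}")
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

If this completes on all 8 GPUs, the hang is more likely inside the training pipeline (e.g. dataloader workers or an uneven workload across ranks) than in the communication stack.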
Same issue, but only with multi-GPU training.
Same issue while using
tools/dist_train.sh configs/yolox/yolox_s_8x8_300e_coco.py 8
for multi-GPU training. No error, but the logs stop after:
2022-07-05 13:49:18,605 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
Have you solved it?
Same issue here.
Same issue, and there are no error reports before or after killing ./tools/dist_train.sh.
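Since nothing is printed when the processes are killed, one way to see where each rank is stuck (an assumed debugging step, not an official mmdetection feature) is to register a stack-dump signal handler near the top of the entry script (e.g. tools/train.py) and send SIGUSR1 to a hung worker:

```python
# Hypothetical addition to the training entry script: on `kill -USR1 <pid>`,
# print the Python traceback of every thread in that process to stderr.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```

This only shows the Python side of the hang (a stall inside NCCL itself just appears as a blocked collective call), but it at least tells you which call each rank is waiting in.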
same issue
same issue
same issue
+1
Hi all, has anyone solved it?