
torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Open Majiawei opened this issue 2 years ago • 11 comments

When training mmdet 3.x on a single machine with multiple GPUs, this distributed-training error is reported every time after the third epoch. How can I solve this problem?

WeChat Image_20230711102619

Majiawei avatar Jul 11 '23 02:07 Majiawei

I have the same issue.

nattametee007 avatar Jul 16 '23 05:07 nattametee007

I have the same issue.

I also have the same question. Have you solved it?

Younger330 avatar Jul 28 '23 06:07 Younger330

I have the same issue.

Zhihao-z avatar Aug 03 '23 19:08 Zhihao-z

Me too

Abdulmalik0x avatar Aug 05 '23 18:08 Abdulmalik0x

the same error!!!

happybear1015 avatar Aug 11 '23 00:08 happybear1015

same

RYHSmmc avatar Nov 01 '23 03:11 RYHSmmc

same

zhaozhen2333 avatar Nov 14 '23 13:11 zhaozhen2333

How can this be solved?

Bradly-s avatar Feb 26 '24 06:02 Bradly-s

I also have the same question. Have you solved it?

wfq007 avatar Mar 14 '24 14:03 wfq007

Try https://github.com/pytorch/pytorch/issues/121222

flytocc avatar Mar 18 '24 14:03 flytocc

Related issue: https://github.com/open-mmlab/mmdetection/issues/6934#issuecomment-1066255179

wenshinlee avatar May 28 '25 06:05 wenshinlee