Co-DETR

torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Open GexLoong opened this issue 1 year ago • 6 comments

Hi, when I run sh tools/dist_train.sh projects/configs/co_dino_vit/co_dino_5scale_vit_large_coco.py 1, I get the error below. I saw that others have reported this error before as well. How can it be fixed? Thanks. [image]

GexLoong · Oct 03 '24

My environment:

  • python 3.7.16
  • torch 1.10.0
  • torchvision 0.10.0
  • mmcv-full 1.6.1 (torch 1.10 build)

GexLoong · Oct 03 '24

Hi, when I run sh tools/dist_train.sh projects/configs/co_dino_vit/co_dino_5scale_vit_large_coco.py 1, I get the error below. I saw that others have reported this error before as well. How can it be fixed? Thanks. [image]

Did you manage to solve it? I ran into the same error.

syxkk · Oct 09 '24

@MyGitHub-G From the error in the screenshot, it looks like the work-dir argument was not specified.

TempleX98 · Oct 09 '24
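For reference: assuming tools/dist_train.sh takes the work directory as its third positional argument, as in the command syxkk posts further down, an invocation that supplies it explicitly would look roughly like

    sh tools/dist_train.sh projects/configs/co_dino_vit/co_dino_5scale_vit_large_coco.py 1 path_to_exp

where path_to_exp is replaced by the desired output directory.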

@MyGitHub-G From the error in the screenshot, it looks like the work-dir argument was not specified.

[image] Hello, my case is actually different from his. Mine finishes the first training epoch and then fails on the last batch of the evaluation. I run the following command: sh tools/dist_train.sh projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py 2 path_to_exp. Environment: python=3.7.11, pytorch=1.11.0, cuda=11.3, mmcv-full=1.5.0; the GPUs are A100 40GB. Looking forward to your reply.

syxkk · Oct 09 '24

@MyGitHub-G From the error in the screenshot, it looks like the work-dir argument was not specified.

This probably isn't caused by a missing work-dir argument; if it is not specified, the code creates a default directory. It looks more like a runtime problem: if I skip the sh script and run train.py directly, it crashes with a core dump.

GexLoong · Oct 09 '24

Did you solve it? My run also finishes one training epoch and then errors out during evaluation:

mmdet - INFO - Saving checkpoint at 1 epochs
[ ] 0/5000, elapsed: 0s, ETA:
Traceback (most recent call last):
  File "/home/Co-DETR-main/tools/train.py", line 245, in <module>
    main()
  File "/home/Co-DETR-main/tools/train.py", line 234, in main
    train_detector(
  File "/home/Co-DETR-main/mmdet/apis/train.py", line 245, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/Co-DETR-main/mmdet/core/evaluation/eval_hooks.py", line 126, in _do_evaluate
    results = multi_gpu_test(
  File "/home/Co-DETR-main/mmdet/apis/test.py", line 109, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
    return old_func(*args, **kwargs)
  File "/home/Co-DETR-main/mmdet/models/detectors/base.py", line 174, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/Co-DETR-main/mmdet/models/detectors/base.py", line 137, in forward_test
    img_meta[img_id]['batch_input_shape'] = tuple(img.size()[-2:])
TypeError: 'DataContainer' object is not subscriptable
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2163076) of binary: /home/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

adjawdka · Jan 04 '25
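A note on the TypeError itself: the traceback shows the evaluation batch reaching forward_test through torch.nn.parallel.distributed directly, so img_metas arrives still wrapped in mmcv DataContainer objects rather than as the plain lists of per-image dicts that forward_test indexes. The sketch below only illustrates that type mismatch; unwrap_img_metas is a hypothetical helper, not part of Co-DETR or mmcv, and matching the mmcv-full / PyTorch versions recommended in the repo's installation instructions is usually the cleaner route.

    from mmcv.parallel import DataContainer

    def unwrap_img_metas(img_metas):
        # Hypothetical helper (not part of Co-DETR): if the test-time metas
        # are still wrapped in a DataContainer, i.e. they were never scattered
        # by MMDistributedDataParallel, return the plain per-image dict list.
        if isinstance(img_metas, DataContainer):
            # DataContainer.data holds the collated payload: a list with one
            # entry per GPU, each entry a list of per-image meta dicts.
            return img_metas.data[0]
        return img_metas

    # For example, before the failing subscript in a forward_test-style method
    # one could guard with:
    #     img_meta = unwrap_img_metas(img_meta)
    # (the image tensor may need the same unwrapping if it also arrives as a
    # DataContainer).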