mmcv
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)
I wrote my own dataset class and dataloader, and while training with mmcv.runner I get the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)". I cannot locate the root cause from this error report. How can I resolve this issue?
Hi, could you paste the full error report?
sys.platform: linux
Python: 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.10.0
TorchVision: 0.11.1+cu113
OpenCV: 4.5.5
MMCV: 1.5.0
MMCV Compiler: GCC 8.3
MMCV CUDA Compiler: 11.3
MMSegmentation: 0.21.1+6585937
error_log:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024640 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024641 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024652 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024661 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024662 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 7 (pid: 2024663) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: tools/train.py FAILED
Hmm, actually the stack trace doesn't seem to give any information related to mmcv.
First, please make sure that mmcv-full is installed correctly and that its version matches your CUDA version; see this issue for more detail.
If you're sure CUDA is not to blame, could you please paste your:
- running command
- training config
I believe that could help us solve the problem.
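If it helps, a quick way to double-check the version point above is to print the environment that mmcv itself collects. This is a minimal sketch, assuming mmcv 1.x, which provides mmcv.utils.collect_env:

```python
# Minimal sketch: print the environment as mmcv sees it, so the CUDA version
# mmcv-full was compiled with can be compared against the runtime
# PyTorch/CUDA versions. Assumes mmcv 1.x (mmcv.utils.collect_env).
from mmcv.utils import collect_env

for name, value in collect_env().items():
    print(f'{name}: {value}')
```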
1. Running command: bash scripts/dist_train.sh 0,1,3,4,5,6,7 configs/base_nir/deeplabv3plus_r101.py
2. The training config is too complex to paste.
OK, but pasting the training config may still help us locate potential bugs and solve the problem. Did checking the CUDA version help?
I met the same problem; it happens only when the dataset is very large, e.g. Objects365 or BigDetection. Small datasets such as COCO do not cause this problem. Hope this helps with debugging.
ERROR:torch.distributed.elastic.multiprocessing.api:failed
I also met the same problem. Any help?
@FuNian788 @ywdong hi, can you paste your training config?
I faced the same error when trying to train on a small dataset. I really don't know what exactly the issue is. Any help, please?

I have met the same problem when the dataset is too large.
same problem here
I have the same issue.
same problem

Same issue when dataset is too large.
I had a similar issue: training worked when I used 1% of my data (the full set is 11 GB). How would one go about training on the full dataset?
Hi, thanks for your report. We are trying to reproduce the error.
same problem here
Hello everybody, I met the same problem, and I finally found the key to it.
If you set workers_per_gpu to 0, you will get the same error log as @JayQine. Otherwise, you will also receive a CUDA error: 'cuDNN error: CUDNN_STATUS_NOT_INITIALIZED'.
Both are caused by OOM. To verify this, run dmesg -T | grep -E -i -B100 'killed process' in your terminal; it will show why the process was terminated.
If you want to avoid this issue, you can reduce your batch size and run 'top' in your terminal to monitor memory usage.
And if you want to solve the problem completely, I think you need to change the way your data is preprocessed and loaded.
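For reference, reducing the batch size usually means editing the data section of the config. Here is a minimal sketch of what that looks like in an mmsegmentation/mmdetection-style config; the values are illustrative, not recommendations:

```python
# Sketch of the data section of an mmsegmentation/mmdetection-style config.
# samples_per_gpu is the per-GPU batch size; workers_per_gpu is the number of
# dataloader worker processes per GPU. Values are illustrative only, and the
# train/val/test dataset entries are elided.
data = dict(
    samples_per_gpu=2,   # lower this if each process runs out of memory
    workers_per_gpu=2,   # lower this (even to 0) to reduce host RAM usage
)
```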
You need to set the launcher in init_dist(launcher, backend) according to how you start your program. I set it to 'pytorch'.
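For example, when launching with torch.distributed.launch or torchrun, the call could look like this (a minimal sketch using mmcv.runner.init_dist from mmcv 1.x):

```python
# Minimal sketch: initialize distributed training for the 'pytorch' launcher,
# i.e. when the script is started via torch.distributed.launch or torchrun,
# which set the RANK/WORLD_SIZE/LOCAL_RANK environment variables.
from mmcv.runner import init_dist

init_dist('pytorch', backend='nccl')
```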
I had a similar issue
Maybe you can try resetting the timeout: dist.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=5400)).
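Spelled out with its imports, that suggestion looks like the sketch below; 5400 seconds is just the value from the comment, not a required setting:

```python
# Sketch of the suggestion above: initialize the default process group with a
# longer timeout, so a slow rank (e.g. one scanning a huge dataset) does not
# trip the collective-operation watchdog prematurely.
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(seconds=5400),  # value taken from the comment above
)
```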
I have the same issue; did you resolve it?
Same issue here when loading a very large model.
I also met this failure, and I think I found the solution! It was a mismatch between torchvision and torch: with torchvision 0.11.2+cu102 and torch 1.10.1+cu111 I got the error, but after installing torchvision 0.11.2+cu111 the problem was fixed. Hope my advice helps you.
Hi, thanks for your report. We are trying to reproduce the error.
Hi, I encountered this problem when training on the iSAID dataset. Since there are more than 10,000 images in the validation set, memory fills up when the number of predictions reaches 6000-7000. It seems that the memory occupied by the predicted images is not released after prediction completes.
I reproduced the error and found that it is related to OOM. An intuitive solution is to lower the batch size on each GPU. During distributed training, one process is killed because of OOM; as a result, the overall training exits and raises the ERROR mentioned above.
I think this phenomenon is highly related to https://github.com/pytorch/pytorch/issues/13246, especially the discussion of copy-on-read overhead in that issue. For mmcv users, I think the new mmengine fixes the problem; please see the doc here.
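To illustrate the copy-on-read issue from that thread: keeping dataset metadata as a large Python list of dicts means every forked dataloader worker touches object refcounts and gradually copies the memory. A common workaround, sketched below (this is an illustration, not mmcv's or mmengine's actual implementation), is to serialize the records into flat numpy buffers once and deserialize per item:

```python
# Hedged sketch of the copy-on-read workaround discussed in
# pytorch/pytorch#13246: store per-sample records as one flat numpy byte
# buffer plus an offset array, so forked dataloader workers share the pages
# read-only instead of duplicating a large list of Python objects.
import pickle

import numpy as np
from torch.utils.data import Dataset


class PackedRecordDataset(Dataset):
    def __init__(self, records):
        # Serialize each record (e.g. an annotation dict) to bytes.
        packed = [pickle.dumps(r, protocol=pickle.HIGHEST_PROTOCOL) for r in records]
        self._addr = np.cumsum([len(p) for p in packed])       # end offsets
        self._data = np.frombuffer(b''.join(packed), dtype=np.uint8)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        record = pickle.loads(self._data[start:end].tobytes())
        # ...load the image and build the training sample from `record` here...
        return record
```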
Add find_unused_parameters=True to your config file.
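If it helps, in mmdetection/mmsegmentation-style configs this is (to my knowledge) a top-level option:

```python
# Top-level config option in mmdetection/mmsegmentation-style configs; it is
# forwarded to the DistributedDataParallel wrapper during training.
find_unused_parameters = True
```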
The real error is often buried in dense warnings.

Same problem. I found that a specific model (parameter count) combined with a specific batch size triggers the error (ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40349)); changing the batch size can fix this.
I got the same error, but I noticed empty records in my JSON annotation file. Removing them solved the problem for me.
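For anyone hitting the same thing, here is a rough sketch of filtering such records out of a COCO-style annotation file. The file names and the notion of an "empty record" (empty bbox/segmentation) are assumptions on my part, so adapt the filter to your own format:

```python
# Rough, hypothetical sketch: drop annotation entries with empty geometry from
# a COCO-style JSON file. What counts as an "empty record" is an assumption;
# adjust the predicate to whatever is actually empty in your annotations.
import json

with open('annotations.json') as f:                 # hypothetical input path
    coco = json.load(f)

coco['annotations'] = [
    ann for ann in coco['annotations']
    if ann.get('bbox') and ann.get('segmentation')  # keep only non-empty entries
]

with open('annotations_clean.json', 'w') as f:      # hypothetical output path
    json.dump(coco, f)
```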