ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)

Open JayQine opened this issue 3 years ago • 46 comments

I wrote my own dataset class and dataloader, and while training with mmcv.runner I get the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)". I cannot locate the root cause from this error report. How can I resolve this issue?

JayQine avatar May 18 '22 07:05 JayQine

Hi, could you paste the full error report?

imabackstabber avatar May 18 '22 09:05 imabackstabber

sys.platform: linux
Python: 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.10.0
TorchVision: 0.11.1+cu113
OpenCV: 4.5.5
MMCV: 1.5.0
MMCV Compiler: GCC 8.3
MMCV CUDA Compiler: 11.3
MMSegmentation: 0.21.1+6585937

error_log:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024640 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024641 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024652 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024661 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024662 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 7 (pid: 2024663) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: tools/train.py FAILED

JayQine avatar May 18 '22 14:05 JayQine

Hmm, actually the stack trace doesn't seem to point at mmcv at all. First, make sure you installed mmcv-full correctly and that the mmcv-full build matches your CUDA version (a quick check is sketched below); see this issue for more detail. If you're sure CUDA is not to blame, could you please paste your:

  1. running command
  2. training config

I believe that could help us solve the problem.
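
For the version check, one option is mmcv's collect_env helper (available in mmcv 1.x), which prints the CUDA and compiler versions mmcv-full was built against alongside the runtime environment:

```python
# Print the environment mmcv sees, including PyTorch, CUDA_HOME, and the
# compiler/CUDA versions the installed mmcv-full was compiled with.
from mmcv.utils import collect_env

for name, value in collect_env().items():
    print(f'{name}: {value}')
```

If "MMCV CUDA Compiler" disagrees with the CUDA version PyTorch was built with, reinstalling mmcv-full for the matching CUDA/torch combination is usually the fix.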

imabackstabber avatar May 19 '22 02:05 imabackstabber

  1. running command: bash scripts/dist_train.sh 0,1,3,4,5,6,7 configs/base_nir/deeplabv3plus_r101.py
  2. the training config is too complex to paste here.

JayQine avatar May 20 '22 15:05 JayQine

OK, but pasting the training config may still help us locate potential bugs, so please share it if you can. Did checking the CUDA version help?

imabackstabber avatar May 23 '22 06:05 imabackstabber

I met the same problem. It happens only when the dataset is very large, e.g. Objects365 or BigDetection; small datasets such as COCO do not trigger it. Hope this helps with debugging.

FuNian788 avatar Jun 21 '22 06:06 FuNian788

ERROR:torch.distributed.elastic.multiprocessing.api:failed

I also met the same problem. Any help?

ywdong avatar Jun 27 '22 02:06 ywdong

@FuNian788 @ywdong hi, can you paste your training config?

imabackstabber avatar Jun 27 '22 03:06 imabackstabber

I faced the same error when trying to train on a small dataset. I really don't know what exactly the issue is; any help please?

The error occurs in Soft Teacher.

alaa-shubbak avatar Jul 06 '22 16:07 alaa-shubbak

I have met the same problem when the dataset is too large.

jianlong-yuan avatar Jul 15 '22 08:07 jianlong-yuan

same problem here

CarloSaccardi avatar Jul 20 '22 14:07 CarloSaccardi

I have the same issue.

AmrElsayed14 avatar Jul 30 '22 20:07 AmrElsayed14

Same problem here.

RenyunLi0116 avatar Aug 03 '22 20:08 RenyunLi0116

Same issue when the dataset is too large.

ataey avatar Aug 06 '22 07:08 ataey

I had a similar issue. Training worked when I used 1% of my data, which is 11 GB. How would one go about training on the full, larger set?

aissak21 avatar Aug 09 '22 19:08 aissak21

Hi, thanks for your report. We are trying to reproduce the error.

zhouzaida avatar Aug 10 '22 02:08 zhouzaida

same problem here

piglaker avatar Aug 16 '22 03:08 piglaker

Hello everybody, I met the same problem, and I finally found the key to it. If you set workers_per_gpu to 0, you will get the same error log as @JayQine; otherwise you will get an additional CUDA error, 'cuDNN error: CUDNN_STATUS_NOT_INITIALIZED'. Both are caused by OOM. To confirm this, run dmesg -T | grep -E -i -B100 'killed process' in your terminal and you will see why the process was terminated.

If you want to avoid the issue, reduce your batch size and run top in your terminal to monitor memory usage.

To solve the problem completely, you should change the way your data is preprocessed and loaded.
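
As a concrete illustration, in an MMSegmentation/MMDetection-style config the relevant knobs live in the data section; a minimal sketch with illustrative values (the right numbers depend on your GPU and host memory):

```python
# Sketch of the data section of an mmseg/mmdet-style config.
# samples_per_gpu controls GPU memory per process; workers_per_gpu
# controls how much host RAM the dataloader worker processes consume.
data = dict(
    samples_per_gpu=2,   # e.g. halve the previous value if you hit OOM
    workers_per_gpu=2,   # keep it > 0 so data loading overlaps training
    # train=..., val=..., test=... stay exactly as in your original config
)
```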

yiyexy avatar Aug 19 '22 13:08 yiyexy

You need to set the launcher in init_dist(launcher, backend) according to how you launch your program. I set it to 'pytorch'.
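
In mmcv 1.x that helper lives in mmcv.runner; when launching through torch.distributed (as tools/dist_train.sh does), a minimal sketch looks like:

```python
# Initialise distributed training with the PyTorch launcher before
# building the runner; init_dist wraps torch.distributed.init_process_group.
from mmcv.runner import init_dist

init_dist('pytorch', backend='nccl')
```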

Nomi-Q avatar Aug 20 '22 09:08 Nomi-Q

I had a similar issue

haixiongli avatar Nov 03 '22 12:11 haixiongli

Maybe you can try increasing the timeout: dist.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=5400)).
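
A self-contained version of that suggestion, assuming the script is started by the PyTorch launcher (which sets the RANK/WORLD_SIZE/MASTER_* environment variables):

```python
# Initialise the default process group with a longer timeout so that a
# slow rank (e.g. one doing a long evaluation) does not trip the watchdog.
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend='nccl',
    init_method='env://',                      # read rank/world size from env vars
    timeout=datetime.timedelta(seconds=5400),  # default is 30 minutes
)
```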

lh4027 avatar Nov 06 '22 05:11 lh4027

I have the same issue. Did you resolve it?

gihwan-kim avatar Nov 25 '22 07:11 gihwan-kim

Same issue here when loading a very large model.

allanj avatar Dec 26 '22 05:12 allanj

I also met this failure, and I think I found the solution! In my case it was a mismatch between torchvision and torch: with torchvision 0.11.2 built for CUDA 10.2 and torch 1.10.1 built for CUDA 11.1 I got the error, but after installing torchvision 0.11.2 built for CUDA 11.1 the problem was fixed. Hope my advice helps you.
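
A quick way to spot that kind of mismatch is to print the versions and the CUDA build of each package:

```python
# Check that torch and torchvision come from matching CUDA builds.
import torch
import torchvision

print('torch       :', torch.__version__, '(built with CUDA', torch.version.cuda, ')')
print('torchvision :', torchvision.__version__)
# For pip wheels the CUDA tag is part of the version string, e.g.
# torch 1.10.1+cu111 paired with torchvision 0.11.2+cu102 is a mismatch.
```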

yitianlian avatar Dec 27 '22 02:12 yitianlian

Hi, thanks for your report. We are trying to reproduce the error.

Hi, I encountered this problem when training on the iSAID dataset. Since there are more than 10,000 images in the validation set, memory fills up when the number of predictions reaches about 6000-7000. It seems the memory occupied by the predicted images is not released after prediction completes.

stdcoutzrh avatar Jan 03 '23 07:01 stdcoutzrh

I reproduced the error and found that it is related to OOM. An intuitive solution is to lower the batch_size on each GPU. During distributed training, one process exits because of OOM, and as a result the overall training job exits and raises the error mentioned above.

walsvid avatar Jan 17 '23 08:01 walsvid

I think this phenomenon is highly related to https://github.com/pytorch/pytorch/issues/13246, especially the discussion of copy-on-read overhead in that issue. For mmcv users, I think the new mmengine fixes the problem; please see the doc here.
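
The workaround usually suggested in that PyTorch issue is to keep per-sample metadata in flat numpy buffers rather than in millions of small Python objects, so forked dataloader workers share the memory read-only instead of gradually dirtying it through refcount updates. A hedged sketch of the idea with a hypothetical dataset class (not mmcv's actual implementation):

```python
# Store annotation strings as one flat byte buffer plus an offset array.
# Forked dataloader workers then read the numpy buffers without touching
# Python object refcounts, avoiding the copy-on-read memory blow-up.
import numpy as np
from torch.utils.data import Dataset


class PackedAnnotationDataset(Dataset):  # hypothetical example class
    def __init__(self, annotations):
        # annotations: list of strings (e.g. one JSON record per sample)
        encoded = [a.encode('utf-8') for a in annotations]
        self._offsets = np.cumsum([0] + [len(e) for e in encoded])
        self._buffer = np.frombuffer(b''.join(encoded), dtype=np.uint8)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, idx):
        start, end = self._offsets[idx], self._offsets[idx + 1]
        return bytes(self._buffer[start:end]).decode('utf-8')
```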

walsvid avatar Jan 18 '23 05:01 walsvid

Add find_unused_parameters=True to the config file.

The actual error message was buried among dense warnings.
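
In mmdet/mmseg-style configs this is typically a top-level option that the train script forwards to MMDistributedDataParallel; a minimal sketch:

```python
# Top-level config option: lets DistributedDataParallel tolerate
# parameters that receive no gradient in a given iteration.
find_unused_parameters = True
```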

xu19971109 avatar Feb 22 '23 10:02 xu19971109

Same problem. I found that certain models (certain parameter counts) combined with certain batch sizes hit the error (ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40349)); changing the batch size fixes it.

ggjy avatar Mar 23 '23 16:03 ggjy

I got the same error, but I noticed empty records in the annotation JSON; removing them solved the problem for me.
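
A quick sanity check for that case, assuming a COCO-style annotation file (the path and field names below are illustrative):

```python
# Flag images without annotations and annotations with degenerate boxes
# in a COCO-style JSON file.
import json

with open('annotations/instances_train.json') as f:  # hypothetical path
    coco = json.load(f)

annotated_ids = {a['image_id'] for a in coco['annotations']}
empty_images = [img['id'] for img in coco['images'] if img['id'] not in annotated_ids]
bad_boxes = [a['id'] for a in coco['annotations']
             if len(a.get('bbox', [])) != 4 or a['bbox'][2] <= 0 or a['bbox'][3] <= 0]

print(len(empty_images), 'images without annotations')
print(len(bad_boxes), 'annotations with empty or degenerate bboxes')
```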

aqppe avatar Jul 13 '23 05:07 aqppe