ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)

Open JayQine opened this issue 3 years ago • 46 comments

I wrote my own dataset class and dataloader, and while training with mmcv.runner I get the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)". I cannot locate the root cause from this error report. How can I resolve this issue?

JayQine avatar May 18 '22 07:05 JayQine

Hi, could you paste the full error report?

imabackstabber avatar May 18 '22 09:05 imabackstabber

sys.platform: linux
Python: 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.10.0
TorchVision: 0.11.1+cu113
OpenCV: 4.5.5
MMCV: 1.5.0
MMCV Compiler: GCC 8.3
MMCV CUDA Compiler: 11.3
MMSegmentation: 0.21.1+6585937

error_log:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024640 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024641 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024652 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024661 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024662 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 7 (pid: 2024663) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: tools/train.py FAILED

JayQine avatar May 18 '22 14:05 JayQine

Hmm, actually the stack trace doesn't seem to point at mmcv at all. First, make sure you installed mmcv-full correctly and that the mmcv-full build matches your CUDA version (a quick check is sketched below); see this issue for more detail. If you're sure CUDA is not to blame, could you please paste your:

  1. running command
  2. training config

I believe that could help us solve the problem.
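
For the version check, one option is mmcv's collect_env helper (available in mmcv 1.x), which prints the CUDA and compiler versions mmcv-full was built against alongside the runtime environment:

```python
# Print the environment mmcv sees, including PyTorch, CUDA_HOME, and the
# compiler/CUDA versions the installed mmcv-full was compiled with.
from mmcv.utils import collect_env

for name, value in collect_env().items():
    print(f'{name}: {value}')
```

If "MMCV CUDA Compiler" disagrees with the CUDA version PyTorch was built with, reinstalling mmcv-full for the matching CUDA/torch combination is usually the fix.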

imabackstabber avatar May 19 '22 02:05 imabackstabber

  1. running command: bash scripts/dist_train.sh 0,1,3,4,5,6,7 configs/base_nir/deeplabv3plus_r101.py
  2. the training config is too complex to paste here.

JayQine avatar May 20 '22 15:05 JayQine

OK, but pasting the training config may still help us locate potential bugs, so please share it if you can. Did checking the CUDA version help?

imabackstabber avatar May 23 '22 06:05 imabackstabber

I met the same problem. It happens only when the dataset is very large, e.g. Objects365 or BigDetection; small datasets such as COCO do not trigger it. Hope this helps with debugging.

FuNian788 avatar Jun 21 '22 06:06 FuNian788

ERROR:torch.distributed.elastic.multiprocessing.api:failed

I also met the same problem. Any help?

ywdong avatar Jun 27 '22 02:06 ywdong

@FuNian788 @ywdong hi, can you paste your training config?

imabackstabber avatar Jun 27 '22 03:06 imabackstabber

I faced the same error when trying to train on a small dataset. I really don't know what exactly the issue is; any help please?

The error occurs in Soft Teacher.

alaa-shubbak avatar Jul 06 '22 16:07 alaa-shubbak

I have met the same problem when the dataset is too large.

jianlong-yuan avatar Jul 15 '22 08:07 jianlong-yuan

same problem here

CarloSaccardi avatar Jul 20 '22 14:07 CarloSaccardi

I have the same issue.

AmrElsayed14 avatar Jul 30 '22 20:07 AmrElsayed14

Same problem here.

RenyunLi0116 avatar Aug 03 '22 20:08 RenyunLi0116

Same issue when the dataset is too large.

ataey avatar Aug 06 '22 07:08 ataey

I had a similar issue. Training worked when I used 1% of my data, which is 11 GB. How would one go about training on the full, larger set?

aissak21 avatar Aug 09 '22 19:08 aissak21

Hi, thanks for your report. We are trying to reproduce the error.

zhouzaida avatar Aug 10 '22 02:08 zhouzaida

same problem here

piglaker avatar Aug 16 '22 03:08 piglaker

Hello everybody, I met the same problem, and I finally found the key to it. If you set workers_per_gpu to 0, you will get the same error log as @JayQine; otherwise you will get an additional CUDA error, 'cuDNN error: CUDNN_STATUS_NOT_INITIALIZED'. Both are caused by OOM. To confirm this, run dmesg -T | grep -E -i -B100 'killed process' in your terminal and you will see why the process was terminated.

If you want to avoid the issue, reduce your batch size and run top in your terminal to monitor memory usage.

To solve the problem completely, you should change the way your data is preprocessed and loaded.
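
As a concrete illustration, in an MMSegmentation/MMDetection-style config the relevant knobs live in the data section; a minimal sketch with illustrative values (the right numbers depend on your GPU and host memory):

```python
# Sketch of the data section of an mmseg/mmdet-style config.
# samples_per_gpu controls GPU memory per process; workers_per_gpu
# controls how much host RAM the dataloader worker processes consume.
data = dict(
    samples_per_gpu=2,   # e.g. halve the previous value if you hit OOM
    workers_per_gpu=2,   # keep it > 0 so data loading overlaps training
    # train=..., val=..., test=... stay exactly as in your original config
)
```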

yiyexy avatar Aug 19 '22 13:08 yiyexy

You need to set the launcher in init_dist(launcher, backend) according to how you launch your program. I set it to 'pytorch'.
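
In mmcv 1.x that helper lives in mmcv.runner; when launching through torch.distributed (as tools/dist_train.sh does), a minimal sketch looks like:

```python
# Initialise distributed training with the PyTorch launcher before
# building the runner; init_dist wraps torch.distributed.init_process_group.
from mmcv.runner import init_dist

init_dist('pytorch', backend='nccl')
```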

Nomi-Q avatar Aug 20 '22 09:08 Nomi-Q

I had a similar issue

haixiongli avatar Nov 03 '22 12:11 haixiongli

Maybe you can try increasing the timeout: dist.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=5400)).
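
A self-contained version of that suggestion, assuming the script is started by the PyTorch launcher (which sets the RANK/WORLD_SIZE/MASTER_* environment variables):

```python
# Initialise the default process group with a longer timeout so that a
# slow rank (e.g. one doing a long evaluation) does not trip the watchdog.
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend='nccl',
    init_method='env://',                      # read rank/world size from env vars
    timeout=datetime.timedelta(seconds=5400),  # default is 30 minutes
)
```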

lh4027 avatar Nov 06 '22 05:11 lh4027

I have the same issue. Did you resolve it?

gihwan-kim avatar Nov 25 '22 07:11 gihwan-kim

Same issue here when loading a very large model.

allanj avatar Dec 26 '22 05:12 allanj

I also met this failure, and I think I found the solution! In my case it was a mismatch between torchvision and torch: with torchvision 0.11.2 built for CUDA 10.2 and torch 1.10.1 built for CUDA 11.1 I got the error, but after installing torchvision 0.11.2 built for CUDA 11.1 the problem was fixed. Hope my advice helps you.
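
A quick way to spot that kind of mismatch is to print the versions and the CUDA build of each package:

```python
# Check that torch and torchvision come from matching CUDA builds.
import torch
import torchvision

print('torch       :', torch.__version__, '(built with CUDA', torch.version.cuda, ')')
print('torchvision :', torchvision.__version__)
# For pip wheels the CUDA tag is part of the version string, e.g.
# torch 1.10.1+cu111 paired with torchvision 0.11.2+cu102 is a mismatch.
```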

yitianlian avatar Dec 27 '22 02:12 yitianlian

Hi, thanks for your report. We are trying to reproduce the error.

Hi, I encountered this problem when training on the iSAID dataset. Since there are more than 10,000 images in the validation set, memory fills up when the number of predictions reaches about 6000-7000. It seems the memory occupied by the predicted images is not released after prediction completes.

stdcoutzrh avatar Jan 03 '23 07:01 stdcoutzrh

I reproduced the error and found that it is related to OOM. An intuitive solution is to lower the batch_size on each GPU. During distributed training, one process exits because of OOM, and as a result the overall training job exits and raises the error mentioned above.

walsvid avatar Jan 17 '23 08:01 walsvid

I think this phenomenon is highly related to https://github.com/pytorch/pytorch/issues/13246, especially the discussion of copy-on-read overhead in that issue. For mmcv users, I think the new mmengine fixes the problem; please see the doc here.
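
The workaround usually suggested in that PyTorch issue is to keep per-sample metadata in flat numpy buffers rather than in millions of small Python objects, so forked dataloader workers share the memory read-only instead of gradually dirtying it through refcount updates. A hedged sketch of the idea with a hypothetical dataset class (not mmcv's actual implementation):

```python
# Store annotation strings as one flat byte buffer plus an offset array.
# Forked dataloader workers then read the numpy buffers without touching
# Python object refcounts, avoiding the copy-on-read memory blow-up.
import numpy as np
from torch.utils.data import Dataset


class PackedAnnotationDataset(Dataset):  # hypothetical example class
    def __init__(self, annotations):
        # annotations: list of strings (e.g. one JSON record per sample)
        encoded = [a.encode('utf-8') for a in annotations]
        self._offsets = np.cumsum([0] + [len(e) for e in encoded])
        self._buffer = np.frombuffer(b''.join(encoded), dtype=np.uint8)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, idx):
        start, end = self._offsets[idx], self._offsets[idx + 1]
        return bytes(self._buffer[start:end]).decode('utf-8')
```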

walsvid avatar Jan 18 '23 05:01 walsvid

Add find_unused_parameters=True to the config file.

The actual error message was buried among dense warnings.
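
In mmdet/mmseg-style configs this is typically a top-level option that the train script forwards to MMDistributedDataParallel; a minimal sketch:

```python
# Top-level config option: lets DistributedDataParallel tolerate
# parameters that receive no gradient in a given iteration.
find_unused_parameters = True
```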

xu19971109 avatar Feb 22 '23 10:02 xu19971109

Same problem. I found that certain models (certain parameter counts) combined with certain batch sizes hit the error (ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40349)); changing the batch size fixes it.

ggjy avatar Mar 23 '23 16:03 ggjy

I got the same error, but I noticed empty records in the annotation JSON; removing them solved the problem for me.
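
A quick sanity check for that case, assuming a COCO-style annotation file (the path and field names below are illustrative):

```python
# Flag images without annotations and annotations with degenerate boxes
# in a COCO-style JSON file.
import json

with open('annotations/instances_train.json') as f:  # hypothetical path
    coco = json.load(f)

annotated_ids = {a['image_id'] for a in coco['annotations']}
empty_images = [img['id'] for img in coco['images'] if img['id'] not in annotated_ids]
bad_boxes = [a['id'] for a in coco['annotations']
             if len(a.get('bbox', [])) != 4 or a['bbox'][2] <= 0 or a['bbox'][3] <= 0]

print(len(empty_images), 'images without annotations')
print(len(bad_boxes), 'annotations with empty or degenerate bboxes')
```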

aqppe avatar Jul 13 '23 05:07 aqppe