ERROR:torch.distributed.elastic.multiprocessing.api:failed
I wrote my own dataset class and dataloader, and while training with mmcv.runner I get the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)". I cannot locate the root cause from this error report. How can I resolve this issue?
Could you please provide more details about your problem, such as your training platform info and a full error log?
sys.platform: linux
Python: 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.10.0
TorchVision: 0.11.1+cu113
OpenCV: 4.5.5
MMCV: 1.5.0
MMCV Compiler: GCC 8.3
MMCV CUDA Compiler: 11.3
MMSegmentation: 0.21.1+6585937
error_log:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024640 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024641 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024652 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024661 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024662 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 7 (pid: 2024663) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: tools/train.py FAILED
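Exit code -11 means the worker process was killed by a segmentation fault, and the elastic launcher only reports that the child died, not where. Since the failure involves a custom dataset, one way to localize it is to build and iterate the dataset in a single process, outside the distributed launcher. Below is a minimal sketch, assuming the mmseg 0.x API; the config path is a hypothetical placeholder for the config that registers the custom dataset:

```python
# Minimal single-process check for a custom dataset (mmseg 0.x style).
# 'configs/my_custom_config.py' is a placeholder path, not a real file.
from mmcv import Config
from mmseg.datasets import build_dataset

cfg = Config.fromfile('configs/my_custom_config.py')
dataset = build_dataset(cfg.data.train)
print(f'dataset length: {len(dataset)}')

# Pull a few samples directly; a crash or exception here points at the
# custom dataset or its pipeline rather than at torch.distributed itself.
for i in range(min(10, len(dataset))):
    sample = dataset[i]
    print(i, {k: type(v) for k, v in sample.items()})
```

If the dataset iterates cleanly, the crash is more likely in the dataloader workers, shared memory, or NCCL communication than in the dataset code.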
Did you start the program with tools/dist_train.sh and specify the number of GPUs?
Yes, I start it with tools/dist_train.sh and the number of GPUs is 8.
Can you train properly with one of the original configs in mmseg, for example pspnet_r50 on ADE20K?
I have the same error, but I am able to run tools/dist_train.sh configs/pspnet/pspnet_r101-d8_512x512_80k_ade20k.py 1 with a single GPU.
I have the same error
I have the same error:

NET/IB : Got completion from peer 172.16.0.88<47857> with error 12, opcode 32699, len 0, vendor err 129
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 145 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 146) of binary: /usr/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 300.0204758644104 seconds
Try reducing num_processes or batch_size. This is how I solved a similar problem.
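In mmseg configs the per-GPU batch size and dataloader worker count are set through samples_per_gpu and workers_per_gpu. A minimal sketch of that override, assuming an mmseg 0.x style config (the values are illustrative, not recommendations):

```python
# Illustrative override in the training config (mmseg 0.x style).
# When the config inherits from a _base_ config, this dict is merged with
# the existing `data` dict, so the train/val/test settings are preserved.
data = dict(
    samples_per_gpu=1,  # batch size per GPU; lower this first if memory is the issue
    workers_per_gpu=1,  # dataloader worker processes per GPU
)
```

Reducing the number of GPUs passed to tools/dist_train.sh similarly reduces the number of launched processes.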