Restormer
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
Hi,
I've been trying to train the deraining model on your datasets for the past week, but every time I run the train.sh script, the data loaders are created and then I get the following error:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23058) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
basicsr/train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2022-04-17_01:06:49
  host       : instance-1.c.cs4705-hw4.internal
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 23058)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I've tried training this on Colab, GCP, and my local machine, and the only time it runs is when I train with num_gpus=1, at which point the ETA for a single epoch is 2 days.
Any help would be greatly appreciated.
Thanks!
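For what it's worth, the elastic report above only says that the worker died with exit code 1; the page it links to suggests decorating the training entrypoint with torch.distributed.elastic.multiprocessing.errors.record so the worker's real traceback gets written to an error file. A minimal sketch, assuming a simple main() entrypoint (illustrative only, not the actual entrypoint in basicsr/train.py):

# Sketch: @record makes torchelastic capture the worker's real traceback
# instead of reporting only "exitcode: 1".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # ... build the data loaders and model, then run the training loop here ...
    raise RuntimeError("example failure so the traceback gets recorded")

if __name__ == "__main__":
    main()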
I'm hitting problems in training too; in my case the error is "No module named torch.distributed".
The CUDA version is incorrect.
Please follow the installation instructions in INSTALL.md. Could you check which PyTorch version is installed in your environment?
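A quick way to check, assuming nothing about the Restormer setup itself, is to print the installed PyTorch build and the CUDA toolkit it was compiled against:

# Prints the installed PyTorch version and its CUDA build information.
import torch

print("torch version :", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs  :", torch.cuda.device_count())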
This happened with PyTorch 1.12.1 + CUDA 11.6. My graphics card doesn't support CUDA 10.2, so I couldn't install the versions specified in INSTALL.md.
@wizzwu
I hope the link below will be helpful to you.
https://github.com/WongKinYiu/yolov7/issues/1696#issuecomment-1665866230
@wizzwu Hey, did you ever solve this? I'm running into the same problem: it says the argument --local_rank is not recognized.
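For anyone who lands here with the unrecognized --local_rank argument: torchrun sets the LOCAL_RANK environment variable instead of passing a flag, and, if I remember correctly, recent versions of torch.distributed.launch pass --local-rank with a hyphen, so a script that only defines --local_rank can break. A minimal sketch of an argument setup that tolerates all three cases (the names below are illustrative and not taken from Restormer's own parser):

import argparse
import os

parser = argparse.ArgumentParser()
# Accept both the old underscore spelling and the newer hyphen spelling.
parser.add_argument("--local_rank", "--local-rank", dest="local_rank",
                    type=int, default=-1)
args, _ = parser.parse_known_args()

# torchrun passes no flag at all; it exports LOCAL_RANK instead.
local_rank = args.local_rank
if local_rank < 0:
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

print("using local_rank =", local_rank)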