Real-ESRGAN
How to train the model with two GPUs?
I tried to train the model on two GPUs, but it fails with the error below. Why?
! CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=21 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --auto_resume
train.py: error: unrecognized arguments: --local-rank=1
[2024-04-12 09:48:38,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 69084) of binary: /data/envs/geo_real_esrgan/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
main()
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
realesrgan/train.py FAILED
Failures:
[1]:
  time      : 2024-04-12_09:48:38
  host      : geo517
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 69085)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2024-04-12_09:48:38
  host      : geo517
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 69084)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
By the way, I have two GPU cards.
Solved. Instead of:
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node=2 --master_port=4321 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --launcher pytorch --auto_resume
use:
CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 --master_port=4321 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --launcher pytorch --auto_resume
That is, replace python -m torch.distributed.launch with torchrun. Keep CUDA_VISIBLE_DEVICES=0,1 on the same command (same line, or joined with a trailing backslash) so that it applies to the launched processes.
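Why the switch helps, as far as I can tell: in PyTorch 2.x, torch.distributed.launch passes the flag as --local-rank (with a hyphen) to each worker, while the training script's argument parser appears to accept only --local_rank, hence "unrecognized arguments: --local-rank=1". torchrun does not pass that flag at all and sets the LOCAL_RANK environment variable instead. If you need to keep the old launcher, a script can accept both spellings. Below is a minimal sketch of that idea; parse_local_rank is a hypothetical helper written for illustration, not code from Real-ESRGAN or BasicSR.

```python
import argparse
import os


def parse_local_rank(argv=None, env=None):
    """Return the local rank from CLI flags, else from LOCAL_RANK, else 0.

    Hypothetical helper for illustration; not part of Real-ESRGAN.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # Accept both spellings: --local_rank (older launcher behaviour) and
    # --local-rank (what torch.distributed.launch passes in PyTorch 2.x).
    # argparse derives the attribute name "local_rank" from the first flag.
    parser.add_argument(
        "--local_rank", "--local-rank",
        type=int,
        # torchrun passes no flag and exports LOCAL_RANK instead.
        default=int(env.get("LOCAL_RANK", 0)),
    )
    # parse_known_args leaves the script's other options (e.g. -opt) alone.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


if __name__ == "__main__":
    print("local rank:", parse_local_rank())
```

With this in place, the rank is resolved the same way whether the script is started by torchrun or by python -m torch.distributed.launch.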