
Issue when Multi-GPU is set

Open therealjjj77 opened this issue 3 years ago • 2 comments

I'm having this issue using Torch 1.7.1+cu110. Please see below:

```
(venv) C:\Users\Jerr\PycharmProjects\pythonProject1>stylegan2_pytorch --data C:/Transfer/Downloads/Processed/Compressed/Compressed --network-capacity 256 --trunc-psi 0.5 --aug-prob 0.25 --attn-layers 1 --top-k-training --generate-top-k-frac 0.5 --generate-top-k-gamma 0.99 --no-pl-reg --calculate-fid-every 5000 --multi-gpus --num_workers 32
Traceback (most recent call last):
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Jerr\PycharmProjects\pythonProject1\venv\Scripts\stylegan2_pytorch.exe\__main__.py", line 7, in <module>
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\stylegan2_pytorch\cli.py", line 172, in main
    fire.Fire(train_from_folder)
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\fire\core.py", line 468, in _Fire
    target=component.__name__)
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\stylegan2_pytorch\cli.py", line 169, in train_from_folder
    join=True)
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 157, in start_processes
    while not context.join():
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 19, in _wrap
    fn(i, *args)
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\stylegan2_pytorch\cli.py", line 39, in run_training
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 434, in init_process_group
    init_method, rank, world_size, timeout=timeout
  File "c:\users\jerr\pycharmprojects\pythonproject1\venv\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
    raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://
```
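For context: Windows builds of PyTorch 1.7 had only prototype `torch.distributed` support, with no NCCL backend and no registered `env://` rendezvous handler, so `dist.init_process_group('nccl', ...)` fails there on both counts. A minimal sketch of the commonly suggested substitution — the gloo backend plus a `file://` store (the store path here is made up, and `world_size=1` is used only to exercise the call):

```python
import tempfile
from pathlib import Path

import torch.distributed as dist

# Hypothetical file-based store path, just for this demo.
init_file = Path(tempfile.gettempdir()) / "stylegan2_init_demo"
if init_file.exists():
    init_file.unlink()

dist.init_process_group(
    backend="gloo",                  # NCCL is not shipped in Windows builds
    init_method=init_file.as_uri(),  # file:// store instead of the missing env:// handler
    rank=0,
    world_size=1,                    # one process, just to exercise the rendezvous
)
ok = dist.is_initialized()
print(ok)  # → True
dist.destroy_process_group()
```

In `stylegan2_pytorch`, the analogous change would be to the `init_process_group` call at `cli.py` line 39 shown in the traceback; this is a sketch of the idea, not a tested patch.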

therealjjj77 avatar Jan 26 '21 11:01 therealjjj77

I'm running Windows 10 Home with a Tesla K80 (it's really two 12 GB GPUs) and a GeForce RTX 2070 Super, and I'm trying to train on the Tesla K80. I have verified that the GPUs work with PyTorch via the DataParallel method, so I'm not sure why multi-GPU isn't working for this.
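A note on why that DataParallel test can pass while this repo's multi-GPU path fails: `nn.DataParallel` runs in a single process and scatters each batch across visible GPUs itself, so it never calls `dist.init_process_group` and never needs a rendezvous backend. A minimal sketch (the model and shapes are arbitrary; it falls back to CPU when fewer than two GPUs are visible):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # arbitrary toy model
if torch.cuda.device_count() > 1:
    # Single-process data parallelism: replicas are created per forward pass,
    # with no distributed rendezvous involved.
    model = nn.DataParallel(model).cuda()

x = torch.randn(4, 8)
if next(model.parameters()).is_cuda:
    x = x.cuda()
out = model(x)
print(out.shape)  # torch.Size([4, 2])
```

The repo's `--multi-gpus` path instead spawns one process per GPU and rendezvous via `init_process_group`, which is the step that breaks here.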

therealjjj77 avatar Jan 29 '21 13:01 therealjjj77

Let me know if you have found a solution since your post. I recently posted about this on the NVIDIA GitHub. I am trying gpus=2 on a node with two V100s: gpus=1 works fine, but gpus=2 with train.py fails with a traceback similar to yours.

metaphorz avatar Jul 07 '21 23:07 metaphorz