rrl
Cannot run it on Windows
Hi,
I wanted to try this implementation after reading the paper. I installed all the dependencies in a Conda env on a Windows PC. However, I get the following error when I run the experiment:
$ python experiment.py -d tic-tac-toe -bs 32 -s 1@16 -e401 -lrde 200 -lr 0.002 -ki 0 -wd 0.0001 --print_rule -i 0
C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\distributed_c10d.py:608: UserWarning: Attempted
to get default timeout for nccl backend, but NCCL support is not compiled
warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
[W socket.cpp:697] [c10d] The client socket has failed to connect to [A2207000547.china.huawei.com]:47339 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "C:\Users\m00827298\Codes\RRL\experiment.py", line 174, in <module>
train_main(rrl_args)
File "C:\Users\m00827298\Codes\RRL\experiment.py", line 167, in train_main
mp.spawn(train_model, nprocs=args.gpus, args=(args,))
File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 197, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 68, in _wrap
fn(i, *args)
File "C:\Users\m00827298\Codes\RRL\experiment.py", line 57, in train_model
dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1177, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\rendezvous.py", line 246, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\rendezvous.py", line 174, in _create_c10d_store
return TCPStore(
^^^^^^^^^
torch.distributed.DistNetworkError: Unknown error
I am not very familiar with running PyTorch in a Windows environment. Based on the error message "Attempted to get default timeout for nccl backend, but NCCL support is not compiled", I suspect the reason might be that NCCL support is not compiled into your PyTorch installation.
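A quick way to check that suspicion is to ask the local PyTorch build which distributed backends it has. The snippet below is only a minimal sketch, not part of this repository: `pick_backend` and `report` are hypothetical helper names, and it relies on `torch.cuda.is_available()`, `torch.distributed.is_available()`, `is_nccl_available()` and `is_gloo_available()`. As far as I know, the official Windows builds of PyTorch do not include NCCL, so `nccl=False` is the expected result there.

```python
import platform


def pick_backend(cuda_ok: bool, nccl_ok: bool, gloo_ok: bool) -> str:
    """Return the name of a torch.distributed backend this build could use."""
    if cuda_ok and nccl_ok:
        return "nccl"  # GPU collectives; needs CUDA and an NCCL-enabled build
    if gloo_ok:
        return "gloo"  # backend that also works without a GPU
    raise RuntimeError("no usable torch.distributed backend in this build")


def report() -> str:
    """Describe the local install. torch is imported lazily on purpose,
    so pick_backend can be used and tested without PyTorch installed."""
    import torch
    import torch.distributed as dist

    if not dist.is_available():
        return "torch.distributed is not available in this build"
    cuda_ok = torch.cuda.is_available()
    nccl_ok = dist.is_nccl_available()
    gloo_ok = dist.is_gloo_available()
    backend = pick_backend(cuda_ok, nccl_ok, gloo_ok)
    return (f"{platform.system()}: cuda={cuda_ok} nccl={nccl_ok} "
            f"gloo={gloo_ok} -> {backend}")
```

Running `print(report())` inside the `rrl` Conda env shows what the installation supports. Note that this only diagnoses the backend: getting `gloo` back does not mean `experiment.py` will run on CPU, because the call in the traceback hard-codes `backend='nccl'` and the rest of the code may still assume a GPU.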
NCCL seems to be tied to NVIDIA GPUs, and I don't have an NVIDIA GPU on my PC, so I guess that is why I get this warning. Is it possible to run the code using only the CPU?
At present, running on CPU is not supported. I will add a CPU version in the future. Even then, running on a GPU is still recommended, because training on CPU may be slow.
I would like to ask whether your issue has been resolved.
Thanks for asking. I will give it another try when I get a GPU.