
Cannot run it on Windows

Open frankl1 opened this issue 1 year ago • 5 comments

Hi,

I wanted to try this implementation after reading the paper. I installed all the dependencies in a Conda env on a Windows PC. However, I get the following error when I run the experiment:

$ python experiment.py -d tic-tac-toe -bs 32 -s 1@16 -e401 -lrde 200 -lr 0.002 -ki 0 -wd 0.0001 --print_rule -i 0
C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\distributed_c10d.py:608: UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled
  warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
[W socket.cpp:697] [c10d] The client socket has failed to connect to [A2207000547.china.huawei.com]:47339 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "C:\Users\m00827298\Codes\RRL\experiment.py", line 174, in <module>
    train_main(rrl_args)
  File "C:\Users\m00827298\Codes\RRL\experiment.py", line 167, in train_main
    mp.spawn(train_model, nprocs=args.gpus, args=(args,))
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 241, in spawn       
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 197, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 158, in join        
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 68, in _wrap        
    fn(i, *args)
  File "C:\Users\m00827298\Codes\RRL\experiment.py", line 57, in train_model
    dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper    
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
torch.distributed.DistNetworkError: Unknown error

frankl1 avatar Feb 14 '24 10:02 frankl1

I am not very familiar with running PyTorch in a Windows environment. Based on the error message "Attempted to get default timeout for nccl backend, but NCCL support is not compiled", I suspect the reason might be that NCCL support is not compiled into your PyTorch installation.
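One way to check this suspicion is to ask PyTorch directly whether NCCL is available. The snippet below is a minimal sketch, not code from this repository: `pick_backend` and `detect_backend` are hypothetical helper names, while `torch.distributed.is_available()`, `torch.distributed.is_nccl_available()` and `torch.cuda.is_available()` are standard PyTorch calls.

```python
import importlib.util


def pick_backend(nccl_available: bool, cuda_available: bool) -> str:
    """Return 'nccl' only when both NCCL and a CUDA device are usable, else 'gloo'."""
    return "nccl" if (nccl_available and cuda_available) else "gloo"


def detect_backend() -> str:
    """Query the local PyTorch build (hypothetical helper, not part of RRL)."""
    if importlib.util.find_spec("torch") is None:
        raise RuntimeError("PyTorch is not installed in this environment")
    import torch
    import torch.distributed as dist

    # is_nccl_available() reports whether this build was compiled with NCCL.
    nccl_ok = dist.is_available() and dist.is_nccl_available()
    return pick_backend(nccl_ok, torch.cuda.is_available())
```

If `detect_backend()` returns `'gloo'` on your machine, the `backend='nccl'` call shown in the traceback cannot work with that installation.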

12wang3 avatar Feb 19 '24 02:02 12wang3

NCCL seems to be tied to NVIDIA GPUs, and I don't have an NVIDIA GPU on my PC, so I guess that is why I get this warning. Isn't it possible to run the code using only the CPU?

frankl1 avatar Feb 21 '24 09:02 frankl1

At present, CPU is not supported. I will add a CPU version in the future. However, it is still recommended to run on a GPU, otherwise the speed may be slow.
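For reference, generic PyTorch can initialise a process group without NCCL by using the `gloo` backend. The sketch below is an assumption-laden illustration and not this repository's code: `loopback_rendezvous_env` and `init_cpu_process_group` are hypothetical names, and port 29500 is an arbitrary choice. It points the `env://` rendezvous at the loopback address, since the log above shows the client socket failing to connect to the machine's hostname. It would not be enough on its own if the rest of the training code moves tensors to CUDA.

```python
import os


def loopback_rendezvous_env(port: int = 29500) -> dict:
    """Environment variables read by init_method='env://' (port is an assumption)."""
    return {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": str(port)}


def init_cpu_process_group(rank: int, world_size: int, port: int = 29500) -> None:
    """Initialise torch.distributed with gloo (hypothetical helper, not part of RRL)."""
    import torch.distributed as dist  # imported lazily so the helper above works without torch

    os.environ.update(loopback_rendezvous_env(port))
    dist.init_process_group(
        backend="gloo",        # CPU-capable backend, used here instead of 'nccl'
        init_method="env://",  # same init method as in the traceback
        world_size=world_size,
        rank=rank,
    )
```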

12wang3 avatar Mar 17 '24 11:03 12wang3

I would like to ask whether your issue has been resolved.

wanmaxiaobai avatar Apr 29 '24 04:04 wanmaxiaobai

Thanks for asking. I will give it another try when I get a GPU.

frankl1 avatar Apr 29 '24 06:04 frankl1