[Windows] Redirects are currently not supported in Windows or MacOs.
System Info
- `Accelerate` version: 0.26.1
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.11.7
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 63.91 GB
- GPU type: NVIDIA P106-100
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction
Running the test script `accelerate test` with the configuration above results in an error.
I can confirm that regular, non-distributed PyTorch code runs normally on a single CUDA GPU without Accelerate; a minimal sketch of what I mean is shown below.
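For reference, a single-GPU check along these lines (a simplified sketch, not the actual script I ran) completes without errors on this machine:

```python
import torch

# Simplified single-GPU sanity check (illustrative sketch, not my real training code):
# plain PyTorch on cuda:0 works fine here without Accelerate.
assert torch.cuda.is_available()
device = torch.device("cuda:0")
x = torch.randn(1024, 1024, device=device)
y = x @ x  # simple matmul to exercise the GPU
print(torch.cuda.get_device_name(0), float(y.sum()))
```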
Does Accelerate support multi-GPU launches on Windows, and if so, how can I run it given this error?
Expected behavior
Running: accelerate-launch C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\test_utils\scripts\test_script.py
stderr: [2024-02-29 11:07:29,607] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
stderr: [W socket.cpp:697] [c10d] The client socket has failed to connect to [DESKTOP-DL8UIP5]:29500 (system error: 10049 - The requested address is not valid in its context.).
stderr: Traceback (most recent call last):
stderr: File "<frozen runpy>", line 198, in _run_module_as_main
stderr: File "<frozen runpy>", line 88, in _run_code
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Scripts\accelerate-launch.exe\__main__.py", line 7, in <module>
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1029, in main
stderr: launch_command(args)
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1014, in launch_command
stderr: multi_gpu_launcher(args)
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 672, in multi_gpu_launcher
stderr: distrib_run.run(args)
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\run.py", line 803, in run
stderr: elastic_launch(
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
stderr: return launch_agent(self._config, self._entrypoint, list(args))
stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
stderr: result = agent.run()
stderr: ^^^^^^^^^^^
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
stderr: result = f(*args, **kwargs)
stderr: ^^^^^^^^^^^^^^^^^^
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 727, in run
stderr: result = self._invoke_run(role)
stderr: ^^^^^^^^^^^^^^^^^^^^^^
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 862, in _invoke_run
stderr: self._initialize_workers(self._worker_group)
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
stderr: result = f(*args, **kwargs)
stderr: ^^^^^^^^^^^^^^^^^^
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 699, in _initialize_workers
stderr: self._rendezvous(worker_group)
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
stderr: result = f(*args, **kwargs)
stderr: ^^^^^^^^^^^^^^^^^^
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 542, in _rendezvous
stderr: store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
stderr: self._store = TCPStore( # type: ignore[call-arg]
stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: torch.distributed.DistNetworkError: Unknown error
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\test.py", line 54, in test_command
result = execute_subprocess_async(cmd, env=os.environ.copy())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\test_utils\testing.py", line 466, in execute_subprocess_async
raise RuntimeError(
RuntimeError: 'accelerate-launch C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\test_utils\scripts\test_script.py' failed with returncode 1
The combined stderr from workers follows:
[2024-02-29 11:07:29,607] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:697] [c10d] The client socket has failed to connect to [DESKTOP-DL8UIP5]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Scripts\accelerate-launch.exe\__main__.py", line 7, in <module>
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1029, in main
launch_command(args)
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1014, in launch_command
multi_gpu_launcher(args)
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 672, in multi_gpu_launcher
distrib_run.run(args)
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\run.py", line 803, in run
elastic_launch(
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 727, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 862, in _invoke_run
self._initialize_workers(self._worker_group)
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 699, in _initialize_workers
self._rendezvous(worker_group)
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 542, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: Unknown error
This looks like a similar issue to one reported on the PyTorch side: https://github.com/pytorch/pytorch/issues/116056
Personally, I recommend just using WSL instead.
From what I gather from one of the comments on the issue you linked, the error was introduced in PyTorch 2.2, so it should work with 2.1.
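If you want to test that, pinning PyTorch back to a 2.1.x CUDA build should be enough, e.g. `pip install "torch==2.1.2" --index-url https://download.pytorch.org/whl/cu121` (assuming a cu121 wheel is what matches your setup).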
I will try with WSL.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.