
[Windows] Redirects are currently not supported in Windows or MacOs.

KSepetanc opened this issue 1 year ago • 2 comments

System Info

- `Accelerate` version: 0.26.1
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.11.7
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 63.91 GB
- GPU type: NVIDIA P106-100
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [ ] My own task or dataset (give details below)

Reproduction

Running the test script `accelerate test` with the provided configuration results in an error. I can confirm that regular, sequential PyTorch code (without Accelerate) runs normally on a single CUDA GPU.

Does Accelerate support Windows, and if so, how can I run it given this error?
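
For reference, here is a minimal, hypothetical sketch (not part of Accelerate's test suite) that tries to create the same rendezvous `TCPStore` the traceback below fails in; binding to `127.0.0.1` is my assumption, and port `29500` is taken from the log:

```python
# Hypothetical diagnostic: create the rendezvous TCPStore directly to check
# whether the failure is in torch.distributed itself rather than in Accelerate.
from datetime import timedelta

from torch.distributed import TCPStore

# The traceback dies while constructing the TCPStore on port 29500; the
# 10049 socket error suggests the hostname resolves to an unbindable address,
# so try binding to localhost explicitly (assumption, not the default behavior).
store = TCPStore(
    host_name="127.0.0.1",
    port=29500,
    world_size=1,
    is_master=True,
    timeout=timedelta(seconds=30),
)
store.set("probe", "ok")
print(store.get("probe"))  # prints b'ok' if the store can be created on this machine
```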

Expected behavior

Running:  accelerate-launch C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\test_utils\scripts\test_script.py
stderr: [2024-02-29 11:07:29,607] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
stderr: [W socket.cpp:697] [c10d] The client socket has failed to connect to [DESKTOP-DL8UIP5]:29500 (system error: 10049 - The requested address is not valid in its context.).
stderr: Traceback (most recent call last):
stderr:   File "<frozen runpy>", line 198, in _run_module_as_main
stderr:   File "<frozen runpy>", line 88, in _run_code
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Scripts\accelerate-launch.exe\__main__.py", line 7, in <module>
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1029, in main
stderr:     launch_command(args)
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1014, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 672, in multi_gpu_launcher
stderr:     distrib_run.run(args)
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\run.py", line 803, in run
stderr:     elastic_launch(
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
stderr:     return launch_agent(self._config, self._entrypoint, list(args))
stderr:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
stderr:     result = agent.run()
stderr:              ^^^^^^^^^^^
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
stderr:     result = f(*args, **kwargs)
stderr:              ^^^^^^^^^^^^^^^^^^
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 727, in run
stderr:     result = self._invoke_run(role)
stderr:              ^^^^^^^^^^^^^^^^^^^^^^
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 862, in _invoke_run
stderr:     self._initialize_workers(self._worker_group)
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
stderr:     result = f(*args, **kwargs)
stderr:              ^^^^^^^^^^^^^^^^^^
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 699, in _initialize_workers
stderr:     self._rendezvous(worker_group)
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
stderr:     result = f(*args, **kwargs)
stderr:              ^^^^^^^^^^^^^^^^^^
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 542, in _rendezvous
stderr:     store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
stderr:                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr:   File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
stderr:     self._store = TCPStore(  # type: ignore[call-arg]
stderr:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: torch.distributed.DistNetworkError: Unknown error
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\test.py", line 54, in test_command
    result = execute_subprocess_async(cmd, env=os.environ.copy())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\test_utils\testing.py", line 466, in execute_subprocess_async
    raise RuntimeError(
RuntimeError: 'accelerate-launch C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\test_utils\scripts\test_script.py' failed with returncode 1

The combined stderr from workers follows:
[2024-02-29 11:07:29,607] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:697] [c10d] The client socket has failed to connect to [DESKTOP-DL8UIP5]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Scripts\accelerate-launch.exe\__main__.py", line 7, in <module>
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1029, in main
    launch_command(args)
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1014, in launch_command
    multi_gpu_launcher(args)
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 672, in multi_gpu_launcher
    distrib_run.run(args)
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\run.py", line 803, in run
    elastic_launch(
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 727, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 862, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 699, in _initialize_workers
    self._rendezvous(worker_group)
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 542, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Korisnik\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: Unknown error

KSepetanc avatar Feb 29 '24 10:02 KSepetanc

This looks similar to an issue reported on PyTorch: https://github.com/pytorch/pytorch/issues/116056

Personally, I recommend just using WSL instead.

muellerzr avatar Feb 29 '24 15:02 muellerzr

From what I make of one of the comments on the issue you linked, the error was introduced in PyTorch 2.2, so it should work with 2.1.
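
As a quick sanity check before rerunning `accelerate test`, something like this (just a sketch; pinning to the 2.1.x series is my assumption based on that comment) could confirm which PyTorch build is actually active in the environment:

```python
# Sketch: verify the active PyTorch is from the 2.1 series, where the Windows
# rendezvous regression discussed in pytorch/pytorch#116056 is reportedly absent.
import torch

major, minor, *_ = torch.__version__.split(".")
assert (int(major), int(minor)) == (2, 1), (
    f"Expected torch 2.1.x, found {torch.__version__}; "
    "the DistNetworkError above reportedly appears starting with 2.2 on Windows."
)
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```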

I will try with WSL.

KSepetanc avatar Mar 01 '24 10:03 KSepetanc

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 30 '24 15:03 github-actions[bot]