
Socket Timeout is raised when PPO is trained on multiple machines

Open · duyuwen-duen opened this issue 6 months ago · 0 comments

Socket Timeout is raised when PPO is trained on multiple machines.

yaml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
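For context (not stated in the report above): with `num_machines: 2`, each node normally runs the same `accelerate launch` command and differs only in `--machine_rank`. A sketch of the two invocations, with placeholder config filename, script name, IP, and port:

```shell
# On the main node (machine_rank 0); 10.0.0.1:29500 and train_ppo.py are placeholders.
accelerate launch --config_file default_config.yaml \
  --machine_rank 0 --main_process_ip 10.0.0.1 --main_process_port 29500 \
  train_ppo.py

# On the second node (machine_rank 1), same IP/port, pointing at the main node.
accelerate launch --config_file default_config.yaml \
  --machine_rank 1 --main_process_ip 10.0.0.1 --main_process_port 29500 \
  train_ppo.py
```

If the second node cannot reach the main node's IP/port before the rendezvous timeout elapses, the launcher fails with exactly this kind of store timeout.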

error:

```
File "/home/ma-user/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
File "/home/ma-user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
File "/home/ma-user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
    self._initialize_workers(self._worker_group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
    self._rendezvous(worker_group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 545, in _rendezvous
    workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 632, in _assign_worker_ranks
    role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 669, in _share_and_gather
    role_infos_bytes = store_util.synchronize(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
torch.distributed.DistStoreError: Socket Timeout
```
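Not part of the original report, but worth noting: the traceback fails inside `store.get(...)` while the nodes exchange rank info through the rendezvous store on the main machine, so a common first diagnostic is a plain TCP reachability check from the worker node to the main process IP/port. A minimal stdlib sketch (the IP and port below are placeholders for whatever is passed to `accelerate launch`):

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholders: substitute the --main_process_ip / --main_process_port
    # used on the main node.
    print(can_reach("127.0.0.1", 29500))
```

Run this from the `machine_rank: 1` node; `False` here means a firewall or routing problem, not an accelerate bug.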

duyuwen-duen · Jun 26 '25 14:06