alignment-handbook
Help running SFT on multiple machines, for example 8 nodes (1 A100 per node)
I modified deepspeed_zero3.yaml, setting num_machines to 8 and num_processes to 8, and got the following error. What else do I need to do to run SFT on an 8-node platform? Thanks.
  File "/home/work/xx/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/work/xx/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/work/xx/lib/python3.11/site-packages/accelerate/commands/launch.py", line 971, in launch_command
    deepspeed_launcher(args)
  File "/home/work/xx/lib/python3.11/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 230, in launch_agent
    master_addr, master_port = _get_addr_and_port(rdzv_parameters)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 170, in _get_addr_and_port
    master_addr, master_port = parse_rendezvous_endpoint(endpoint, default_port=-1)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 95, in parse_rendezvous_endpoint
    raise ValueError(
ValueError: The port number of the rendezvous endpoint 'None:None' must be an integer between 0 and 65536.
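The endpoint 'None:None' in the error suggests that the launcher received neither a main process IP nor a main process port, which would be the case if only num_machines and num_processes were changed. A minimal sketch of the per-node launch command follows, assuming the same command is started once on every node with a different machine rank. The flags --machine_rank, --main_process_ip and --main_process_port are accelerate launch options; the helper build_launch_command, the address 10.0.0.1, the port 29500, and the config and script paths are hypothetical placeholders, not values from this thread.

```python
import shlex


def build_launch_command(machine_rank, main_process_ip, main_process_port,
                         num_machines=8, gpus_per_machine=1,
                         config_file="deepspeed_zero3.yaml",  # placeholder path
                         script="run_sft.py"):                # placeholder path
    """Build the `accelerate launch` command for one node (hypothetical helper)."""
    if not 0 <= machine_rank < num_machines:
        raise ValueError("machine_rank must be in [0, num_machines)")
    if not 0 < main_process_port < 65536:
        raise ValueError("main_process_port must be a valid TCP port")
    args = [
        "accelerate", "launch",
        "--config_file", config_file,
        "--num_machines", str(num_machines),
        # num_processes is the total across all machines, not per node
        "--num_processes", str(num_machines * gpus_per_machine),
        # rank 0 is the node whose address is given as main_process_ip
        "--machine_rank", str(machine_rank),
        "--main_process_ip", main_process_ip,
        "--main_process_port", str(main_process_port),
        script,
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Print the command to run on each of the 8 nodes (placeholder address/port).
    for rank in range(8):
        print(build_launch_command(rank, "10.0.0.1", 29500))
```

The same three values can instead be set as machine_rank, main_process_ip and main_process_port in the YAML config, with machine_rank differing on each node. Whether this alone resolves the error depends on the rest of the setup, for example that every node can reach the main node on the chosen port.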
@Atlantic8 Did you solve this issue?