accelerate
[WIP] DeepSpeed launcher related changes
What does this PR do?
- Removes one subprocess call for DeepSpeed in the Single Node Multi-GPU and Multi Node Multi-GPU setups when using the Standard launcher.
As discussed offline, distrib_run.run(distrib_args) hangs indefinitely for the Multi Node Multi-GPU setup, even with standard DDP. This needs to be fixed, because the fix will also unblock the Multi Node Multi-GPU setup using the Standard launcher of the DeepSpeed integration. (Other DeepSpeed launchers, such as PDSH, work fine for Multi Node Multi-GPU.)
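For readers unfamiliar with the pattern, here is a minimal sketch of what "removing a subprocess call" means for a launcher: instead of building a command line and handing it to a new interpreter, the arguments are parsed and passed straight to the launcher's entry point, in the spirit of distrib_run.run(distrib_args). Everything in the block (toy_launcher, build_parser, run, subprocess_command, launch_in_process) is a hypothetical stand-in written for illustration; it is not the accelerate, DeepSpeed, or PyTorch code touched by this PR.

```python
import argparse
import sys


def build_parser():
    # Hypothetical, minimal launcher arguments; a real launcher has many more.
    parser = argparse.ArgumentParser(prog="toy_launcher")
    parser.add_argument("--nproc_per_node", type=int, default=1)
    parser.add_argument("--nnodes", type=int, default=1)
    parser.add_argument("training_script")
    return parser


def run(args):
    # Hypothetical stand-in for a launcher's run(args) entry point.
    # It only reports how many workers would be started across all nodes.
    return args.nproc_per_node * args.nnodes


def subprocess_command(cli_args):
    # The extra hop: a command line that would be handed to subprocess.Popen,
    # starting a second interpreter that parses the same arguments again.
    return [sys.executable, "-m", "toy_launcher", *cli_args]


def launch_in_process(cli_args):
    # The direct route: parse once and call the entry point in this process.
    args = build_parser().parse_args(cli_args)
    return run(args)


if __name__ == "__main__":
    cli_args = ["--nproc_per_node", "4", "--nnodes", "2", "train.py"]
    print("subprocess route would execute:", subprocess_command(cli_args))
    print("in-process route starts workers:", launch_in_process(cli_args))
```

The in-process route avoids one interpreter start-up and keeps errors and return values in the calling process, but it also means a hang inside run(args) blocks the caller directly, which is why the hang described above matters for this change.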
There is a bit too much in this PR to wrap my head around. Can we split it between multiGPU launcher fixes, DeepSpeed launcher fixes and other fixes? Thanks!
- The MultiGPU launcher fixes and simplification were moved to a separate PR by Zach, #627
- The other minor fixes are in #630
- This PR will now contain only the DeepSpeed launcher updates that remove a subprocess call