accelerate

[WIP] DeepSpeed launcher related changes

Open pacman100 opened this issue 3 years ago • 2 comments

What does this PR do?

  1. Removes one subprocess call for DeepSpeed in the Single Node Multi-GPU and Multi Node Multi-GPU setups when using the Standard launcher.

As discussed offline, `distrib_run.run(distrib_args)` hangs indefinitely in the Multi Node Multi-GPU setup, even with standard DDP. This needs to be fixed, since fixing it will also make the Multi Node Multi-GPU setup work with the Standard launcher of the DeepSpeed integration. (Other DeepSpeed launchers, such as PDSH, work fine for Multi Node Multi-GPU.)
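
For readers following along, here is a minimal sketch of the difference between launching through a subprocess and calling the launcher in-process. It is not the code in this PR: `build_launch_args`, `launch_with_subprocess`, and `launch_in_process` are hypothetical helper names, the default address and port are placeholders, and the sketch assumes `distrib_run` refers to `torch.distributed.run`.

```python
import subprocess
import sys


def build_launch_args(script, script_args, nproc_per_node, nnodes=1,
                      node_rank=0, master_addr="127.0.0.1", master_port=29500):
    """Build torchrun-style arguments (hypothetical helper, not accelerate's API)."""
    args = [f"--nproc_per_node={nproc_per_node}", f"--nnodes={nnodes}"]
    if nnodes > 1:
        # Rendezvous details only matter when more than one node takes part.
        args += [
            f"--node_rank={node_rank}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
        ]
    return args + [script, *script_args]


def launch_with_subprocess(launch_args):
    """Spawn a separate Python process that runs the launcher module."""
    cmd = [sys.executable, "-m", "torch.distributed.run", *launch_args]
    return subprocess.run(cmd, check=True)


def launch_in_process(launch_args):
    """Call the launcher directly, avoiding the extra subprocess."""
    # Imported lazily so the argument builder can be used without torch installed.
    import torch.distributed.run as distrib_run

    parsed = distrib_run.get_args_parser().parse_args(launch_args)
    distrib_run.run(parsed)
```

The in-process variant is the call that currently hangs for Multi Node Multi-GPU, so the sketch only shows the shape of the change, not a fix for the hang.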

pacman100, Aug 11 '22 14:08

The documentation is not available anymore as the PR was closed or merged.

There is a bit too much in this PR to wrap my head around. Can we split it into multi-GPU launcher fixes, DeepSpeed launcher fixes, and other fixes? Thanks!

  1. The multi-GPU launcher fixes and simplification were moved to a separate PR by Zach: #627
  2. The other minor fixes are in #630
  3. This PR will make the DeepSpeed launcher updates that remove a subprocess call

pacman100, Aug 12 '22 06:08