deepspeed/launcher: add launcher_helper as each rank's start portal
File Changes:
- multinode_runner.py: modify the MPICH runner to use launcher_helper
- launcher_helper.py: init script that maps env variables for each rank
Description: The previous MPICH runner could hit the Linux command-line size limit when the number of ranks is extremely high. After discussion, we optimize this by using a helper script as each rank's start portal, which maps env variables such as rank and local_rank for DeepSpeed. So far it is only used by the MPICH runner, but it is designed to be extendable: any runner facing a similar situation can be added. Only the necessary args are passed to the helper script.
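For reference, below is a minimal sketch of what such a per-rank portal script could look like. This is an illustrative assumption, not the actual contents of launcher_helper.py: the MPI environment variable names (PMI_RANK, PMI_SIZE, MPI_LOCALRANKID) and the argument handling are placeholders for whatever the MPICH launcher actually provides.

```python
# Hypothetical sketch of a launcher_helper-style portal script.
# Assumes MPICH/Hydra exposes PMI_RANK, PMI_SIZE and MPI_LOCALRANKID;
# the real launcher_helper.py may use different variables and options.
import os
import sys


def main():
    env = os.environ.copy()
    # Map MPI-provided env vars to the names DeepSpeed expects per rank.
    env["RANK"] = env.get("PMI_RANK", "0")
    env["WORLD_SIZE"] = env.get("PMI_SIZE", "1")
    env["LOCAL_RANK"] = env.get("MPI_LOCALRANKID", "0")

    # Everything after the helper itself is the user command,
    # e.g. "python train.py --deepspeed ...".
    user_cmd = sys.argv[1:]
    if not user_cmd:
        sys.exit("launcher_helper: no user command given")
    # Replace the helper process with the user command so each rank
    # starts its training process with the mapped environment.
    os.execvpe(user_cmd[0], user_cmd, env)


if __name__ == "__main__":
    main()
```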
Let us know if you have any suggestions.
@tjruwase Hi, could you please help start the workflow for this PR? Thanks.
Hi @jeffra @awan-10 @tjruwase, are there any comments on this PR? Thanks!
Hi @YizhouZ can you show the command line launched by deepspeed before and after your PR, illustrating how your PR helps reduce the command-line length? Thanks!
Sure.
Without this PR, when using the MPICH runner, the command looks like:
cmd = mpirun --genv xxx -n 1 -env xxx python xxx.py : -n 1 -env xxx python xxx.py : <repeat local number times> : -n 1 -env xxx python xxx.py
When the total number of ranks is large, this command becomes extremely long and hits the command-line word size limit.
After this PR, the command is much shorter:
cmd = mpirun --genv xxx python xxx.py <with no additional per-rank commands>
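To make the length difference concrete, here is a small illustrative comparison (hypothetical command construction, not the actual multinode_runner.py code; the module path deepspeed.launcher.launcher_helper and the env names are assumptions):

```python
# The per-rank MPMD form grows linearly with the number of ranks,
# while the helper-based form stays roughly constant in length.
num_ranks = 4096
user_cmd = "python train.py --deepspeed"

# Before: one "-n 1 -env ... <cmd>" segment per rank, joined with ':'.
per_rank_segments = " : ".join(
    f"-n 1 -env RANK={r} {user_cmd}" for r in range(num_ranks)
)
old_cmd = f"mpirun --genv MASTER_ADDR=node0 {per_rank_segments}"

# After: a single command; each rank derives RANK/LOCAL_RANK itself
# through the helper script, so nothing is repeated per rank.
new_cmd = (
    f"mpirun --genv MASTER_ADDR=node0 -n {num_ranks} "
    f"python -m deepspeed.launcher.launcher_helper {user_cmd}"
)

print(len(old_cmd), len(new_cmd))  # old_cmd is orders of magnitude longer at this scale
```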
Hi @mrwyattii, do you have any comments on this PR? This PR is essential for running DeepSpeed training on thousands of nodes with MPICH. The former implementation made the command line so long that it overflowed the command-line buffer; the new implementation fixes this issue.
Hi @tjruwase @mrwyattii, do you have any comments on this PR? As @delock said previously, this new implementation is essential for enabling DeepSpeed training on a large number of nodes; otherwise the training process hits the Linux command-line limit.
@YizhouZ, @delock, thanks for this great PR. Apologies for the delay. We will review asap.
Thank you so much for this quick merge! @tjruwase @mrwyattii @loadams