deepspeed/launcher: add launcher_helper as each rank's start portal
File Changes:
- multinode_runner.py: modify the MPICH runner to use launcher_helper
- launcher_helper.py: init script that maps env variables for each rank
Description: The previous MPICH runner could hit the Linux command-line size limit when the number of ranks is extremely high. After discussion, we optimize this by using a helper script as each rank's start portal, which maps env variables such as rank and local_rank for DeepSpeed. So far it is only used by the MPICH runner, but it is designed to be extendable: any runner facing a similar situation can be added. Only the necessary args are passed to the helper script.
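For reference, below is a minimal sketch of what such a per-rank portal script could look like. This is an illustrative assumption, not the actual contents of launcher_helper.py: the MPI environment variable names (PMI_RANK, PMI_SIZE, MPI_LOCALRANKID) and the argument handling are placeholders for whatever the MPICH launcher actually provides.

```python
# Hypothetical sketch of a launcher_helper-style portal script.
# Assumes MPICH/Hydra exposes PMI_RANK, PMI_SIZE and MPI_LOCALRANKID;
# the real launcher_helper.py may use different variables and options.
import os
import sys


def main():
    env = os.environ.copy()
    # Map MPI-provided env vars to the names DeepSpeed expects per rank.
    env["RANK"] = env.get("PMI_RANK", "0")
    env["WORLD_SIZE"] = env.get("PMI_SIZE", "1")
    env["LOCAL_RANK"] = env.get("MPI_LOCALRANKID", "0")

    # Everything after the helper itself is the user command,
    # e.g. "python train.py --deepspeed ...".
    user_cmd = sys.argv[1:]
    if not user_cmd:
        sys.exit("launcher_helper: no user command given")
    # Replace the helper process with the user command so each rank
    # starts its training process with the mapped environment.
    os.execvpe(user_cmd[0], user_cmd, env)


if __name__ == "__main__":
    main()
```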
Let us know if you have any suggestions.
@tjruwase Hi, could you please help start the workflow for this PR? Thanks.
Hi @jeffra @awan-10 @tjruwase, are there any comments on this PR? Thanks!
Hi @YizhouZ can you show the command line launched by deepspeed before and after your PR, illustrating how your PR helps reduce the command-line length? Thanks!
Sure.
Without this PR, when using the MPICH runner, the command looks like:
cmd = mpirun --genv xxx -n 1 -env xxx python xxx.py : -n 1 -env xxx python xxx.py : <repeat local number times> : -n 1 -env xxx python xxx.py
When the total number of ranks is large, this command becomes extremely long and hits the command-line word size limit.
After this PR, the command is much shorter:
cmd = mpirun --genv xxx python xxx.py <with no additional per-rank commands>
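To make the length difference concrete, here is a small illustrative comparison (hypothetical command construction, not the actual multinode_runner.py code; the module path deepspeed.launcher.launcher_helper and the env names are assumptions):

```python
# The per-rank MPMD form grows linearly with the number of ranks,
# while the helper-based form stays roughly constant in length.
num_ranks = 4096
user_cmd = "python train.py --deepspeed"

# Before: one "-n 1 -env ... <cmd>" segment per rank, joined with ':'.
per_rank_segments = " : ".join(
    f"-n 1 -env RANK={r} {user_cmd}" for r in range(num_ranks)
)
old_cmd = f"mpirun --genv MASTER_ADDR=node0 {per_rank_segments}"

# After: a single command; each rank derives RANK/LOCAL_RANK itself
# through the helper script, so nothing is repeated per rank.
new_cmd = (
    f"mpirun --genv MASTER_ADDR=node0 -n {num_ranks} "
    f"python -m deepspeed.launcher.launcher_helper {user_cmd}"
)

print(len(old_cmd), len(new_cmd))  # old_cmd is orders of magnitude longer at this scale
```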
Hi @mrwyattii, do you have any comments on this PR? This PR is essential for running DeepSpeed training on thousands of nodes with MPICH. The former implementation made the command line so long that it overflowed the command-line buffer; the new implementation fixes this issue.
Hi @tjruwase @mrwyattii, do you have any comments on this PR? As @delock said previously, this new implementation is essential for enabling DeepSpeed training on a large number of nodes; otherwise the training process hits the Linux command-line limit.
@YizhouZ, @delock, thanks for this great PR. Apologies for the delay. We will review asap.
Thank you so much for this quick merge! @tjruwase @mrwyattii @loadams