
[BUG] unable to use a hostfile with a name that is not "hostfile"

Open siddharth9820 opened this issue 11 months ago • 4 comments

Describe the bug

I am trying to launch multiple Megatron-DeepSpeed jobs on a Slurm-based cluster. For each job, I want to create a different hostfile called hostfile_${SLURM_JOBID}. However, when I tried this with deepspeed --hostfile=hostfile_${SLURM_JOBID}, the logs showed that deepspeed was unable to detect the hostfile and erroneously set the world size to 4 (the number of GPUs on a single node). Interestingly, creating a hostfile with the name 'hostfile' does not lead to this error.
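For illustration, a per-job hostfile like the one described could be written as in the sketch below. It assumes the `<hostname> slots=<n>` line format that DeepSpeed documents for hostfiles and 4 GPUs per node. The function name `write_hostfile`, the node names, and the fallback job ID are hypothetical placeholders, not taken from the original setup.

```python
import os
import tempfile


def write_hostfile(hosts, slots_per_node, path):
    """Write one '<hostname> slots=<n>' line per node and return the absolute path."""
    lines = [f"{host} slots={slots_per_node}" for host in hosts]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return os.path.abspath(path)


# Hypothetical node names; on a Slurm cluster they could be obtained with
# `scontrol show hostnames "$SLURM_JOB_NODELIST"`.
job_id = os.environ.get("SLURM_JOBID", "12345")  # placeholder fallback for the sketch
out_dir = tempfile.mkdtemp()  # stand-in for the job's launch directory
hostfile = write_hostfile(
    ["nid001", "nid002"], 4, os.path.join(out_dir, f"hostfile_{job_id}")
)
print(hostfile)
```

Returning an absolute path makes it easier to pass the same file to `--hostfile` regardless of the directory the launcher is started from.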

System info (please complete the following information):

  • Perlmutter supercomputer

Launcher context

  • launching with the deepspeed launcher

Docker context

  • N/A

siddharth9820 avatar Mar 03 '24 12:03 siddharth9820

Hi @siddharth9820, this is interesting. Can you double-check that the hostfiles with the Slurm job IDs are being created and are accessible at the path you are launching from?

https://github.com/microsoft/DeepSpeed/blob/bcc617a0009dd27b4e144de59979bd7770eaf57c/deepspeed/launcher/runner.py#L201

This is the check that needs to pass to pull in a hostfile on the launching node.
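A quick way to run this kind of verification before launching is sketched below. It assumes the linked check amounts to a plain file-existence test on the given path, which is not confirmed here; `resolve_hostfile` is a hypothetical helper, not part of DeepSpeed.

```python
import os
import tempfile


def resolve_hostfile(path):
    """Return the absolute hostfile path, or None if no regular file exists there."""
    abs_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(abs_path):
        return None
    return abs_path


# Demo with a temporary directory standing in for the launch directory.
launch_dir = tempfile.mkdtemp()
existing = os.path.join(launch_dir, "hostfile_12345")  # placeholder job ID
with open(existing, "w") as f:
    f.write("nid001 slots=4\n")  # hypothetical node name

found = resolve_hostfile(existing)
missing = resolve_hostfile(os.path.join(launch_dir, "hostfile_99999"))
print("found:", found)
print("missing:", missing)
```

If the helper returns None for the per-job file, the path or working directory is the likely culprit rather than the file name itself.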

jeffra avatar Mar 03 '24 17:03 jeffra

Hi @siddharth9820 - what is the status of this issue? Were you able to check whether the hostfiles were created correctly?

loadams avatar Apr 15 '24 17:04 loadams

Hi @loadams, sorry I didn't have the bandwidth to investigate this issue further. I just chugged along with creating hostfiles named "hostfile" and running one job at a time.

siddharth9820 avatar Apr 15 '24 21:04 siddharth9820

Thanks @siddharth9820 - please do let us know if you have time to investigate later.

loadams avatar Apr 29 '24 19:04 loadams