DeepSpeed
[BUG] unable to use a hostfile with a name that is not "hostfile"
Describe the bug I am trying to launch multiple Megatron-DeepSpeed jobs on a Slurm-based cluster. For each job, I want to create a separate hostfile named hostfile_${SLURM_JOBID}. However, when I launched with deepspeed --hostfile=hostfile_${SLURM_JOBID}, the logs showed that deepspeed did not detect the hostfile and erroneously set the world size to 4 (the number of GPUs on a single node). Interestingly, creating a hostfile named 'hostfile' does not lead to this error.
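One way to rule out a relative-path problem is to write each job's hostfile to an absolute path and pass that exact path to --hostfile. The following is only a minimal sketch, assuming the usual `hostname slots=N` hostfile layout. `write_hostfile` is a hypothetical helper, not part of DeepSpeed, and the node names are assumed to come from the job's allocation (for example, from `scontrol show hostnames`).

```python
import os


def write_hostfile(hostnames, slots_per_node, job_id, directory="."):
    """Hypothetical helper: write hostfile_<job_id> and return its absolute path.

    Assumes the 'hostname slots=N' line format, one line per node.
    """
    path = os.path.abspath(os.path.join(directory, f"hostfile_{job_id}"))
    with open(path, "w") as f:
        for host in hostnames:
            f.write(f"{host} slots={slots_per_node}\n")
    return path


# Example usage (node names are placeholders):
#   job_id = os.environ["SLURM_JOBID"]
#   path = write_hostfile(["node001", "node002"], 4, job_id)
#   then launch with: deepspeed --hostfile=<path> ...
```

Returning the absolute path means the launch command does not depend on the working directory the job script happens to run in.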
System info:
- Perlmutter supercomputer
Launcher context
- launching with the deepspeed launcher
Docker context N/A
Hi @siddharth9820, this is interesting. Can you double-check that the hostfiles with the Slurm job IDs are being created and are accessible at the path you are launching from?
https://github.com/microsoft/DeepSpeed/blob/bcc617a0009dd27b4e144de59979bd7770eaf57c/deepspeed/launcher/runner.py#L201
This is the check that needs to pass to pull in a hostfile on the launching node.
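To see how a given hostfile path resolves on the launching node, a small diagnostic like the one below may help. It is only a sketch, assuming the linked check is essentially a test that the path, as given, points to an existing regular file. `describe_hostfile` is a hypothetical helper and not part of DeepSpeed.

```python
import os


def describe_hostfile(path):
    """Hypothetical diagnostic: report how a hostfile path resolves on this node."""
    resolved = os.path.abspath(path)
    return {
        "cwd": os.getcwd(),                        # directory relative paths resolve against
        "resolved": resolved,                      # absolute form of the given path
        "is_file": os.path.isfile(resolved),       # True only for an existing regular file
        "readable": os.access(resolved, os.R_OK),  # True if the current user can read it
    }


# Example usage from the job script's working directory:
#   job_id = os.environ.get("SLURM_JOBID", "")
#   print(describe_hostfile(f"hostfile_{job_id}"))
```

If `is_file` is false, comparing `cwd` and `resolved` against the location where the hostfile was written should show whether the file is missing or only being looked up in the wrong directory.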
Hi @siddharth9820 - what is the status of this issue? Were you able to check whether the hostfiles were created correctly?
Hi @loadams, sorry, I didn't have the bandwidth to investigate this issue further. I just carried on by creating a hostfile named "hostfile" and running one job at a time.
Thanks @siddharth9820 - please do let us know if you have time to investigate later.