🐛[BUG]: `physicsnemo.distributed.DistributedManager` expects `srun` rather than `sbatch`
Version
1.0.0
On which installation method(s) does this occur?
Pip
Describe the issue
When running DistributedManager within a SLURM script submitted with sbatch, the process fails at the DistributedManager.setup(...) call (line 353). This is due to the absence of the SLURM_LAUNCH_NODE_IPADDR environment variable when using sbatch.
This variable is set when running with srun, but not with sbatch. As a result, any script that uses DistributedManager in a SLURM job submitted via sbatch encounters a KeyError.
A workaround is to retrieve SLURM_LAUNCH_NODE_IPADDR via srun and export it manually in the SLURM script before running any Python code that invokes DistributedManager, e.g.:
export SLURM_LAUNCH_NODE_IPADDR=$(srun printenv | awk -F= '/^SLURM_LAUNCH_NODE_IPADDR/{print $2}')
It's probably worth adding a check for the existence of the SLURM_LAUNCH_NODE_IPADDR variable before DistributedManager.setup(...) is called, so it can fail gracefully. Alternatively, a subprocess call could retrieve it automatically.
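For illustration, a minimal Python sketch of the subprocess idea (hypothetical, not part of physicsnemo; it assumes the allocation allows a one-task srun step):

import os
import subprocess

# Hypothetical workaround sketch: if sbatch did not set SLURM_LAUNCH_NODE_IPADDR,
# fetch it from a one-task srun step before DistributedManager is used.
if "SLURM_LAUNCH_NODE_IPADDR" not in os.environ:
    result = subprocess.run(
        ["srun", "--ntasks=1", "printenv", "SLURM_LAUNCH_NODE_IPADDR"],
        capture_output=True, text=True, check=True,
    )
    os.environ["SLURM_LAUNCH_NODE_IPADDR"] = result.stdout.strip()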
Minimum reproducible example
Relevant log output
Environment details
Hi @Tristanjmeyers - sorry to keep you waiting, I was on vacation last week.
I see the challenge you're describing. DistributedManager typically expects the srun interface; implicitly, it assumes your launch script, say a launch_training.sh bash script, looks like this:
#!/bin/bash -l
#SLURM SETUP PARAMETERS
srun [srun params] python train.py [python parameters]
You would then run sbatch launch_training.sh, and SLURM would run your script, calling srun from the launch node and setting the needed variables correctly.
Your workflow seems different - are you directly calling sbatch train.py or similar? IMO, it's good that this raises a KeyError instead of silently doing the wrong thing. I appreciate that your workaround gets the variable, but from the perspective of the physicsnemo SLURM integration it might make more sense to force a failure and a submission-script change. One problem with a subprocess quietly fetching the variable is that sbatch might launch the Python script without MPI or other settings, and each script will set up its own SLURM_LAUNCH_NODE_IPADDR variable. This will certainly cause the later init_process_group call to hang until it times out after 300 s, which could be even more confusing.
This Stack Overflow post has a good comparison of srun vs. sbatch. I'm sure you've already seen it! But it might be a useful reference if anyone comes across this issue in the future.
If you'd like to open a PR to check that SLURM_LAUNCH_NODE_IPADDR exists before calling setup, I'd be happy to take a look. A more graceful failure with an informative error message ("use srun, not sbatch, to launch the Python scripts") seems appropriate. Just tag me as a reviewer if you'd like to do this.
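Something like this rough sketch is what I have in mind (names are illustrative, not the actual physicsnemo code):

import os

def check_slurm_launch_env():
    # Illustrative guard: fail early with a clear message instead of a bare
    # KeyError when the job was submitted with sbatch but the script was not
    # launched via srun.
    if "SLURM_LAUNCH_NODE_IPADDR" not in os.environ:
        raise RuntimeError(
            "SLURM_LAUNCH_NODE_IPADDR is not set. Launch the Python script "
            "with srun inside your sbatch submission script, not with sbatch "
            "directly, before DistributedManager.setup() is called."
        )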
No worries. That makes sense.
I call sbatch train.sl, where train.sl looks something like this (note that I use pixi to manage my environments on our supercomputer):
#!/bin/bash
#SBATCH --job-name=gpu_training_job
#SBATCH --cluster=hpc
#SBATCH --partition=niwa_work
#SBATCH --time=23:59:00
#SBATCH --mem=50G
#SBATCH --cpus-per-task=1
#SBATCH --gpus-per-node=A100:1
#SBATCH --account=niwa_account
#SBATCH --ntasks=1
module load cuDNN/8.8.0.121-CUDA-12.0.0
export SLURM_LAUNCH_NODE_IPADDR=$(srun printenv | awk -F= '/^SLURM_LAUNCH_NODE_IPADDR/{print $2}')
pixi run -e gpu "python train.py"
I'd be happy to create a PR for this. Cheers.
OK, I think I see the problem at least. If you're expecting train.py to be a multi-GPU, parallel execution of a single program, you'd need to wrap it in srun. As it is, though, you're requesting 1 GPU and 1 CPU - you can certainly launch a single instance of a program via sbatch. Since you don't need anything distributed, maybe just skip DistributedManager in that case? Or launch via srun with a single process inside this sbatch script.
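As a sketch of that single-process fallback (assuming the usual DistributedManager.initialize() entry point and device property - adapt to whatever train.py actually does):

import os
import torch

# Hypothetical fallback: only go through DistributedManager when the script was
# actually launched via srun (which sets SLURM_LAUNCH_NODE_IPADDR); otherwise
# just pick a single device and train as a plain single-process job.
if "SLURM_LAUNCH_NODE_IPADDR" in os.environ:
    from physicsnemo.distributed import DistributedManager
    DistributedManager.initialize()       # assumed entry point
    device = DistributedManager().device  # assumed singleton device property
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")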
Your PR is still welcome if you want to clean up the error message, but I think in your particular case you can work around more easily with srun too. Not sure how it interacts with pixi, I've never used it myself.
Sorry for the long delay - I was on holiday! But here you go; it's a pretty simple PR, but let me know if I missed anything or you'd like anything changed:
Hope you had a good holiday :).
PR looks good, nice and straightforward. I'll merge it once CI tests pass and it should be available in the upcoming release.
Going to close this for now, since it's been stale for 2 months. Let's reopen the issue and sort out the PR if/when this is an issue again?
@Tristanjmeyers It would be great if you could share your work and how PhysicsNeMo has been helpful in the Discussions section. It would also be useful to get your input on https://github.com/NVIDIA/physicsnemo/discussions/1205 about what functionality in PhysicsNeMo has been most helpful to you.