SLURM quality of life improvements
Description
Making a couple of requests to improve QoL on SLURM
Detailed Proposal
It would be helpful to have -
- [x] The ability to specify the output path. Currently, you need to cd to the right path for this, which generally needs a helper function to set up the directory, cd to it, and then launch via torchx. torchx can ideally handle it for us. #416
- [x] Code isolation and reproducibility. While doing research, we make a change, launch an experiment, and repeat. To make sure each experiment uses the same consistent code, we copy the code to the experiment directory (which also helps with reproducibility). #416
- [ ] Verification of the passed launch script. If I launch from the wrong directory, for instance, I would still queue up the job and wait a few minutes / hours, only for it to crash because of a wrong path (i.e. the launch script does not exist).
- [x] Being able to specify a job name. SLURM shows job details, including the job name, when running the `squeue` command. If our jobs are all run via torchx, every job will be named `train_app-{i}`, which makes it hard to identify which experiment / project a job belongs to.
- [x] The `time` argument doesn't say what the unit is - maybe we just follow the SLURM API, but it would be nice if we clarified that.
- [ ] torchx submits jobs in heterogeneous mode. This is something FAIR users aren't familiar with - I'm guessing in terms of execution and command support there should be feature and scheduling speed parity (not sure about the latter)? The `squeue` output shows every node as a separate line - so a 32-node job takes 32 lines instead of 1. This just makes it harder to monitor jobs - not a technical issue, just a QoL one :)
- [x] The job logs are created in `slurm-{job-id}-train_app-{node-id}.out` files (per node) and a single `slurm-{job-id}.out`. Normally, our jobs instead have logs of the form `{job-id}-{node-id}.out` and `{job-id}-{node-id}.err` (per node) - the separation between `stderr` and `stdout` makes it easier to find which machine actually crashed. And I'm not sure what `slurm-{job-id}.out` corresponds to - maybe it's a consequence of the heterogeneous jobs? With torchelastic, it becomes harder to debug which node crashed, since every node logs a crash (so grepping for `Traceback` will return each log file instead of just the node that originally crashed) - maybe there is a way to figure this out and I just don't know what to look for?
- [ ] The `global_rank` is not equal to `local_rank + node_id * gpus_per_node`, i.e. global rank 0 can be on node 3.
- [ ] Automatically set nomem on pcluster.
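To make the rank-layout item above concrete, here is a minimal sketch of the naive expectation that doesn't hold; the helper name is hypothetical, not a TorchX API:

```python
def expected_global_rank(local_rank: int, node_id: int, gpus_per_node: int) -> int:
    """The layout one would naively expect: ranks packed node by node."""
    return local_rank + node_id * gpus_per_node

# With torchelastic's rendezvous, node join order is not deterministic, so
# the RANK actually assigned to a worker can differ from this expectation -
# e.g. global rank 0 may end up on node 3 rather than node 0.
assert expected_global_rank(local_rank=0, node_id=0, gpus_per_node=8) == 0
assert expected_global_rank(local_rank=3, node_id=2, gpus_per_node=8) == 19
```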
adding this support for slurm wouldn't be too bad:
- generalize the workspace file logic from docker_workspace (.torchxignore)
- add a job_dir argument to allow specifying an isolation env
- change launch code to cp + chdir
- add some statefile (.torchxjobdirs) so torchx log knows where to find logs for slurm
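The cp + chdir flow sketched above could look roughly like this; everything here (the function name, the `.torchxignore` and `.torchxjobdirs` handling) is an illustration of the proposal, not existing TorchX code:

```python
import json
import shutil
from pathlib import Path

def prepare_job_dir(workspace: str, job_dir: str, job_id: str) -> str:
    """Copy the workspace into an isolated per-job directory (cp), record it
    in a .torchxjobdirs statefile so `torchx log` can later find the logs,
    and return the directory the launcher should chdir into."""
    dst = Path(job_dir) / job_id

    # Honor .torchxignore patterns, generalizing the docker_workspace logic.
    ignore = None
    ignore_file = Path(workspace) / ".torchxignore"
    if ignore_file.exists():
        ignore = shutil.ignore_patterns(*ignore_file.read_text().split())
    shutil.copytree(workspace, dst, ignore=ignore)

    # Statefile mapping job_id -> job dir, as proposed above.
    state = Path(job_dir) / ".torchxjobdirs"
    entries = json.loads(state.read_text()) if state.exists() else {}
    entries[job_id] = str(dst)
    state.write_text(json.dumps(entries))
    return str(dst)
```

The launcher would then `os.chdir()` into the returned directory before invoking `sbatch`, which also gives the code isolation / reproducibility behavior requested above.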
something like job_dir we could relatively easily extend to local_cwd, local_docker -- more complex for k8s/batch/ray
For the heterogeneous jobs displaying differently, that's tricky in the current model. Macros like `replica_id` generally need to be applied on a per-worker basis. If we wrap the app in a runtime, that does allow us to materialize them later, though it adds an extra dependency. Slurm virtualenv/conda environments will have TorchX installed anyway in most cases, so that's not necessarily a blocker, but it changes the model from what we've had so far.
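To show why the macros force a per-worker pass, here is an illustrative resolver; the `${replica_id}` placeholder mirrors the style of `torchx.specs` macros, but this function is a sketch, not the library's implementation:

```python
# Values like replica_id are only known once a concrete worker is being
# launched, so templated args in the app spec must be resolved once per
# replica rather than once per job.
def materialize(arg: str, replica_id: int) -> str:
    return arg.replace("${replica_id}", str(replica_id))

args = ["--rank", "${replica_id}", "--out", "/logs/${replica_id}.out"]
resolved = [materialize(a, replica_id=3) for a in args]
# resolved == ["--rank", "3", "--out", "/logs/3.out"]
```

Deferring this substitution to a runtime inside the job is what lets SLURM see one submission instead of one per worker, at the cost of the extra dependency noted above.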
https://github.com/pytorch/torchx/blob/main/torchx/specs/api.py#L138
I did look, but it doesn't appear that sacct/squeue has a way to hide child jobs. You can use torchx status, so we could add a torchx queue method to render this better for all schedulers.
I think it's hard to see us migrating to use torchx status - squeue gives the status of all jobs which is what I normally check. torchx wouldn't even be aware of all the jobs being run (since they might have been queued outside of torchx). Even if it did, that's introducing a new workflow which I'd imagine most people would want to avoid (unless it gave them some benefit).
re: The job logs are created in per node files
https://github.com/pytorch/torchx/pull/412 makes it so that when running with dist.ddp the node stdout and stderr log lines are prefixed with the local_rank of the worker that produced that line. So you'd see something akin to this:

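To illustrate the effect (this formatter is a stand-in for demonstration, not the actual code from #412): each line a worker writes gets its `local_rank` prepended, so interleaved per-node logs stay attributable:

```python
def prefix_lines(local_rank: int, text: str) -> str:
    """Prefix every log line with the producing worker's local_rank,
    in the spirit of the per-node output #412 produces."""
    return "\n".join(f"[{local_rank}]: {line}" for line in text.splitlines())

print(prefix_lines(0, "starting epoch 1"))
print(prefix_lines(1, "Traceback (most recent call last):"))
```

With this, grepping a node log for `Traceback` immediately tells you which local rank raised, which partially addresses the debugging pain point above.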
We need to work with the lightning team to make sure that the ranks displayed here match the ones used in lightning, which isn't guaranteed to be the case right now as @kiukchung and I discovered the other day.