SLURM quality of life improvements
Description
Making a couple of requests to improve QoL on SLURM
Detailed Proposal
It would be helpful to have -
- [x] The ability to specify the output path. Currently, you need to cd to the right path for this, which generally needs a helper function to set up the directory, cd to it, and then launch via torchx. torchx can ideally handle it for us. #416
- [x] Code isolation and reproducibility. While doing research, we make a change, launch an experiment, and repeat. To make sure each experiment uses the same consistent code, we copy the code to the experiment directory (which also helps with reproducibility). #416
- [ ] Verification of the passed launch script. If I launch from the wrong directory, for instance, I would still queue up the job and wait a few minutes / hours, only for it to crash because of a wrong path (i.e. the launch script does not exist).
- [x] Being able to specify a job name. SLURM shows job details, including the job name, when running the `squeue` command. If our jobs are all run via torchx, every job will be named `train_app-{i}`, which makes it hard to identify which experiment / project a job belongs to.
- [x] The `time` argument doesn't say what the unit is - maybe we just follow the SLURM API, but it would be nice if we clarified that.
- [ ] torchx submits jobs in heterogeneous mode. This is something FAIR users aren't familiar with - I'm guessing in terms of execution and command support there should be feature and scheduling speed parity (not sure about the latter)? The `squeue` output shows every node as a separate line - so a 32-node job takes 32 lines instead of 1. This just makes it harder to monitor jobs - not a technical issue, just a QoL one :)
- [x] The job logs are created in `slurm-{job-id}-train_app-{node-id}.out` files (per node) and a single `slurm-{job-id}.out`. Normally, our jobs instead have logs of the form `{job-id}-{node-id}.out` and `{job-id}-{node-id}.err` (per node) - the separation between `stderr` and `stdout` makes it easier to find which machine actually crashed. And I'm not sure what `slurm-{job-id}.out` corresponds to - maybe it's a consequence of the heterogeneous jobs? With torchelastic, it becomes harder to debug which node crashed, since every node logs a crash (so grepping for `Traceback` will return each log file instead of just the node that originally crashed) - maybe there is a way to figure this out and I just don't know what to look for?
- [ ] The `global_rank` is not equal to `local_rank + node_id * gpus_per_node`, i.e. global rank 0 can be on node 3.
- [ ] Automatically set nomem on pcluster.
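To make the rank-layout item above concrete, here is a minimal sketch of the naive expectation that doesn't hold; the helper name is hypothetical, not a TorchX API:

```python
def expected_global_rank(local_rank: int, node_id: int, gpus_per_node: int) -> int:
    """The layout one would naively expect: ranks packed node by node."""
    return local_rank + node_id * gpus_per_node

# With torchelastic's rendezvous, node join order is not deterministic, so
# the RANK actually assigned to a worker can differ from this expectation -
# e.g. global rank 0 may end up on node 3 rather than node 0.
assert expected_global_rank(local_rank=0, node_id=0, gpus_per_node=8) == 0
assert expected_global_rank(local_rank=3, node_id=2, gpus_per_node=8) == 19
```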
adding this support for slurm wouldn't be too bad:
- generalize the workspace file logic from docker_workspace (.torchxignore)
- add a job_dir argument to allow specifying an isolation env
- change launch code to cp + chdir
- add some statefile (.torchxjobdirs) so torchx log knows where to find logs for slurm
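The cp + chdir flow sketched above could look roughly like this; everything here (the function name, the `.torchxignore` and `.torchxjobdirs` handling) is an illustration of the proposal, not existing TorchX code:

```python
import json
import shutil
from pathlib import Path

def prepare_job_dir(workspace: str, job_dir: str, job_id: str) -> str:
    """Copy the workspace into an isolated per-job directory (cp), record it
    in a .torchxjobdirs statefile so `torchx log` can later find the logs,
    and return the directory the launcher should chdir into."""
    dst = Path(job_dir) / job_id

    # Honor .torchxignore patterns, generalizing the docker_workspace logic.
    ignore = None
    ignore_file = Path(workspace) / ".torchxignore"
    if ignore_file.exists():
        ignore = shutil.ignore_patterns(*ignore_file.read_text().split())
    shutil.copytree(workspace, dst, ignore=ignore)

    # Statefile mapping job_id -> job dir, as proposed above.
    state = Path(job_dir) / ".torchxjobdirs"
    entries = json.loads(state.read_text()) if state.exists() else {}
    entries[job_id] = str(dst)
    state.write_text(json.dumps(entries))
    return str(dst)
```

The launcher would then `os.chdir()` into the returned directory before invoking `sbatch`, which also gives the code isolation / reproducibility behavior requested above.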
something like job_dir we could relatively easily extend to local_cwd, local_docker -- more complex for k8s/batch/ray
For the heterogeneous jobs displaying differently, that's tricky in the current model. Macros like `replica_id` generally need to be applied on a per-worker basis. If we wrap the app in a runtime, that does allow us to materialize them later, though it adds an extra dependency. Slurm virtualenv/conda environments will have TorchX installed anyway in most cases, so that's not necessarily a blocker, but it changes the model from what we've had so far.
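To show why the macros force a per-worker pass, here is an illustrative resolver; the `${replica_id}` placeholder mirrors the style of `torchx.specs` macros, but this function is a sketch, not the library's implementation:

```python
# Values like replica_id are only known once a concrete worker is being
# launched, so templated args in the app spec must be resolved once per
# replica rather than once per job.
def materialize(arg: str, replica_id: int) -> str:
    return arg.replace("${replica_id}", str(replica_id))

args = ["--rank", "${replica_id}", "--out", "/logs/${replica_id}.out"]
resolved = [materialize(a, replica_id=3) for a in args]
# resolved == ["--rank", "3", "--out", "/logs/3.out"]
```

Deferring this substitution to a runtime inside the job is what lets SLURM see one submission instead of one per worker, at the cost of the extra dependency noted above.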
https://github.com/pytorch/torchx/blob/main/torchx/specs/api.py#L138
I did look, but it doesn't appear that sacct/squeue has a way to hide child jobs. You can use torchx status, so we could add a torchx queue method to render this better for all schedulers.
I think it's hard to see us migrating to use torchx status - squeue gives the status of all jobs which is what I normally check. torchx wouldn't even be aware of all the jobs being run (since they might have been queued outside of torchx). Even if it did, that's introducing a new workflow which I'd imagine most people would want to avoid (unless it gave them some benefit).
re: The job logs are created in per node files
https://github.com/pytorch/torchx/pull/412 makes it so that when running with dist.ddp the node stdout and stderr log lines are prefixed with the local_rank of the worker that produced that line. So you'd see something akin to this:

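To illustrate the effect (this formatter is a stand-in for demonstration, not the actual code from #412): each line a worker writes gets its `local_rank` prepended, so interleaved per-node logs stay attributable:

```python
def prefix_lines(local_rank: int, text: str) -> str:
    """Prefix every log line with the producing worker's local_rank,
    in the spirit of the per-node output #412 produces."""
    return "\n".join(f"[{local_rank}]: {line}" for line in text.splitlines())

print(prefix_lines(0, "starting epoch 1"))
print(prefix_lines(1, "Traceback (most recent call last):"))
```

With this, grepping a node log for `Traceback` immediately tells you which local rank raised, which partially addresses the debugging pain point above.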
We need to work with the lightning team to make sure that the ranks displayed here match the ones used in lightning, which isn't guaranteed to be the case right now as @kiukchung and I discovered the other day.