python: improve integration with user job batch scripts
Our `scr_run.py` script currently launches the user job with the launcher process via `subprocess.Popen`. There are a few challenges with this:
- Currently, we buffer all `stdout` and `stderr` and only print them at the end. Users will want us to print this output more frequently, since people want to monitor their output while the job is running.
- In some cases, users may also need to forward `stdin`?
- Running with profilers/debuggers may be complicated, since those need to wrap the launcher, like `totalview srun -a ...`
It would be good to look into solutions for the above.
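As one possible starting point for the buffering issue, here is a minimal sketch that forwards the job's output line by line while it runs instead of collecting it until the end. The function name `stream_job` and its `on_line` callback are hypothetical and not part of `scr_run.py`; the sketch also merges `stderr` into `stdout`, which may not be what we want in the end.

```python
import subprocess
import sys


def _print_line(line):
    """Default sink: write the line to our own stdout right away."""
    sys.stdout.write(line)
    sys.stdout.flush()


def stream_job(cmd, on_line=None):
    """Run cmd and hand each output line to on_line as soon as it arrives.

    Hypothetical helper for illustration only. Returns the exit code of cmd.
    """
    emit = on_line or _print_line
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to keep ordering
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    )
    with proc.stdout:
        for line in proc.stdout:
            emit(line)
    return proc.wait()
```

Note that the child may still block-buffer its own output when it detects it is writing to a pipe, so lines can arrive in bursts. If we do not need to capture the output at all, leaving `stdin`, `stdout`, and `stderr` unset in `subprocess.Popen` lets the child inherit ours directly, which would also cover the `stdin` case.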
As a fallback, and perhaps as the recommended approach, we should also ensure that people can continue to use their existing job scripts and just add a few additional commands to integrate with SCR. At the least, I think we want to allow users to invoke:
- `scr_prerun` - to prepare the allocation for SCR
- `scr_list_down_nodes` - to rely on SCR to test for node health and return a list of down or healthy nodes. Leave it to the user to then incorporate that list into a relaunch command. Documentation here can help, e.g., pointing users to `srun -x <downnodes>` as a way to avoid certain nodes with srun.
- `scr_should_exit` - to determine whether to stop the run. This will check that there are enough healthy nodes and enough time, and verify that an SCR halt condition has not been set.
- `scr_postrun` - to check for and scavenge any cached datasets
For users with bash job scripts, we want these commands to return 0/1 exit codes. Output like the node list should be printed to stdout, and it should be formatted in a way to make it easy for the user to integrate, e.g., potentially format the down node list differently for `srun` vs `jsrun`.
For users with python job scripts, we get bonus points if they can import and use SCR modules. For the first pass, let's just stick with requiring the user's python job script to invoke these as commands like the bash job scripts do.
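To make that first pass concrete, here is a rough sketch of what a user's python job script might look like when it calls the four commands through `subprocess`. Several details are assumptions, since none of them are settled yet: the commands are invoked with no arguments, `scr_list_down_nodes` prints a comma-separated node list on stdout, `scr_should_exit` returns 0 when the run should stop, and the launch command starts with `srun`. The name `run_with_scr` and the `max_attempts` and `run` parameters are hypothetical.

```python
import subprocess


def run_with_scr(launch_cmd, max_attempts=3, run=subprocess.run):
    """Wrap an srun-style launch command with the SCR helper commands.

    Illustrative only: argument lists, output format, and exit-code
    meanings of the scr_* commands are assumptions, not the real interface.
    Returns the number of times the job was launched.
    """
    run(["scr_prerun"], check=True)
    launches = 0
    try:
        for _ in range(max_attempts):
            # Assumed: down nodes come back as "node1,node2" on stdout.
            result = run(["scr_list_down_nodes"], capture_output=True, text=True)
            down = (result.stdout or "").strip()

            cmd = list(launch_cmd)
            if down:
                # srun -x <downnodes> excludes the listed nodes.
                cmd[1:1] = ["-x", down]
            run(cmd)
            launches += 1

            # Assumed: exit code 0 means "stop the run".
            if run(["scr_should_exit"]).returncode == 0:
                break
    finally:
        # Always give SCR a chance to scavenge cached datasets.
        run(["scr_postrun"])
    return launches
```

A user script would call something like `run_with_scr(["srun", "-n", "4", "./app"])`. The `run` parameter exists only so the flow can be exercised without a real allocation.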