
python: improve integration with user job batch scripts

Open adammoody opened this issue 1 year ago • 0 comments

Our scr_run.py script currently launches the user job with the launcher process via subprocess.Popen. There are a few challenges with this:

  1. Currently, we buffer all stdout and stderr and print them only at the end. Users want to monitor their output while the job is running, so we should at least print it more frequently.
  2. In some cases, users may also need to forward stdin?
  3. Running with profilers/debuggers may be complicated, since those need to wrap the launcher, e.g., totalview srun -a ...

It would be good to look into solutions for the above.
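
One possible direction for the first two items is sketched below. It is not the current scr_run.py code: the helper name run_streaming and its out parameter are hypothetical. It reads the launcher's output line by line and forwards each line immediately instead of collecting everything until the job exits. Since stdin is not passed to subprocess.Popen, the child inherits the parent's stdin, which may be enough for the stdin case.

```python
import subprocess
import sys


def run_streaming(argv, out=sys.stdout):
    """Launch argv and forward its output line by line as it is produced.

    Hypothetical helper for illustration, not part of SCR.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to keep ordering
        text=True,
        bufsize=1,  # line-buffered reads on our side of the pipe
        # stdin is left unset, so the child inherits our stdin
    )
    # Iterating over the pipe yields each line as soon as the child writes it
    for line in proc.stdout:
        out.write(line)
        out.flush()
    proc.stdout.close()
    return proc.wait()
```

Note that the child may still buffer its own output when it sees a pipe instead of a terminal, so this only removes the buffering on our side. Merging stderr into stdout is a simplification; keeping the two streams separate would need threads or a selector.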

As a fallback, and perhaps as the recommended approach, we should also ensure that people can continue to use their existing job scripts and just add a few additional commands to integrate with SCR. At the least, I think we want to allow users to invoke:

  • scr_prerun - to prepare the allocation for SCR
  • scr_list_down_nodes - to rely on SCR to test for node health and return a list of down or healthy nodes. Leave it to the user to then incorporate that list into a relaunch command. Documentation here can help, e.g., pointing users to srun -x <downnodes> as a way to avoid certain nodes with srun.
  • scr_should_exit - to determine whether to stop the run. This will check that there are enough healthy nodes, enough time, and verify that an SCR halt condition has not been set.
  • scr_postrun - to check for and scavenge any cached datasets

For users with bash job scripts, we want these commands to return 0/1 exit codes. Output like the node list should be printed to stdout, and it should be formatted in a way to make it easy for the user to integrate, e.g., potentially format the down node list differently for srun vs jsrun.

For users with python job scripts, we get bonus points if they can import and use SCR modules. For the first pass, let's just stick with requiring the user's python job script to invoke these as commands like the bash job scripts do.
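
As a rough sketch of what such a python job script could look like under this first-pass approach, the example below invokes the four commands through subprocess. Several details are assumptions made for illustration and are not settled in this issue: that the commands take no arguments, that scr_prerun exits 0 on success, that scr_list_down_nodes prints a comma-separated list on stdout, and that scr_should_exit exits 0 when the run should stop. The function name job_loop and the max_runs parameter are hypothetical.

```python
import subprocess


def job_loop(launch_argv, max_runs=3, run=subprocess.run):
    """Run launch_argv up to max_runs times with SCR commands around it.

    Hypothetical sketch; command arguments and exit-code meanings are assumed.
    """
    # Assumption: scr_prerun exits 0 when the allocation is ready for SCR
    if run(["scr_prerun"]).returncode != 0:
        return 1

    rc = 1
    for _ in range(max_runs):
        # Assumption: down nodes are printed to stdout as a comma-separated list
        result = run(["scr_list_down_nodes"], capture_output=True, text=True)
        down = result.stdout.strip()

        argv = list(launch_argv)
        if down:
            # Insert "-x <downnodes>" right after the launcher name (srun)
            argv[1:1] = ["-x", down]

        rc = run(argv).returncode
        if rc == 0:
            break

        # Assumption: exit code 0 from scr_should_exit means "stop the run"
        if run(["scr_should_exit"]).returncode == 0:
            break

    # Always give SCR the chance to scavenge cached datasets
    run(["scr_postrun"])
    return rc
```

A job script could then end with something like `sys.exit(job_loop(["srun", "./my_app"]))`, where ./my_app stands in for the user's application. The run parameter exists only so that the command calls can be replaced when testing the script without an allocation.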

adammoody, Oct 23 '23 19:10