magpie
Enhancement proposal: making timeout behavior more dynamic
While testing, I found that when the setup time was shorter than expected, the job was unable to use all the walltime available because of the shutdown timeout.
Quickly, I see two ways to fix this.
A- Make MAGPIE_STARTUP_TIME dynamic instead of a user-fixed variable. For Moab, we could use the elapsed walltime reported by checkjob as the startup time in the Magpie_wait_script function. Something along these lines:
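# Elapsed walltime (assumed to be HH:MM:SS) as reported by checkjob
walltime=$(checkjob ${MOAB_JOBID} | grep -Po 'WallTime:\s*\K\S+')
# Convert to minutes, rounding up if there are leftover seconds; 10# forces base 10
startuptime=$((10#${walltime:0:2}*60 + 10#${walltime:3:2} + (10#${walltime:6:2} > 0)))
scriptsleepamounttemp=$((MAGPIE_TIMELIMIT_MINUTES - startuptime))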
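To check the rounding logic without a running Moab job, the conversion can be pulled into a small function and fed a fixed string. This is only a sketch: the function name walltime_to_minutes is made up for illustration, the 120-minute limit is an assumed value, and the input is assumed to be in HH:MM:SS form.

```shell
#!/bin/bash
# Hypothetical helper (not part of Magpie): convert an elapsed time in
# HH:MM:SS form to whole minutes, rounding any leftover seconds up.
walltime_to_minutes() {
    local t=$1
    # 10# forces base 10 so fields such as "08" are not read as octal.
    # (seconds > 0) evaluates to 1 or 0, which adds the round-up minute.
    echo $(( 10#${t:0:2} * 60 + 10#${t:3:2} + (10#${t:6:2} > 0) ))
}

MAGPIE_TIMELIMIT_MINUTES=120   # assumed value, for illustration only
elapsed=$(walltime_to_minutes "00:07:30")
remaining=$(( MAGPIE_TIMELIMIT_MINUTES - elapsed ))
echo "startup took ${elapsed} min, ${remaining} min remain"
```

The parentheses around the seconds comparison matter: without them, the comparison applies to the whole sum and the result collapses to 0 or 1.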
B- Replace Magpie_wait_script with a signal-catching mechanism. Moab can be told to send a pre-termination signal at a chosen time before the job's wall clock limit expires, for example:
-l signal=SIGHUP@5:00
This signal could be caught with the bash trap command, and the script shut down cleanly from the handler.
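A minimal sketch of the trap side of option B. The handler name and the flag variable are made up for illustration, and the script signals itself to stand in for the scheduler; in a real job the SIGHUP would come from Moab.

```shell
#!/bin/bash
shutdown_requested=0

# Hypothetical handler: this is where the shutdown steps would be started.
on_pretermination() {
    shutdown_requested=1
}
trap on_pretermination HUP

# Stand-in for the scheduler: send the pre-termination signal to this shell.
kill -HUP $$

# Wait loop: keep sleeping in short intervals until the signal has been seen.
while [ "$shutdown_requested" -eq 0 ]; do
    sleep 1
done
echo "pre-termination signal received, shutting down"
```

One thing to keep in mind is that bash runs a trap handler only after the current foreground command returns, so a single long sleep would delay the handler. Short sleeps, or a backgrounded sleep followed by wait, avoid that.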
Both solutions are specific to Moab for now, but I think the mechanisms they rely on are provided by most schedulers.
If I had to choose, I would opt for solution B, which I find more elegant because it removes the need for a timeout stopwatch in Magpie.
Yeah, this is a known issue. Primarily been avoiding it b/c I wasn't sure of a portable way to handle it. But perhaps it could just be added as an "enhancement" if you're using Moab?
The enhancement could be Moab specific, but I fear we might need to add a few checks of the content of MAGPIE_SUBMISSION_TYPE.
If we move the definitions of MAGPIE_STARTUP_TIME and MAGPIE_SHUTDOWN_TIME from magpie-magpie-customizations to a scheduler-specific template, we could probably avoid a few of those checks.
Ahh, I see, abstracting out MAGPIE_STARTUP_TIME and MAGPIE_SHUTDOWN_TIME so we can do technique A or B. I think I'd prefer A, as I could see other schedulers having similar functionality. I am skeptical of other schedulers having 'B'-like functionality.
Is this something you're interested in looking into?
I have already tested option A with Moab/Torque, so it should not be too much trouble to abstract it for other schedulers. I can definitely look into this.
You mentioned this is a known issue. Do you have a list of other issues that should be addressed? Listing them on Github would be helpful to see the bigger picture :).
It's not so much a "known issue" but more of a "hmmm, maybe I'll make that better some day" :-)
Most of my "todo's" are in the TODO file. Now that I'm on github, it should definitely be converted into issue/todo lists on there.
I was talking to some folks here, and they said they recommend people use https://github.com/chaos/libyogrt. It doesn't have a command-line/scripting mechanism yet, only a C library.
I've now added dynamic timeout behavior for slurm. In magpie-common-functions you'll find a new function called 'Magpie_job_time_minutes' which you could likely plug in your call above to Moab for the walltime left in the job.