
Enhancement proposal: making timeout behavior more dynamic

Open cmd-ntrf opened this issue 10 years ago • 7 comments

While testing, I found that when the setup time was shorter than expected, the job was unable to use all the walltime available because of the shutdown timeout.

Quickly, I see two ways to fix this. A- Make MAGPIE_STARTUP_TIME dynamic instead of a fixed, user-set variable. For Moab, we could use the walltime reported by checkjob as the startup time in the Magpie_wait_script function. Something along these lines:

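  # Elapsed walltime as reported by checkjob (assumed to be HH:MM:SS)
  walltime=$(checkjob ${MOAB_JOBID} | grep -Po '(?<=WallTime:  \s).*' | cut -d' ' -f1)
  # Startup time in minutes, rounding leftover seconds up; 10# forces base 10 so "08" is not read as octal
  startuptime=$((10#${walltime:0:2}*60 + 10#${walltime:3:2} + (10#${walltime:6:2}>0)))
  scriptsleepamounttemp=`expr ${MAGPIE_TIMELIMIT_MINUTES} - ${startuptime}`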
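To make the arithmetic above easier to check without a Moab cluster, here is a minimal sketch that isolates the conversion. The function name walltime_to_minutes is hypothetical (it is not part of Magpie), and it assumes the walltime string is always zero-padded HH:MM:SS.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Magpie: convert a zero-padded HH:MM:SS
# walltime into whole minutes, rounding any leftover seconds up to a minute.
walltime_to_minutes() {
    local walltime=$1
    # 10# forces base 10, so fields such as "08" or "09" are not parsed as octal.
    local hours=$((10#${walltime:0:2}))
    local minutes=$((10#${walltime:3:2}))
    local seconds=$((10#${walltime:6:2}))
    # (seconds > 0) evaluates to 1 or 0 inside bash arithmetic.
    echo $(( hours * 60 + minutes + (seconds > 0) ))
}

walltime_to_minutes "00:08:09"   # prints 9
walltime_to_minutes "01:02:00"   # prints 62
```

The parentheses around the seconds comparison matter: without them, the comparison applies to the whole sum and the result collapses to 0 or 1.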

B- Replace Magpie_wait_script with a signal-catching mechanism. Moab can be told to send a pre-termination signal a chosen amount of time before the job's wall clock limit expires, for example:

-l signal=SIGHUP@5:00

This signal could be caught with the bash trap command, and the script shut down cleanly.
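As a rough illustration of option B, here is a minimal sketch of catching the signal with trap. The handler name on_pretermination and its body are placeholders rather than Magpie's actual teardown, and the kill line only simulates what the scheduler would send.

```shell
#!/usr/bin/env bash
# Sketch only: the handler body is a placeholder, not Magpie's real teardown.
cleanup_done=0

on_pretermination() {
    echo "pre-termination signal received, shutting down"
    cleanup_done=1
}

# Run the handler when SIGHUP is delivered to this shell.
trap on_pretermination HUP

# Simulate the scheduler by sending SIGHUP to the current shell process.
# BASHPID is the PID of the current bash process; fall back to $$ elsewhere.
kill -HUP "${BASHPID:-$$}"

echo "cleanup_done=${cleanup_done}"
```

One caveat for a real job script: bash defers a trap until the current foreground command returns, so a long-running job would likely need to be started in the background and waited on with wait for the handler to fire promptly.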

Both solutions are specific to Moab for now, but I think that the mechanisms used are provided by most schedulers.

If I had to choose, I would opt for solution B, which I find more elegant as it removes the need for a timeout stopwatch in Magpie.

cmd-ntrf avatar Jan 09 '15 15:01 cmd-ntrf

Yeah, this is a known issue. Primarily been avoiding it b/c I wasn't sure of a portable way to handle it. But perhaps it could just be added as an "enhancement" if you're using Moab?

chu11 avatar Jan 09 '15 20:01 chu11

The enhancement could be Moab specific, but I fear we might need to add a few checks of the content of MAGPIE_SUBMISSION_TYPE.

If we move the definitions of MAGPIE_STARTUP_TIME and MAGPIE_SHUTDOWN_TIME from magpie-magpie-customizations to a scheduler-specific template, we could probably avoid a few checks.

cmd-ntrf avatar Jan 09 '15 20:01 cmd-ntrf

Ahh, I see, abstracting out the MAGPIE_STARTUP_TIME and MAGPIE_SHUTDOWN_TIME so we can do technique A or B. I think I'd prefer A, as I could see other schedulers having similar functionality. I am skeptical that other schedulers have 'B'-like functionality.

Is this something you're interested in looking into?

chu11 avatar Jan 09 '15 23:01 chu11

I have already tested option A with Moab/Torque, so it should not be too much trouble to abstract it for other schedulers. I can definitely look into this.

You mentioned this is a known issue. Do you have a list of other issues that should be addressed? Listing them on Github would be helpful to see the bigger picture :).

cmd-ntrf avatar Jan 12 '15 15:01 cmd-ntrf

It's not so much a "known issue" but more of a "hmmm, maybe I'll make that better some day" :-)

Most of my "todo's" are in the TODO file. Now that I'm on github, it should definitely be converted into issue/todo lists on there.

chu11 avatar Jan 12 '15 18:01 chu11

I was talking to some folks here, and they said they recommend people use https://github.com/chaos/libyogrt. It doesn't have a command line/scripting mechanism yet, only a C library part.

chu11 avatar Feb 26 '15 21:02 chu11

I've now added dynamic timeout behavior for slurm. In magpie-common-functions you'll find a new function called 'Magpie_job_time_minutes', into which you could likely plug your Moab call above for the walltime left in the job.

chu11 avatar Jun 26 '15 23:06 chu11