DIRAC

Watchdog: a simplified method to compute time left to kill jobs that are going to run out of time

Open aldbr opened this issue 4 years ago • 2 comments

Currently, the Watchdog seems to compute the "time left" from the CPU work, which is the product of two quantities: the CPU time reported by the underlying batch system, which is accurate in most cases (I guess), and the CPU power, which may be inaccurate in some cases.
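To make the concern concrete, here is a minimal sketch of how an error in the CPU power carries over to the CPU work. The function name and the numbers are purely illustrative assumptions; they are not taken from the DIRAC code.

```python
def cpu_work_left(cpu_time_left_s: float, cpu_power: float) -> float:
    """CPU work left = CPU time left (seconds) * CPU power (benchmark units)."""
    return cpu_time_left_s * cpu_power


# Illustrative values only (assumptions, not measurements).
time_left_s = 3600.0     # CPU time left according to the batch system
true_power = 10.0        # CPU power the node really delivers
measured_power = 12.0    # benchmark result that overestimates the power by 20%

work_true = cpu_work_left(time_left_s, true_power)
work_measured = cpu_work_left(time_left_s, measured_power)

# The relative error on the CPU power shows up unchanged in the CPU work.
relative_error = (work_measured - work_true) / work_true
print(f"true work: {work_true}, measured work: {work_measured}")
print(f"relative error: {relative_error:.0%}")
```

Since the work is a plain product, any relative error on the CPU power becomes the same relative error on the "time left", however accurate the batch system's CPU time is.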

Then, based on this "time left" value, the Watchdog seems to apply fairly complex logic to decide whether a job should be killed.

  • First, it performs a check every checkingTime until timeLeft < grossTimeLeftLimit, where grossTimeLeftLimit is 18,000 (see here).
  • From then on, timeLeft is computed every pollingTime, and the variable littleTimeLeftCount, initialized to 15, is decremented every pollingTime (apparently it can become negative; see here).
  • When timeLeft < fineTimeLimitLeft (150 * pollingTime by default) and littleTimeLeftCount == 0 (keeping in mind that it can also be negative), the job is killed.
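The three steps above can be sketched roughly as follows. This is only my reading of the behaviour summarized here, not the actual Watchdog code: the class name, the method name, and the choice not to decrement the counter on the check that switches to fine polling are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CurrentLogicSketch:
    """Hypothetical model of the kill logic as summarized in this issue."""

    polling_time: int                    # seconds between fine-grained checks
    gross_time_left_limit: int = 18_000  # value quoted above
    little_time_left_count: int = 15     # initial value quoted above
    fine_mode: bool = False              # True once the gross limit was crossed

    @property
    def fine_time_left_limit(self) -> int:
        # Default quoted above: 150 * pollingTime
        return 150 * self.polling_time

    def check(self, time_left: float) -> bool:
        """Run one check; return True if the job would be killed."""
        if not self.fine_mode:
            # Coarse phase: one check every checkingTime.
            if time_left < self.gross_time_left_limit:
                self.fine_mode = True
            return False
        # Fine phase: one check every pollingTime, counter decremented each time.
        self.little_time_left_count -= 1
        return (
            time_left < self.fine_time_left_limit
            and self.little_time_left_count == 0
        )


# If timeLeft is still above the fine limit when the counter reaches 0,
# the counter goes negative and the kill condition can no longer be met.
watchdog = CurrentLogicSketch(polling_time=10)
watchdog.check(17_000)                                     # switches to fine mode
late = [watchdog.check(5_000) for _ in range(15)]          # counter reaches 0
after = [watchdog.check(100) for _ in range(5)]            # counter is negative
print(any(late), any(after), watchdog.little_time_left_count)
```

In this reading, the job is only killed if timeLeft happens to be below the fine limit at the exact poll where the counter reaches 0, which is the part I find fragile.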

I would like to simplify this logic as follows:

  • We add a TimeLeft.getCPUTimeLeft() method that returns the CPU time left in seconds, and the current TimeLeft.getTimeLeft() becomes getCPUWorkLeft(). The Watchdog then uses the new method to get the time left in seconds, which I guess would be more accurate.
  • Once timeLeft < 4000s (or maybe checkingTime * 1.5), we check regularly every pollingTime.
  • Once timeLeft < 600s (10 minutes), we kill the job.
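A minimal sketch of the simplified decision, assuming the input is the CPU time left in seconds as the proposed getCPUTimeLeft() would return it. The enum, the function name, and the constant names are hypothetical; only the 4000s and 600s thresholds come from the list above.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"  # keep checking every checkingTime
    POLL = "poll"  # switch to checking every pollingTime
    KILL = "kill"  # stop the job before the batch system does


# Thresholds from the proposal; the names are made up for this sketch.
POLL_THRESHOLD_S = 4000
KILL_THRESHOLD_S = 600


def decide(
    cpu_time_left_s: float,
    poll_threshold_s: float = POLL_THRESHOLD_S,
    kill_threshold_s: float = KILL_THRESHOLD_S,
) -> Action:
    """Map the CPU time left (seconds) to the action the Watchdog would take."""
    if cpu_time_left_s < kill_threshold_s:
        return Action.KILL
    if cpu_time_left_s < poll_threshold_s:
        return Action.POLL
    return Action.WAIT


for seconds in (5000, 3999, 600, 599):
    print(seconds, decide(seconds).value)
```

The decision depends only on the latest time-left value, so there is no counter to keep in sync, and the CPU power no longer enters the kill condition.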

There are probably many historical reasons that I do not understand, or use cases that I do not know about, that would explain this complex logic. Let me know if you have further details or comments about what I propose.

aldbr avatar Apr 30 '21 09:04 aldbr

The possible historical reasons behind the current behaviour of the Watchdog are too historical even for me! And I have not reviewed the module myself in recent years, so I can't really judge from that point of view; maybe @atsareg or @phicharp can give you an explanation. At first sight, the logic that you are summarizing here looks "weird", but there must certainly be some good reasons for it. I don't mind what you are proposing but, as I said, my understanding of the original logic is incomplete.

fstagni avatar May 03 '21 17:05 fstagni

I doubt this logic really works (at least in some cases it does not seem to work properly). For instance, in SDumont, jobs are mainly stopped by the batch system rather than the Watchdog. This is certainly due to DB12 values that are not accurate enough on this Site, but this also raises a question: does the Watchdog really need to take the CPU power into account to stop a job?

aldbr avatar May 06 '21 10:05 aldbr