Watchdog: a simplified method to compute time left to kill jobs that are going to run out of time
Currently, the Watchdog seems to compute the "time left" from the CPU work, which is the product of the CPU time reported by the underlying batch system (accurate in most cases, I guess) and the CPU power, which may be inaccurate in some cases.
Then, based on this "time left" value, the Watchdog seems to apply fairly complex logic to decide whether a job should be killed or not.
- First it performs a check every `checkingTime` until `timeLeft < grossTimeLeftLimit` (`grossTimeLeftLimit` being 18,000, see here).
- When this happens, `timeLeft` is then computed every `pollingTime`, and the variable `littleTimeLeftCount`, initialized to 15, is decremented every `pollingTime` (it can apparently become negative, see here).
- When `timeLeft < fineTimeLimitLeft` (`fineTimeLimitLeft` being `150 * pollingTime` by default) and `littleTimeLeftCount == 0` (keeping in mind that it can also be negative), the job is killed.
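
To make the three steps above easier to discuss, here is a minimal sketch of them as I read them. It is a model of the description in this issue, not the actual Watchdog code: `WatchdogState`, `step` and `fine_mode` are hypothetical names, and the units of `time_left` are whatever the Watchdog currently uses.

```python
from dataclasses import dataclass

GROSS_TIME_LEFT_LIMIT = 18_000   # value quoted in the description above
LITTLE_TIME_LEFT_COUNT = 15      # initial value quoted in the description above


@dataclass
class WatchdogState:
    """Hypothetical container for the state carried between checks."""
    fine_mode: bool = False
    little_time_left_count: int = LITTLE_TIME_LEFT_COUNT


def step(state: WatchdogState, time_left: float, polling_time: float) -> bool:
    """Run one check and return True if the job would be killed."""
    fine_time_limit_left = 150 * polling_time  # default quoted above

    if not state.fine_mode:
        # Coarse phase: checked every checkingTime until the gross limit is crossed.
        if time_left < GROSS_TIME_LEFT_LIMIT:
            state.fine_mode = True
        return False

    # Fine phase: checked every pollingTime; the counter is decremented each time
    # and nothing stops it from going below zero.
    state.little_time_left_count -= 1
    return time_left < fine_time_limit_left and state.little_time_left_count == 0
```

Under this literal reading, the kill condition can only be true on the 15th fine-phase check: if `time_left` is still above `fine_time_limit_left` at that moment, the counter moves on to negative values and the equality test never succeeds again.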
I would like to simplify this logic as follows:
- We add a `TimeLeft.getCPUTimeLeft()` method to get the CPU time left in seconds, and `TimeLeft.getTimeLeft()` in this case becomes `getCPUWorkLeft()`. In the Watchdog we use this new method to get the time left in seconds: I guess it would be more accurate.
- Once `timeLeft < 4000s` (or maybe `checkingTime * 1.5`), we do a regular check every `pollingTime`.
- Once `timeLeft < 600s` (10 minutes), we kill the job.
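
A rough sketch of what the simplified decision could look like, assuming the time left is already expressed in seconds. The function name `next_action` and its parameters are hypothetical and only illustrate the proposal; the 4000 s and 600 s defaults are the values suggested above.

```python
def next_action(cpu_time_left: float,
                checking_time: float,
                polling_time: float,
                fine_limit: float = 4000,
                kill_limit: float = 600) -> tuple[bool, float]:
    """Return (kill_job, seconds_until_next_check) for the proposed logic.

    cpu_time_left is the CPU time left in seconds, i.e. what the proposed
    TimeLeft.getCPUTimeLeft() would return.
    """
    if cpu_time_left < kill_limit:
        # Less than 10 minutes left: kill the job now.
        return True, 0
    if cpu_time_left < fine_limit:
        # Getting close: switch to regular checks every pollingTime.
        return False, polling_time
    # Plenty of time left: keep the coarse check every checkingTime.
    return False, checking_time
```

The alternative threshold mentioned above can be tried by passing `fine_limit=checking_time * 1.5`. There is no counter and no CPU power involved, so the decision depends only on the seconds reported by the batch system.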
There are probably many historical reasons that I do not understand, or use cases that I do not know about, that would explain this complex logic. Let me know if you have further details or comments about what I propose.
The possible historical reasons behind the way the watchdog currently works are too historical even for me! And I have not reviewed the module myself in recent years. So I can't really judge from that point of view; maybe @atsareg or @phicharp can give you an explanation. At first sight, the logic as you summarize it here looks "weird", but there must certainly be some good reasons. I don't mind what you are proposing but, as I said, my understanding of the original logic is incomplete.
I doubt this logic really works (at least in some cases it does not seem to work properly). For instance, in SDumont, jobs are mainly stopped by the batch system rather than the Watchdog. This is certainly due to DB12 values that are not accurate enough on this Site, but this also raises a question: does the Watchdog really need to take the CPU power into account to stop a job?