
Fix false critical on OMD backup job when agent runs at the time the backup is about to start

Open dnlldl opened this issue 8 months ago • 0 comments

Prevent this false critical alert:

Host
Service OMD backup
Event OK → CRITICAL
Time Mon Oct 23 01:30:05 EDT 2023
Summary Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT
Details Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT
Host Metrics rta=0.010ms;200.000;500.000;0; pl=0%;80;100;; rtmax=0.038ms;;;; rtmin=0.002ms;;;;
Service Metrics backup_duration=123.582501;;;; backup_avgspeed=865828.190744;;;; backup_size=446827456;;;;

Basically, this happens when the backup is about to start (here at 01:30:00) but has not started yet when the agent runs (in this case also around 01:30:00; the alert was generated at 01:30:05). According to the logs, the backup actually started at 01:30:03. A small delay is normal for a cron job; add the gap between the agent's check and the alert time reported by Checkmk, and we get a false critical. The 30-second buffer prevents this corner case from ever happening again. I'm aware that 30 seconds is arbitrary; an alternative would be to require 2 consecutive checks or similar before the service turns critical. Any suggestions are welcome.
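
To make the idea concrete, here is a minimal sketch of the grace-period comparison. It is not the actual Checkmk check code: the function name `next_run_is_overdue` and the constant `GRACE_PERIOD` are hypothetical and only illustrate the logic, using the timestamps from the notification above.

```python
from datetime import datetime, timedelta

# Hypothetical constant: how long past the scheduled time we tolerate
# before treating the missed start as critical.
GRACE_PERIOD = timedelta(seconds=30)


def next_run_is_overdue(now, next_run, grace=GRACE_PERIOD):
    """Return True only if the scheduled run is overdue by more than `grace`."""
    return now > next_run + grace


# Timestamps taken from the notification in this PR description.
scheduled = datetime(2023, 10, 23, 1, 30, 0)
alert_time = datetime(2023, 10, 23, 1, 30, 5)

# Without a buffer the check is already "overdue" 5 seconds after the schedule.
overdue_without_buffer = next_run_is_overdue(alert_time, scheduled, timedelta(0))

# With the 30-second buffer the same moment is still within tolerance.
overdue_with_buffer = next_run_is_overdue(alert_time, scheduled)
```

A backup that really failed to start would still be reported, only 30 seconds later. The alternative of waiting for 2 consecutive checks would need state kept between check runs, which this sketch does not cover.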

I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.

dnlldl • Jun 20 '24 01:06