checkmk
Fix false critical on OMD backup job when agent runs at the time the backup is about to start
Prevent this false critical alert:
Host | |
---|---|
Service | OMD |
Event | OK → CRITICAL |
Time | Mon Oct 23 01:30:05 EDT 2023 |
Summary | Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT |
Details | Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT |
Host Metrics | rta=0.010ms;200.000;500.000;0; pl=0%;80;100;; rtmax=0.038ms;;;; rtmin=0.002ms;;;; |
Service Metrics | backup_duration=123.582501;;;; backup_avgspeed=865828.190744;;;; backup_size=446827456;;;; |
This happens when the backup is about to start (here at 01:30:00) but has not yet started when the agent runs (also around 01:30:00 in this case; the alert was generated at 01:30:05). According to the logs, the backup actually started at 01:30:03. It is normal for a cron job to start a few seconds late. Add the gap between the check and the alert time reported by Checkmk, and the result is a false critical. The 30-second buffer prevents this corner case from ever happening again.
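To illustrate the idea, here is a minimal sketch of the comparison with and without the buffer. It is not the actual check plugin code: the function name `next_run_is_overdue` and the constant `GRACE` are hypothetical, and it assumes the check goes critical whenever the scheduled next run lies in the past.

```python
from datetime import datetime, timedelta

# Hypothetical constant: the 30-second buffer described in this PR.
GRACE = timedelta(seconds=30)


def next_run_is_overdue(
    next_run: datetime, now: datetime, grace: timedelta = GRACE
) -> bool:
    """Return True only if the scheduled run is overdue by more than `grace`."""
    return now > next_run + grace


# Timestamps taken from the alert above.
next_run = datetime(2023, 10, 23, 1, 30, 0)
check_time = datetime(2023, 10, 23, 1, 30, 5)

# Without a buffer, a check 5 seconds after the scheduled time is flagged.
print(next_run_is_overdue(next_run, check_time, grace=timedelta(0)))  # True

# With the 30-second buffer, the same check is not flagged.
print(next_run_is_overdue(next_run, check_time))  # False
```

A backup that really failed to start is still reported, only up to 30 seconds later than before.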
I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.