centreon-engine
Soft recovery after Non-OK->OK
This is a feature request.
For non-OK states, the retries setting already controls how often the check is retried before the host is considered down (hard state). I would like the same for the OK state, i.e., to be able to require multiple successful retries before the host is considered OK.
Sometimes hosts (or services) are down and recover for a short time before failing again. This is not the same as flapping, since flapping checks for frequent OK<->non-OK changes. My case is more like non-OK, non-OK, ...., non-OK, OK, non-OK, ... , non-OK
The problem is that the short period of OK causes two notifications (a host-up and another host-down notification).
Escalation schemes also do not work well, because the downtime counter starts again at zero.
Hi and thank you for your ticket.
Did you already take a look at Detection and Handling of State Flapping? It might be the solution for your original problem.
Let me know if it fits your needs.
Yes, I know about flap detection, but it does not solve the issue.
Flap detection looks at how many state changes happened and compares this to the number of checks done. Most of my checks run every minute. Centreon looks at the last n checks (n=21 by default) and determines how many state changes have occurred. If all checks except one result in non-OK, this means 2 state changes out of 20 possible = 10% state changes. Setting the flapping threshold this low causes a lot of problems, e.g., if all checks except one are OK (because there has been a short disturbance), the host will be considered flapping.
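To make the arithmetic above concrete, here is a minimal Python sketch of the unweighted calculation described in this comment. It is only an illustration: the function name `state_change_percent` is made up, and the engine's real flap detection may weight recent state changes differently, so treat the numbers as the simplified model from this thread rather than the engine's exact output.

```python
def state_change_percent(results):
    """Unweighted share of state changes among consecutive check results.

    Simplified model from this thread, not the engine's actual algorithm.
    """
    # Count every position where a result differs from the previous one.
    transitions = sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)
    # n stored results allow at most n - 1 state changes.
    possible = len(results) - 1
    return 100.0 * transitions / possible


# 21 stored results: a single OK in the middle of a long outage.
down_with_blip = ["DOWN"] * 10 + ["OK"] + ["DOWN"] * 10
# Mirror case: a single failed check on an otherwise healthy host.
up_with_blip = ["OK"] * 10 + ["DOWN"] + ["OK"] * 10

print(state_change_percent(down_with_blip))  # 10.0
print(state_change_percent(up_with_blip))    # 10.0
```

Both histories produce the same percentage, which is why a threshold low enough to catch the first case also flags the second one as flapping.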
The next problem is: if my host is down for a longer period of time (let's say for 180 checks) but recovers for a short amount of time during this period, this is what will happen:
OK
non-OK
...
non-OK <-- host is considered down after the predefined number of non-OK checks
non-OK
...
non-OK
OK <-- now the host is considered UP again (mails will be sent, etc.)
non-OK <-- if flapping is set to a low value, the host is now considered flapping
non-OK
...
non-OK <-- host will be considered non-flapping after n checks
...
non-OK
OK <-- now we are up again
...
non-OK
Using the flap_detection_options to exclude some states will also not help here.
@lpinsivy what do you think about this feature?
IMHO, if a host stays in a non-OK status for a long time, the next OK status is an event, and it is good to be notified about this event.
Why does your host stay in a non-OK status for such a long time?
This is a semantic problem. If a test results in "DOWN" then it is considered a "soft down" and the test is repeated for "Max Check Attempts" times until Centreon is sure that the host is really down and that this is not just a measurement error. The purpose of retrying the check is to deal with possible measurement errors. I want the same possibility for checks that yield OK for the same reason: handling potential measurement errors.
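A small Python sketch may help show what is being asked for. It is a simplified model, not the engine's implementation: the class `HostState`, the helper `run`, and the parameter `recovery_attempts` are hypothetical names invented for this example, and only `max_check_attempts` corresponds to an existing setting ("Max Check Attempts"). A result that contradicts the current hard state is treated as soft until it has been seen enough times in a row.

```python
class HostState:
    """Simplified hard/soft state tracker (illustrative model only)."""

    def __init__(self, max_check_attempts=3, recovery_attempts=1):
        self.max_check_attempts = max_check_attempts  # existing behaviour for DOWN
        self.recovery_attempts = recovery_attempts    # hypothetical: same idea for UP
        self.hard_state = "UP"
        self.streak = 0           # consecutive results contradicting the hard state
        self.notifications = []   # one entry per hard state change

    def process(self, result):
        if result == self.hard_state:
            # The result confirms the hard state, so any soft streak is discarded.
            self.streak = 0
            return
        self.streak += 1
        needed = self.max_check_attempts if result == "DOWN" else self.recovery_attempts
        if self.streak >= needed:
            # Enough consecutive confirmations: switch the hard state and notify.
            self.hard_state = result
            self.streak = 0
            self.notifications.append(result)


def run(timeline, **settings):
    host = HostState(**settings)
    for result in timeline:
        host.process(result)
    return host.notifications


# A long outage with a single successful check in the middle.
timeline = ["UP"] + ["DOWN"] * 10 + ["UP"] + ["DOWN"] * 10

print(run(timeline, recovery_attempts=1))  # ['DOWN', 'UP', 'DOWN']
print(run(timeline, recovery_attempts=3))  # ['DOWN']
```

With `recovery_attempts=1` the model mirrors the behaviour described in this thread: the single OK produces an extra host-up and host-down notification. Requiring three consecutive OK results treats that single OK as a soft recovery and leaves only the original host-down notification.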
A typical check is ping, where RTT and packet loss are measured. If they go above a certain value (e.g., RTT > 3000 ms or PL > 30%), then the connection (=host) is considered DOWN. Depending on the problem, single tests may fall below this threshold by chance although the connection is still broken. This is a typical measurement error.