
Flapping detection

Open DmitryFrolovTri opened this issue 3 years ago • 2 comments

Is your feature request related to a problem? Please describe. If a monitored site sits behind a load balancer (as nearly all sites do now), it is possible that only 1 node of N fails the check once in a while. In that case Checkly sends an alert and then clears it, so the support team cannot react because the incident does not stay open for long. Problem statement: Checkly cannot keep an alert open for a check that frequently succeeds and sometimes fails. It can only have an open alert for a check that is failing at this moment.

Describe the solution you'd like I want Checkly to inform us via an alert when a site has an infrequent failure that repeats after some number of checks.

In the alert setup, where the rules are defined, add a radio button for a different way of detecting a failure. Let's call that "flapping detection logic" (please find a better name for it) :)

The idea of such a check is:

  • Mandatory: Ask the user for a time duration, which should always be longer than the check frequency, for example a 10-minute or 5-minute interval. Alternatively, the duration could be specified in multiples of the check frequency (e.g. 2, 3 or 4 times the frequency), which then works dynamically for any frequency the check has. For a check that runs every minute, a duration of 3 times the frequency is 1x3, i.e. a 3-minute duration. I would prefer the time duration to be specified as a number of checks.
  • Optional: Show the user the number of checks that would be run in the duration selected above.
  • Mandatory: Ask the user for the minimum number of failed checks (the "checks number") within that duration at which the alert should be raised.

Then, during the check's lifetime, if the number of failed checks within the specified time duration is >= the checks number, the alert is raised. During the next time duration, if there is already an open alert and the number of failed checks is again >= the checks number, the alert is kept open; otherwise it is closed.

Generating alerts this way would allow flapping detection and would also alert on total downtime.
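
The rule described above can be sketched in a few lines of Python. This is only an illustration of the proposal, not how Checkly works: the class name `FlappingDetector`, its parameters, and the choice of non-overlapping windows counted in checks are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class FlappingDetector:
    """Hypothetical sketch of the proposed rule, using non-overlapping windows."""

    window_size: int   # duration, expressed as a number of checks
    min_failures: int  # the "checks number": failures per window that trigger an alert
    alert_open: bool = field(default=False, init=False)
    _seen: int = field(default=0, init=False, repr=False)
    _failed: int = field(default=0, init=False, repr=False)

    def __post_init__(self) -> None:
        if not 1 <= self.min_failures <= self.window_size:
            raise ValueError("min_failures must be between 1 and window_size")

    def record(self, check_passed: bool) -> bool:
        """Feed one check result and return whether the alert is open afterwards."""
        self._seen += 1
        if not check_passed:
            self._failed += 1

        # Raise the alert as soon as the threshold is reached inside the window.
        if self._failed >= self.min_failures:
            self.alert_open = True

        # At the end of the window, close the alert if the threshold was not met,
        # then start counting the next window from zero.
        if self._seen == self.window_size:
            if self._failed < self.min_failures:
                self.alert_open = False
            self._seen = 0
            self._failed = 0

        return self.alert_open


# One failure in a 10-check window opens the alert; it stays open
# until a full window passes without reaching the threshold.
detector = FlappingDetector(window_size=10, min_failures=1)
history = [True] * 9 + [False] + [True] * 10
states = [detector.record(passed) for passed in history]
```

A sliding window would be an equally valid reading of the request; non-overlapping windows were chosen here only because they match the "during next time duration" wording most directly.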

For example, in my case: we have 13 nodes behind a website and sometimes, randomly, 1 of them fails and keeps failing. With the normal logic, Checkly raises one alert when a check randomly hits this one node and closes it again soon after. With the logic above I could set up the following: if I have one or more failures during 10 checks (10 minutes), I would like an alert, which stays open while we continuously have that many failures or more in those 10-minute intervals.
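
To get a feel for the numbers in this example, here is a rough back-of-the-envelope calculation. It assumes the load balancer sends each check to a uniformly random node, independently of earlier checks, which is an assumption and not something stated in the issue; the function name is made up for the sketch.

```python
def p_window_has_failure(nodes: int, bad_nodes: int, checks: int) -> float:
    """Probability that at least one of `checks` checks lands on a bad node.

    Assumes every check hits a uniformly random node, independently of the others.
    """
    p_single_check_ok = (nodes - bad_nodes) / nodes
    return 1 - p_single_check_ok ** checks


# 13 nodes, 1 of them failing, for a few candidate window sizes (in checks).
for checks in (10, 30, 60):
    print(checks, round(p_window_has_failure(13, 1, checks), 3))
```

Under that assumption a 10-check window contains at least one failure only a little more than half of the time (roughly 55%), so the alert could still close between windows; a longer window would keep it open more reliably.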

Describe alternatives you've considered There is no other way to implement this with the current logic. However, since I am sure the algorithm above is not the only one for flapping detection, any other AI, smart or self-adjusting mechanism would be good as well.

DmitryFrolovTri avatar Feb 11 '22 09:02 DmitryFrolovTri

@DmitryFrolovTri thanks for the extensive write-up. I think you are essentially describing SLOs, where there is an error budget for a time period. This is something I will keep in mind.

tnolet avatar Feb 14 '22 09:02 tnolet

If the service behind the LB adds its node ID or GUID to a response header, could you get Checkly to store or act on that to identify the erroneous backend service?

alexnoyes avatar Feb 14 '22 17:02 alexnoyes
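
As a rough illustration of the header idea in the last comment, the sketch below tallies failed checks by backend node. It works on plain data and does not use any Checkly API; the header name `x-backend-node` and the shape of the `results` list are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple

# Hypothetical header name; the real one depends on what the backend emits.
NODE_HEADER = "x-backend-node"


def failures_by_node(results: Iterable[Tuple[Mapping[str, str], bool]]) -> Counter:
    """Count failed checks per node ID taken from the response headers.

    Each item in `results` is (response_headers, check_passed).
    """
    counts: Counter = Counter()
    for headers, passed in results:
        if passed:
            continue
        # Header names are case-insensitive, so normalise before the lookup.
        lowered = {name.lower(): value for name, value in headers.items()}
        counts[lowered.get(NODE_HEADER, "unknown")] += 1
    return counts


results = [
    ({"X-Backend-Node": "node-07"}, False),
    ({"X-Backend-Node": "node-03"}, True),
    ({"x-backend-node": "node-07"}, False),
    ({}, False),  # failure with no node header present
]
worst_node, failure_count = failures_by_node(results).most_common(1)[0]
```

With the sample data above, `worst_node` is `"node-07"` with two failures, which is the kind of hint that would point the support team at the misbehaving backend.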