
Feature request: retry_interval for checks

Open Hydrochoerus opened this issue 8 years ago • 5 comments

I couldn't find out whether this is already possible somehow. I would like to be able to configure a check to run at a different interval depending on the previous check result.

I have a process that collects data from different sources once a day and a monitoring script to check that the data for the day was collected successfully. Because of this I'd like to run the check only once per day normally, but when the check fails I'd like to change the check interval to one hour.

(Off the top of my head, this would affect at least occurrences filtering, if it is done using the old method ("occurrences": "eval: value > :::check.occurrences|60:::") to schedule handling.)
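The behaviour requested above can be sketched roughly as follows. This is only an illustration, not an existing Sensu API: the function `next_interval` and the `retry_interval` parameter are hypothetical names, and the sketch assumes that exit status 0 means OK and anything else counts as a failure.

```python
# Hypothetical sketch: none of these names are existing Sensu settings or APIs.
DAY = 24 * 60 * 60   # normal interval from the example above: once per day
HOUR = 60 * 60       # shortened interval while the check is failing


def next_interval(last_status, interval=DAY, retry_interval=HOUR):
    """Return the number of seconds to wait before the next check run.

    Assumes exit status 0 means OK; any other status is treated as a failure.
    """
    return interval if last_status == 0 else retry_interval
```

With these defaults, a passing check is rescheduled a day later and a failing one an hour later.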

Hydrochoerus avatar Oct 12 '16 14:10 Hydrochoerus

Tangentially related: Nagios has this feature. It has a check_interval and a retry_check_interval. The retry check interval can be shorter, which allows for less frequent checking but more frequent confirmation of failure. While using Sensu and tuning checks to avoid false positives with occurrence filtering, I've often found myself thinking about this feature.

fessyfoo avatar Oct 12 '16 18:10 fessyfoo

Implementation should be easy for standalone checks, but it seems to be difficult for normal checks.

runningman84 avatar Oct 29 '16 21:10 runningman84

Is there any reason not to add this retry logic in the checks themselves? Many of the checks already have this functionality and expose it as arguments. I prefer each check to contain its own retry logic, as it is very use-case specific and tied to the version of the plugin you are using.

majormoses avatar May 13 '17 17:05 majormoses

I do not like a retry in the check itself. The scheduling should be controlled by the Sensu client instead of a specific service check. I do not like long-running checks because you get the output quite late. You do not know whether the check hangs because of a retry or some other problem.

In our setup we only use standalone checks and I would really appreciate a feature like retry_check_interval.

runningman84 avatar May 15 '17 10:05 runningman84

I understand and agree that long-running checks are bad. It makes more sense in the context of standalone checks for sure. For normal checks, this would put extra load on the Sensu server when there are problems, and if there is one thing I want working well during an incident, it's my monitoring server.

I definitely see the value of this feature; it just needs to be considered very carefully. I am now seeing this through a different lens and feel this has nothing to do with retries. I think it is more accurate to say it's an interval for shortening the check interval when a check is in an undesired state (not sure if it should apply to != ok, critical exclusively, or better yet be configurable per check). I think something like recovery_interval would better represent this.
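To make the "configurable per check" idea concrete, here is a minimal sketch. It is purely illustrative: `recovery_interval` is the name suggested above, while `recovery_states` and `effective_interval` are hypothetical names invented for this example, and none of them are existing Sensu attributes. It assumes the usual exit statuses (0 OK, 1 warning, 2 critical, 3 unknown).

```python
# Hypothetical sketch: "recovery_interval" and "recovery_states" are not
# existing Sensu check attributes; they only illustrate the proposal.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def effective_interval(check, last_status):
    """Pick the interval (in seconds) for the next run of a check definition.

    The shorter recovery_interval applies only when the previous status is
    one of the configured recovery_states (by default, every non-OK status).
    """
    states = check.get("recovery_states", (WARNING, CRITICAL, UNKNOWN))
    if last_status in states and "recovery_interval" in check:
        return check["recovery_interval"]
    return check["interval"]


# Example: shorten the interval only while the check is critical.
daily_check = {
    "interval": 86400,
    "recovery_interval": 3600,
    "recovery_states": (CRITICAL,),
}
```

A check that does not define `recovery_interval` keeps its normal interval regardless of status, so the behaviour would be opt-in per check.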

majormoses avatar May 15 '17 14:05 majormoses