watchman
watchman copied to clipboard
Retries confusion
I find "max retries" confusing.
"Max" is confusing because it's actually the highest allowed number of attempts, and N+1 attempts cause the check to fail. Intuitively, if I set "max retries" to 3 for a failing service, I would expect it to ping 3 times, and on the third failure declare it failed. But watchman doesn't fail the check until the 4th failure. (This corresponds to < vs <= on line 124 in pinger.clj.) Maybe the reason for this is that the first attempt isn't a "retry"; only subsequent attempts are.
I think "retries" is also not necessarily the right word. To me, that implies that if the ping fails or times out, watchman would immediately try again, and after N tries it would declare the check failed. But watchman doesn't "retry" until the next time the check would normally be performed.
The net result is that if your interval is 1 minute, and "max retries" is 3, both of which seem reasonable, a completely dead check won't fail for 4 minutes, which seems unreasonable.
I think better behavior would be to change the <= to <, and to make watchman immediately retry failed pings up to N times. At that point the check is declared failed. Subsequently, when the ping interval comes around again for the failing check, the ping would only be done once to see if it's back up or not.
Then if you have a 1 minute check interval and max retries = 3, the check will fail after no more than 3 * timeout.
The current behavior is actually more like "max allowed consecutive failures".
I agree about the confusing terminology. However, given my past experiences setting up alerts I do think it's valuable to have a "max allowed consecutive failures" setting (which is what max retries currently is); it allows one to say "only bug me if the service is down for more than N minutes", which is handy for systems with not-super-tight SLA's.
Yeah, there are two different kinds of failure parameters one might want to control:
- Max consecutive failures -- you only want to know when the service has been down for some amount of time, which is good if you expect it to go down for a minute or two here and there and you don't care about that (as you pointed out).
- Max retries -- you want to know as soon as something's wrong, but like any HTTP/network request there can always be transient failures and immediately retrying the request will keep the noise down.
I'm going to rename this to max_consecutive_failures to clear up the confusion.