raintank-collector
At which point do collectors shut themselves off?
I know that if collectors realize that everything they monitor is unreachable, they stop sending errors, on the reasoning that something is wrong with the collector or its connectivity.
How long does this take? The alerting config, where you define "x errors for y points in a row", depends on it. Around 30s seems reasonable? There's always the risk of low-frequency checks that run every 60 or 120s. Ideally we would wait to see errors for all of them, but that could take too long for the value those checks bring. If the collector waited that long, people who monitor everything every 10s would have to wait 12 steps to cover at least the "collector-shutdown" interval.
The collectors don't have anything implemented at this stage to shut themselves down. When we add it, it will probably just use the raindex as a measure, i.e. if x% of the Alexa top 50 are unreachable, then shut down. In this scenario the shutdown delay would be controlled and would not be more than 30 seconds (3 consecutive failures at a 10-second interval). We can even reduce this, since we can check the Alexa sites every second if desired.
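To make the proposed rule concrete, here is a minimal sketch of "shut down after 3 consecutive rounds in which at least x% of the basket is unreachable". Nothing like this exists in the collector yet; the class name `ShutdownMonitor`, its parameters, and the 50% default are assumptions chosen for illustration only.

```python
from collections import deque


class ShutdownMonitor:
    """Hypothetical helper: tracks basket check rounds for one collector."""

    def __init__(self, threshold_pct=50.0, consecutive=3):
        # threshold_pct is the "x%" from the discussion (assumed value).
        self.threshold_pct = threshold_pct
        self.consecutive = consecutive
        # Keep only the outcome of the last `consecutive` rounds.
        self._recent = deque(maxlen=consecutive)

    def record_round(self, results):
        """results maps site -> True if reachable in this round.

        Returns True when the collector should take itself offline.
        """
        if not results:
            raise ValueError("a round needs at least one site")
        unreachable = sum(1 for ok in results.values() if not ok)
        pct = 100.0 * unreachable / len(results)
        self._recent.append(pct >= self.threshold_pct)
        # Shut down only if every one of the last N rounds was bad.
        return len(self._recent) == self.consecutive and all(self._recent)
```

With one round every 10 seconds and `consecutive=3`, the delay before shutdown is bounded by roughly 30 seconds; checking the basket every second would shrink that bound accordingly.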
OK, so for now @mattttt and I will assume it takes at least 30s for collectors to shut themselves off, so customers should wait at least 30s before alerting, and hopefully this gets implemented before alerting goes live ~ monitorama.
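The arithmetic behind "wait at least 30s" can be sketched as follows: the "y points in a row" in an alert rule should span at least the shutdown delay. The function name `min_consecutive_points` is hypothetical, and the 30s figure is the assumption stated above, not a measured value.

```python
import math


def min_consecutive_points(shutdown_delay_s, check_interval_s):
    """Smallest number of consecutive error points that covers the delay."""
    if shutdown_delay_s <= 0 or check_interval_s <= 0:
        raise ValueError("delay and interval must be positive")
    # Always require at least one point, even for slow checks.
    return max(1, math.ceil(shutdown_delay_s / check_interval_s))
```

For a 10s check this gives 3 points for a 30s delay and 12 points for a 120s delay, while a 60s check already covers a 30s delay with a single point.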
So just to update this ticket with the latest thoughts around raindex, since it hasn't been talked about in a while.
- still think the concept is cool and has validity - even more so than months ago
- raindex on icmp might be the best and most consistent measurement (using the mean value of latency and the overall loss %) as opposed to http.
- raindex is per collector and its initial use case is a mechanism for collectors to go offline or online (later on it could be used as a "trust" rating for a particular collector's measurements or something more sophisticated)
- overall raindex for a collector is tbd but is something like a moving average or 90th percentile of latency and loss across the basket of sites.
- the basket of raindex sites should be 50-100 sites, with as little common infra dependency as possible
- raindex can initially be used as an alert for ops to investigate a collector and potentially disable it, before we decide to further automate the process.
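Since the overall raindex formula is still TBD, the following is only a sketch of the raw inputs the points above mention: mean latency, 90th-percentile latency, and overall loss % for one collector across the ICMP basket. The function names, the input shape (site mapped to a list of round-trip times in ms, with `None` for a lost packet), and the nearest-rank percentile method are all assumptions.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def raindex_inputs(probes):
    """probes maps site -> list of RTTs in ms, None meaning a lost packet."""
    latencies = []
    sent = 0
    lost = 0
    for rtts in probes.values():
        for rtt in rtts:
            sent += 1
            if rtt is None:
                lost += 1
            else:
                latencies.append(rtt)
    if sent == 0:
        raise ValueError("no probes recorded")
    return {
        # Latency stats are None when every packet was lost.
        "mean_latency_ms": mean(latencies) if latencies else None,
        "p90_latency_ms": percentile(latencies, 90) if latencies else None,
        "loss_pct": 100.0 * lost / sent,
    }
```

How these numbers get combined into a single index, and whether they are smoothed with a moving average, is left open here just as it is in the list above. Initially they could simply be graphed and alerted on so that ops can investigate a collector by hand.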
I don't have anything to add.