carbon-relay-ng
                        Feature request: Spool to disk when connection is slow rather than dropping metrics
We have been trying to implement some form of flow control in our metrics pipeline. The pipeline is upstream servers -> carbon-relay-ng on its own host -> carbon-relay-ng on the graphite host -> go-carbon. Sometimes we get quite large metric spikes in the pipeline (e.g. after a network issue is resolved).
It would be great if carbon-relay-ng spooled when the network is slow rather than dropping metrics, since this would result in fewer lost metrics. An alternative would be to spool to disk when the throughput of metrics goes above a configurable value.
I think that is what the 'spool' parameter actually does. I read in the README:
"graphite routes supports a per-route spooling policy. (i.e. in case of an endpoint outage, we can temporarily queue the data up to disk and resume later)"
As I've never used it, I can't confirm it actually works like this, but it should.
Unfortunately it only spools when there is an endpoint outage, not on a slow connection.
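For reference, this is roughly how per-route spooling is switched on today: the spool flag on a route definition plus a spool directory, along the lines of the README examples. The address and route key below are placeholders, and this is a minimal sketch rather than a complete config:

```toml
# Minimal sketch: enable per-route spooling to disk.
# The address and route key are placeholders; spool_dir must be writable.
spool_dir = "/var/spool/carbon-relay-ng"

init = [
    # spool=true queues data to disk for this route while the endpoint is down
    'addRoute sendAllMatch carbon-default  graphite-host:2003 spool=true pickle=false',
]
```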
I'll +1 this request, as I was coming here to ask for this feature after an incident that just happened to me.
In my real-world situation, something about a Graphite host became unhealthy at the OS level, and over an hour's time its CPU workload ramped up to 100% of all available threads. The node was still up and trying to function, but it wouldn't allow SSH, the OOM killer was killing the monitoring agent, etc.
Four separate VMs running carbon-relay-ng and forwarding metrics to this Graphite node reported dropped metrics due to 'slow connection'. They dropped what looks like the full stream of 3 million metric values per minute destined for this node. When the host was forced to reboot, carbon-relay-ng identified the socket as lost and started spooling to disk. After the node was restored, the four VMs unspooled their data to catch up. Everything worked well -- except that the still-established connection was never identified as actually down.
So what would have helped us here is some logic in carbon-relay-ng that notices a destination being slow (or perhaps a discard rate above 90% of its recent volume) and attempts to close the connection and re-establish it. That would have identified the node as down and spooled those metrics to disk.
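To make the proposed behaviour concrete, here is a rough sketch of a drop-rate watchdog. This is not carbon-relay-ng's actual internals; the type, field names, and thresholds are all hypothetical, and the only idea it demonstrates is "if most recent points were discarded, close the socket so the existing outage/spool handling takes over":

```go
// Package watchdog sketches the proposed "treat a very slow destination as down" logic.
package watchdog

import (
	"log"
	"net"
	"sync/atomic"
	"time"
)

// destWatchdog tracks how many points were delivered vs. discarded for one
// destination connection. All names here are hypothetical.
type destWatchdog struct {
	conn    net.Conn
	sent    uint64 // points written to the connection in the current interval
	dropped uint64 // points discarded because the connection was too slow
}

func (w *destWatchdog) recordSent()    { atomic.AddUint64(&w.sent, 1) }
func (w *destWatchdog) recordDropped() { atomic.AddUint64(&w.dropped, 1) }

// run checks the drop ratio once per interval and force-closes the connection
// when it exceeds maxDropRatio (e.g. 0.9). Closing the socket makes the
// destination look like an outage, which is the state that triggers spooling
// to disk and later reconnection.
func (w *destWatchdog) run(interval time.Duration, maxDropRatio float64) {
	for range time.Tick(interval) {
		sent := atomic.SwapUint64(&w.sent, 0)
		dropped := atomic.SwapUint64(&w.dropped, 0)
		total := sent + dropped
		if total == 0 {
			continue
		}
		if ratio := float64(dropped) / float64(total); ratio >= maxDropRatio {
			log.Printf("drop ratio %.2f over last %s, forcing reconnect to trigger spooling", ratio, interval)
			w.conn.Close() // existing outage handling spools and re-establishes
		}
	}
}
```

Whether the trigger is a drop-rate threshold like this or a write timeout on the socket is an implementation detail; the point is that a stalled-but-open TCP connection should eventually be treated the same as a down endpoint.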
Thanks for considering this!
It looks like I could have manually gone into the carbon-relay-ng web UI or CLI and taken the node in question offline, then re-enabled it manually?
But in this case, my PagerDuty alert only came through when the situation was bad enough that the OOM killer had knocked out the monitoring agent, and by then we were already losing metrics. So a code solution is still ideal.
Regards,
+1