
Feature request: Spool to disk when connection is slow rather than dropping metrics

Open rhyss opened this issue 7 years ago • 5 comments

We have been trying to implement some form of flow control in our metrics pipeline. The pipeline is: upstream servers -> carbon-relay-ng on its own host -> carbon-relay-ng on the graphite host -> go-carbon. Sometimes we get quite large metric spikes (e.g. after a network issue is resolved).

It would be great if carbon-relay-ng spooled to disk when the network is slow rather than dropping metrics, since this would mean fewer lost metrics. An alternative would be to spool to disk when the metric throughput rises above a configurable value.
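To make the request concrete, something like the route flags below would work for us. To be clear, `spoolwhenslow` and `slowthreshold` do not exist today; the names and the threshold value are only for illustration:

```
init = [
    # Hypothetical flags, shown only to illustrate the request:
    #   spoolwhenslow  -- spool to disk instead of dropping when the endpoint cannot keep up
    #   slowthreshold  -- datapoints per minute above which incoming traffic is spooled
    'addRoute sendAllMatch graphite  graphite-host:2003 spool=true spoolwhenslow=true slowthreshold=500000',
]
```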

rhyss avatar Jun 14 '18 16:06 rhyss

I think this is what the 'spool' parameter actually does. From the README:

graphite routes support a per-route spooling policy (i.e. in case of an endpoint outage, we can temporarily queue the data up to disk and resume later)

As I have never used it, I can't confirm it actually works like this, but it should.
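For reference, the README's example route definition enables this with `spool=true` (adjust the route name and address for your setup; I'm going from the docs here, not from experience):

```
init = [
    # spool=true: if this endpoint goes down, queue data to disk and
    # replay it once the connection comes back
    'addRoute sendAllMatch carbon-default  your-graphite-server:2003 spool=true pickle=false',
]
```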

daks avatar Jun 18 '18 07:06 daks

Unfortunately it only spools when there is an endpoint outage, not on a slow connection.
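As far as I can tell, the write path behaves roughly like the sketch below (a simplified illustration of the behaviour, not the actual source): each destination has a bounded buffer that drains only as fast as the peer accepts writes, and once a slow connection fills it, new points are dropped; the spool only takes over when the connection is actually considered down.

```go
// Simplified illustration of the behaviour described above --
// not carbon-relay-ng's actual code.
package main

import "fmt"

type destination struct {
	out chan []byte // bounded buffer; drains only as fast as the peer accepts writes
}

// spool stands in for the on-disk queue used during an outage.
func (d *destination) spool(buf []byte) { /* write to the on-disk queue */ }

// dispatch shows why a slow-but-alive connection causes drops: the
// non-blocking send falls through to the drop branch instead of the spool.
func (d *destination) dispatch(buf []byte, connected bool) {
	if !connected {
		d.spool(buf) // outage detected: queue to disk, replay later
		return
	}
	select {
	case d.out <- buf: // fast path: the connection is keeping up
	default:
		fmt.Println("drop: connection too slow, buffer full") // what this issue asks to avoid
	}
}

func main() {
	d := &destination{out: make(chan []byte, 2)}
	d.out <- nil // pretend the peer is slow and the buffer is already full
	d.out <- nil
	d.dispatch([]byte("foo.bar 1 1528991000"), true)  // dropped
	d.dispatch([]byte("foo.bar 1 1528991000"), false) // spooled
}
```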

rhyss avatar Jun 18 '18 07:06 rhyss

I'll +1 this request, as I came here to ask for exactly this feature after an incident that just happened to me.

In my real-world situation, something on a Graphite host became unhealthy at the OS level, and over an hour's time its CPU workload ramped up to 100% of all available threads. The node was still up and trying to function, but it wouldn't allow SSH, the OOM killer was killing the monitoring agent, etc.

Four separate VMs running carbon-relay-ng and forwarding metrics to this Graphite node reported dropped metrics due to a 'slow connection'. They dropped what looks like the full stream of 3 million metric values per minute destined for this node. When the host was forced to reboot, carbon-relay-ng identified the socket as lost and started spooling to disk. After the node was restored, the four VMs unspooled their data to catch up. Everything worked well -- except that the established-but-stalled connection was never identified as actually down.

So what would have helped us here is some logic in carbon-relay-ng that notices a connection being slow (or perhaps a discard rate of 90%+ of its recent volume) and attempts to close the connection and re-establish it. That would have identified the node as down and spooled those metrics to disk.
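Roughly what I have in mind, as a sketch only (the names and numbers are mine, this is not carbon-relay-ng code): track how much of the recent volume was sent versus discarded, and once the discard ratio crosses a threshold, close the socket so the existing outage detection and spooling take over.

```go
// Sketch of a "treat a very slow connection as down" check.
// All names are illustrative -- this is not carbon-relay-ng's code.
package main

import "fmt"

// windowStats counts what happened to the datapoints sent to one
// destination over the most recent interval (say, the last minute).
type windowStats struct {
	sent    int
	dropped int
}

// tooSlow reports whether the destination discarded so much of its recent
// volume (e.g. 90%+) that the connection should be treated like an outage.
func tooSlow(s windowStats, maxDropRatio float64) bool {
	total := s.sent + s.dropped
	if total == 0 {
		return false
	}
	return float64(s.dropped)/float64(total) >= maxDropRatio
}

func main() {
	// Numbers in the spirit of the incident above: nearly the whole
	// 3 million datapoints/minute stream was being discarded.
	s := windowStats{sent: 120000, dropped: 2880000}
	if tooSlow(s, 0.9) {
		// Here the relay would close the socket; the existing outage
		// handling would then mark the node down and start spooling.
		fmt.Println("drop ratio over threshold: close connection and spool")
	}
}
```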

Thanks for considering this!

maxwax avatar Jul 07 '18 20:07 maxwax

It looks like I could have gone into the carbon-relay-ng web UI or CLI, manually offlined the node in question, and then re-enabled it later?

But in this case, my PagerDuty alert only came through once the situation was bad enough that the OOM killer knocked out the monitoring agent, and by then we were already losing metrics. So a code solution is still ideal.

Regards,

maxwax avatar Jul 07 '18 20:07 maxwax

+1

yunstanford avatar Dec 07 '18 23:12 yunstanford