
Rate limit on spooled metrics?

Open vidhu5269 opened this issue 8 years ago • 4 comments

Currently, carbon-relay-ng spools metrics if the destination endpoint is down, but when it replays those metrics it doesn't limit the rate at which they are sent. In one of our staging setups we receive metrics at a steady rate of 40K per minute, but due to a network issue in that data center one night, the relay couldn't talk to one of the carbon cache machines for the whole night.

In the morning, when we got the issue resolved, the number of incoming metrics increased to 1.3M per minute, and after some time this brought the cache down because it couldn't sustain that big a load. The carbon cache is running on a VM and shares its disk with other machines, so there is a limit to how much I/O we can expect.

I looked through the documentation but couldn't find a way to limit the rate of spooled metrics sent to the caches. Please point me to the doc in case such a configuration exists. If not, does it make sense to add this functionality? It would help us control the transition to steady state better and avoid causing further failures just as the original issue is getting resolved.

vidhu5269 avatar Feb 03 '17 00:02 vidhu5269

I think the most elegant solution would be for carbon cache to provide backpressure, i.e. read data from the connection at the pace it can handle. That will slow down the relay writing to the connection.

Dieterbe avatar Feb 03 '17 06:02 Dieterbe

The rate limit has to be applied only to the spooled metrics and not to the regular incoming data. The cache already handles the queue of "processed metrics" to prevent loss due to I/O latency; if it has to handle backpressure as well, that will probably be too much for it.

Also, spooling is a pretty handy feature of carbon-relay-ng, but this "batch push" creates a burst of data that the system may not have been scaled for. We want to spool as much data as possible to cover prolonged failures, but the underlying carbon caches would then need non-linear scaling to absorb this flow. In contrast, a limit at the relay means a long but smooth transition to steady state and does not need the non-linear scaling of the caches.

All my arguments are based on the current carbon cache implementation and not an alternate approach which may or should come in the future. Does it make sense?

vidhu5269 avatar Feb 03 '17 20:02 vidhu5269

Recently (#210) the relay gained several more config options to tune the sleep between reads of metrics off the spool. See https://github.com/graphite-ng/carbon-relay-ng/blob/master/docs/routes.md#carbon-destination for more information. Let me know how it goes.

Dieterbe avatar Sep 13 '17 20:09 Dieterbe

Great! Tuning unspoolsleep is very helpful here. I was setting MAX_CACHE_SIZE on my carbon-cache to limit the rate, but it was flooding and some values were dropped. With unspoolsleep = 50, no metrics are lost :) Thanks much.

hamelg avatar Mar 06 '18 07:03 hamelg