
Guaranteed delivery: a misfeature

Open · nyarly opened this issue 9 years ago · 1 comment

This is going to come off as more confrontational than I intend, but let me argue that there are network activities (logging among them) that you want to let fail.

Let me start with this assumption: you want guaranteed delivery because, when servers are under load, logs help in determining the causes of that load; but at the same time, servers under load produce load on the local network, and log entries get (very confusingly) lost. If this is a non-goal, then I'm on my own in the weeds here.

However, the converse is also true: if logging guarantees delivery, its own share of network load increases, and it contributes to the overall congestion. IMO, in a congestion event you want logging to back off: record locally or drop entries in order to let the actual application ride out the storm.
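That "back off" behavior could look something like the following sketch: a bounded in-process buffer that never blocks the application and sheds entries when full. The class and method names here are hypothetical illustrations, not part of any eventd API.

```python
import queue


class LossyLogShipper:
    """Buffers log entries for network transmission, shedding load
    instead of blocking the application when the buffer is full.
    (Hypothetical sketch; a real shipper would drain `buffer` from
    a background sender thread.)"""

    def __init__(self, capacity=1024):
        self.buffer = queue.Queue(maxsize=capacity)
        self.dropped = 0  # entries shed under congestion

    def emit(self, entry):
        try:
            self.buffer.put_nowait(entry)  # never blocks the caller
            return True
        except queue.Full:
            self.dropped += 1  # congestion: drop, don't push back
            return False
```

The point of the design is that back-pressure from the log fabric never reaches the application: `emit` returns immediately whether or not the entry survived.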

My personal context here is this: we set up an rsyslog regime to centralize the logging of a (terrible) multi-host Java application. Long story short, rsyslog saturated the network and the application itself broke.

In no small part this had to do with logging being overused and poorly filtered (everything was logged at FATAL), but part of the lesson for me was that logging needs to be the first packet against the wall when saturation comes.

Maybe there's a larger issue having to do with how logging is filtered, and a better structured log entry (and metadata on streams) would definitely help with that.

nyarly · Feb 23 '16 19:02

I understand and I do sympathize with your case. I think it's important to say where loss should take place. In your case, where the load of log transmission results in increased load that causes additional issues, I think it's fine for you to have a mechanism in place on that machine to drop logs. That's your business and not something anyone else can dictate. But when you decide "ok, I want these somewhere else", you should have assurance that the logs will actually get there.

It's that latter case, the transmission of the logs from A to B, that this protocol governs. A tool that utilizes this protocol can apply whatever procedures it wants to decide which logs to actually transmit. But using the transmission fabric as a place to throttle and lose logs is not something anyone wants. That decision, to throttle and discard, should be made by a program dealing with logs pre-transmission.
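A minimal sketch of that pre-transmission decision: a local filter chooses which entries to hand to the guaranteed transport, and everything that passes is then delivered reliably. The severity names and threshold policy below are assumptions for illustration, not something defined by the RFC.

```python
# Hypothetical severity ordering; loss happens *before* the
# reliable transport, never inside it.
LEVELS = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}


def pre_transmission_filter(entries, min_level="INFO"):
    """Decide locally which entries are worth guaranteed delivery.

    Under congestion, a host could raise `min_level` to shed load;
    whatever this function returns is handed to the transport with
    full delivery assurance.
    """
    threshold = LEVELS[min_level]
    return [e for e in entries if LEVELS[e["level"]] >= threshold]
```

The split keeps responsibilities clean: the filter is free to be as lossy as local policy demands, while the transport behind it stays lossless.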

evanphx · Apr 07 '16 02:04