swift-nio
swift-nio copied to clipboard
[Feature Proposal] Sampling pipeline tracing functionality
When running NIO applications in production it is possible to encounter rare, probabilistic issues. These issues happen with relatively low frequency and are, with the information available to the application developer, unpredictable. Generally they are the result of unforeseen interactions of components in a fairly complex system, often triggered by unexpected or unusual timing or asynchrony behaviours.
It is very rarely the case that there is sufficiently good logging in user code to diagnose this kind of interaction. In situations like this we tend to advise things like the WritePCAPHandler
from NIOExtras, or a "Log everything handler" to be inserted into channel pipelines to get further diagnostics. This idea is reasonable, but it requires a custom intervention that is not normally present in user code. Additionally, generating these logs can be expensive, and so users are often forced to come up with systems where they only apply these handlers when appropriate, or risk slowing down their systems excessively in order to gather the logs.
NIO can and should do more to help here. In particular, NIO is deeply aware of the ChannelPipeline
and all messages that pass through it. ChannelPipeline
is a first-party class that allows no subclassing, as is ChannelHandler
, and all messages on a channel pipeline must pass through them. This gives us a powerful place to add introspection, if we wanted to do so.
Using those hooks, we could build a "pipeline tracer". This is, conceptually, something that logs out all events that occur to a ChannelPipeline
. A hypothetical output might be:
[pipeline=0xdeadbeef index=1 handlertype=NIOSSLClientHandler] handler added
[pipeline=0xdeadbeef index=2 handlertype=HTTPRequestDecoder] handler added
[pipeline=0xdeadbeef index=3 handlertype=HTTPResponseEncoder] handler added
[pipeline=0xdeadbeef] channel called fireChannelActive
[pipeline=0xdeadbeef index=0 handlertype=HeadChannelHandler] received channelActive
[pipeline=0xdeadbeef index=0 handlertype=HeadChannelHandler] called fireChannelActive
[pipeline=0xdeadbeef index=1 handlertype=NIOSSLClientHandler] received channelActive
[pipeline=0xdeadbeef index=1 handlertype=NIOSSLClientHandler data=ByteBuffer(212 bytes)] called write
...snip...
That is, this would produce a sequential log of events happening on the pipeline from the perspective of the pipeline. Because of the central position of the ChannelPipeline, it is not possible for events to go missing, nor is it possible for us to get confused about the ordering: we will always know exactly what happened and in what order.
This would be a powerful debugging tool. Structured carefully it could even allow programmatic analysis of the data, being able to visually draw out the state of the ChannelPipeline as well as the events that fire. Timestamps can potentially be included as well to give an idea of the delay between certain events. This can be thought of as "logging on steroids": it gives a complete picture of the status of the channel pipeline without requiring that any of the channel handlers participate in the logging process.
We can potentially make this richer by providing hooks that allow users to customise the traced representation of both data types and channel handlers. This would allow users to, for example, include the state of an object at the time the log was emitted. If we wanted to we could go even further and provide a hook on ChannelHandlerContext
to inject entirely additional trace information, allowing tools like the HTTP/2 channel handler to log out state machine changes into the same trace.
The reason this behaviour is not on by default is because it is incredibly expensive. Logging each event like this will drastically increase the cost of pipeline walks, and it would be unbearable to have it on by default. To this end, if we wanted to add this functionality we'd almost certainly want to make it have a sampling functionality.
What I mean by this is that users would (probably on a bootstrap?) configure the "rate" at which they want pipeline tracing to occur. The default value would be 0
(i.e. never) but users could, say, pick a ratio of channels that will have this tracing enabled. By and large we would recommend that users tune this value in production to the highest value that they can tolerate the performance cost of. When the ratio is sufficiently low (e.g. 1/5000) the branch predictor should make most channels pay nothing for this logging, as the branch predictor will correctly conclude that the logging branches will not be taken. The affected channels will see a performance drop associated with this logging, which will be associated with latency and throughput outliers, but as long as the value is tunable I think this is acceptable. Many users will choose to plumb this value through in configuration, meaning that if a deployment is exhibiting unhealthy behaviour they will reload some of their systems with higher sampling rates to try to isolate the issues.
I'm curious as to what the rest of the team, as well as the wider community, think about this idea. Would it be helpful for investigating production incidents?