swift-nio

Provide an EventLoop latency monitoring solution

Open weissi opened this issue 1 year ago • 2 comments

In SwiftNIO, it's super important to have consistently low EL latency. It's also fairly easy to monitor EL latency but it's still a fair amount of finicky code to write that essentially any real prod deployment should have.

NIO should provide an off-the-shelf (but opt-in) solution where it will just report the EL latencies on a certain schedule such that the user can log/ingest into metrics system/...

Frameworks like Vapor/Smoke should probably enable this by default and at the very least log something if we go above (say) 500 ms, because the EL latency is the latency floor a NIO application can achieve. In other words: if you promised (SLI/SLO or just to your boss) that your p999 request latency is less than a second, then you'll struggle to do that if you frequently see 1s+ event loop blockages, because a blocked event loop won't do anything and any request may land on any event loop. So even a basic health check may take many seconds (potentially causing a timeout --> unhealthy) if one or more of the ELs are blocked.


FWIW, as a rough sketch this is what users can do today: use an external system -- say Dispatch -- to schedule a task once per second per event loop and see how fast each event loop runs that task. Something as basic as

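for el in group.makeIterator() {
    // Placeholder: run this closure once per second from outside the event loop (e.g. via Dispatch).
    useDispatchToScheduleSomethingOncePerSecond {
        let tSchedule = DispatchTime.now()
        // `submit` returns a future carrying the closure's result; `execute` returns nothing.
        let tRun = try await el.submit {
            DispatchTime.now()
        }.get()

        // DispatchTime has no `-` operator, so subtract the nanosecond readings.
        let elLatency = tRun.uptimeNanoseconds - tSchedule.uptimeNanoseconds
        reportToMetricsSystem("el-latency", "\(el)", elLatency)
    }
}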

Then use elLatency as a regular metric and observe it, possibly with alerts if it ever goes above some threshold.
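
For anyone who wants to try this, here is a minimal runnable sketch of the same idea, assuming the SwiftNIO package (NIOCore, NIOPosix) is available. `startEventLoopLatencyProbes` and its `report` callback are made-up names for illustration, not NIO API, and the once-per-second scheduling is done with a plain `DispatchSourceTimer`.

```swift
import Dispatch
import NIOCore
import NIOPosix

/// Hypothetical helper (not part of SwiftNIO): starts one Dispatch timer per
/// event loop and reports how long each loop takes to run a trivial task.
func startEventLoopLatencyProbes(
    group: EventLoopGroup,
    interval: DispatchTimeInterval = .seconds(1),
    report: @escaping @Sendable (_ eventLoop: String, _ latencyNanoseconds: UInt64) -> Void
) -> [DispatchSourceTimer] {
    // The timers fire on a Dispatch queue, so they keep firing even when an
    // event loop is blocked.
    let queue = DispatchQueue(label: "el-latency-probe")
    return group.makeIterator().map { el -> DispatchSourceTimer in
        let timer = DispatchSource.makeTimerSource(queue: queue)
        timer.schedule(deadline: .now() + interval, repeating: interval)
        timer.setEventHandler {
            let tSchedule = DispatchTime.now()
            // The submitted closure runs on the event loop; a blocked loop
            // delays it, and that delay is the latency we measure.
            el.submit { DispatchTime.now() }.whenSuccess { tRun in
                report("\(el)", tRun.uptimeNanoseconds - tSchedule.uptimeNanoseconds)
            }
        }
        timer.resume()
        return timer
    }
}
```

Keep the returned timers alive for as long as you want probes, and call `cancel()` on them before shutting the group down. Note that `report` is invoked on the event loop being measured, so it should itself be cheap and non-blocking.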

Also related: An old gist of mine which implements an event loop blockage checker (in a slightly different but equally valid way): https://gist.github.com/weissi/f789b15624d9d956f1f98b37a210ad12

weissi, Apr 24 '23 10:04

Wouldn't it make sense to (optionally) continuously monitor the EL latency? The cost is ~40ns for timestamps plus a few ns to record the sample, so <50ns in practice. I'd pitch our port of Gil Tene's HDR Histogram as a great candidate for keeping track of the samples (but I understand NIO may not be able to take on external dependencies); then one can get full-fidelity latency distributions. This is very cheap, and the results can then be analysed and plotted with e.g. the HDR histogram plotter.
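
To illustrate what "record the sample" could look like without any dependency, here is a rough sketch of a fixed-size histogram with power-of-two buckets. It is a much coarser stand-in, not HDR Histogram and not a port of it; the type `LatencyHistogram` and its methods are hypothetical names. Recording is one array increment, and a percentile query returns the upper bound of the bucket that contains it.

```swift
/// Hypothetical stand-in for a real histogram library: 65 power-of-two buckets.
/// Bucket 0 holds the value 0; bucket b (b >= 1) holds values in [2^(b-1), 2^b).
struct LatencyHistogram {
    private(set) var counts = [UInt64](repeating: 0, count: 65)
    private(set) var total: UInt64 = 0

    /// Records one latency sample, given in nanoseconds.
    mutating func record(nanoseconds: UInt64) {
        // The bit length of the value selects the bucket (0 for the value 0).
        let bucket = UInt64.bitWidth - nanoseconds.leadingZeroBitCount
        counts[bucket] += 1
        total += 1
    }

    /// Upper bound (in ns) of the bucket containing the given percentile,
    /// or nil if nothing has been recorded yet.
    func upperBound(percentile: Double) -> UInt64? {
        guard total > 0 else { return nil }
        // Rank of the sample we are looking for, at least 1.
        let target = max(1, UInt64((percentile / 100 * Double(total)).rounded(.up)))
        var seen: UInt64 = 0
        for (bucket, count) in counts.enumerated() {
            seen += count
            if seen >= target {
                if bucket == 0 { return 0 }
                if bucket == 64 { return UInt64.max }
                return (UInt64(1) << UInt64(bucket)) - 1
            }
        }
        return nil
    }
}
```

The struct is not thread-safe; one instance per event loop, mutated only on that loop, avoids the need for locking. A real histogram library gives far better resolution than these factor-of-two buckets.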

hassila, Apr 24 '23 10:04

We definitely can't take the dependency, but it's a good suggestion for a possible API shape: it should be possible to pass whatever data we do produce into such a system.
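
One way such an API shape might look, purely as a sketch: the monitor hands raw samples to a user-supplied sink, and the user decides whether that sink is a logger, a metrics client, or a histogram. `EventLoopLatencySink` and `MaxLatencySink` are hypothetical names; nothing like them exists in SwiftNIO as far as this thread says.

```swift
/// Hypothetical protocol, not part of SwiftNIO: receives raw latency samples.
protocol EventLoopLatencySink {
    mutating func record(eventLoop: String, latencyNanoseconds: UInt64)
}

/// Example sink that only remembers the worst latency seen per event loop.
struct MaxLatencySink: EventLoopLatencySink {
    private(set) var worst: [String: UInt64] = [:]

    mutating func record(eventLoop: String, latencyNanoseconds: UInt64) {
        worst[eventLoop] = max(worst[eventLoop] ?? 0, latencyNanoseconds)
    }
}
```

A histogram-backed sink would conform in the same way, which keeps the dependency on the user's side.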

Lukasa, Apr 24 '23 10:04