sledge-serverless-framework
Poor Tail Latency in Concurrency Experiment
The concurrency experiment executes the "empty" workload (which calls printf("S\n") and immediately exits) at concurrency levels ranging from 1 to 100. The way hey is invoked produces extreme bursts of requests: 0 -> 1 -> 0 -> 20 -> 0 -> 40 -> 0 -> 60 -> 0 -> 80 -> 0 -> 100. The spec.json defines a single module, so all requests share the same relative deadline; as such, FIFO and EDF should behave similarly.
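For reference, the workload is roughly equivalent to the following sketch (reconstructed from the description above, not copied from the repo):

```c
/* Sketch of the "empty" workload: it writes a single line and returns
 * immediately, so each sandbox completes almost instantly and the runtime
 * ends up switching sandboxes at a very high rate. */
#include <stdio.h>

int main(void)
{
	printf("S\n");
	return 0;
}
```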
During a recent run of this experiment, I witnessed p100 tail latency of up to 20s across all scheduling policies. The logs and charts for this run are as follows:
server_concurrency_gnuplots.zip
Note that based on the "offset" reported by hey, it appears that the worst tail latency occurs in batches of roughly 10ms.
Based on all this, I suspect that (assuming this is reproducible externally) there is a context/sandbox-switching bug. I believe we aren't seeing it in other experiments because this experiment triggers a significantly higher rate of sandbox switches due to:
- "empty" workload executing to completion very quickly
- The bursty request pattern of the driver script.
My suggestion would be to validate and debug this as follows:
- Sanity-check the hey parameters
- Run on CloudLab to see if this relates to my hacky home office setup
- Refactor the experiment to use server-side reporting
- If this is visible with server-side metrics on CloudLab, then there is likely a bug
- I suspect that focusing on the state of the system during the ~10ms windows of reported long-running requests is a good bet. The question is why these requests were unable to propagate through the system. Did a worker spin? Did the sandboxes block and get caught in epoll? Looking at the timing of sandbox state transitions for these sandboxes might help (see the instrumentation sketch after this list).
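As a starting point, here is a rough server-side instrumentation sketch; the enum, function names, and log format are assumptions for illustration, not the repo's actual API:

```c
/* Timestamp every sandbox state transition so a post-run analysis can show
 * where the stalled requests spent their ~10ms (runnable but unscheduled,
 * blocked in epoll, stuck behind a spinning worker, etc.).
 * All names here are hypothetical. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

enum sandbox_state { SANDBOX_RUNNABLE, SANDBOX_RUNNING, SANDBOX_BLOCKED, SANDBOX_COMPLETE };

static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Call on every transition; comparing consecutive timestamps per sandbox
 * reveals which state a request was stuck in during the missing window. */
static void sandbox_log_transition(uint64_t sandbox_id, enum sandbox_state new_state)
{
	fprintf(stderr, "sandbox=%" PRIu64 " state=%d t=%" PRIu64 "ns\n",
	        sandbox_id, (int)new_state, now_ns());
}
```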
Possibly relevant issues:
- #224
- #219
- #66