sledge-serverless-framework

Poor Tail Latency in Concurrency Experiment

Open bushidocodes opened this issue 4 years ago • 0 comments

The concurrency experiment runs the "empty" workload (which calls printf("S\n") and immediately exits) at concurrency levels ranging from 1 to 100. The way the driver script invokes hey produces extreme bursts of requests, with concurrency going 0 -> 1 -> 0 -> 20 -> 0 -> 40 -> 0 -> 60 -> 0 -> 80 -> 0 -> 100. The spec.json defines a single module, so all requests have the same relative deadline. As such, FIFO and EDF should behave similarly.
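To make the burst pattern concrete, here is a minimal sketch of how the sequence of hey invocations could be generated. It is not the actual driver script: the URL, the requests-per-client count, and the function names are assumptions for illustration. Only the concurrency levels come from the description above, and -n, -c, and -o csv are standard hey flags.

```python
def burst_levels(step=20, maximum=100):
    """Concurrency levels from the issue: 1, then multiples of `step` up to `maximum`."""
    return [1] + list(range(step, maximum + 1, step))


def hey_commands(url, requests_per_client=100):
    """Build one hey command line per burst (hypothetical request counts).

    Each burst runs to completion before the next starts, so the server
    sees load drop to 0 between bursts.
    """
    commands = []
    for concurrency in burst_levels():
        total_requests = concurrency * requests_per_client  # hey requires -n >= -c
        commands.append([
            "hey",
            "-n", str(total_requests),
            "-c", str(concurrency),
            "-o", "csv",  # per-request CSV output
            url,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical endpoint; substitute the port defined in spec.json.
    for command in hey_commands("http://localhost:10000"):
        print(" ".join(command))
```

Running the bursts back to back like this means each step goes from idle straight to full concurrency, which is the pattern suspected of stressing sandbox switching.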

During a recent run of this experiment, I saw p100 tail latency of up to 20s under all scheduling policies. The logs and charts for this run are as follows:

server_concurrency_gnuplots.zip

Note that based on the "offset" reported by hey, it appears that the worst tail latency occurs in batches of roughly 10ms.
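One way to check this clustering is to bucket the slow requests by their offset. The sketch below assumes hey's CSV output (-o csv) with "response-time" and "offset" columns, both in seconds; the 1s slowness threshold and the function name are arbitrary choices for illustration.

```python
import csv
import io
from collections import Counter


def slow_request_buckets(csv_text, threshold_s=1.0, bucket_s=0.010):
    """Count slow requests per offset bucket.

    Assumes CSV columns named "response-time" and "offset", in seconds.
    Returns {bucket_index: count}, where bucket i covers offsets in
    [i * bucket_s, (i + 1) * bucket_s).
    """
    buckets = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["response-time"]) >= threshold_s:
            buckets[int(float(row["offset"]) / bucket_s)] += 1
    return dict(buckets)


if __name__ == "__main__":
    # Made-up sample rows, not data from the actual run.
    sample = (
        "response-time,offset\n"
        "0.002,0.0011\n"
        "19.5,0.0123\n"
        "20.1,0.0157\n"
        "0.003,0.0301\n"
        "18.7,0.0452\n"
    )
    print(slow_request_buckets(sample))
```

If the slow requests really arrive in 10ms batches, most of the counts should land in a small number of buckets rather than being spread evenly across the run.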

Based on all this, and assuming the behavior is reproducible outside my setup, I suspect there is a context/sandbox switching bug. I believe we aren't seeing it in other experiments because this experiment triggers a significantly higher rate of sandbox switches due to:

  • "empty" workload executing to completion very quickly
  • The bursty request pattern of the driver script.

I suggest validating and debugging this as follows:

  1. Sanity-check the hey parameters
  2. Run on CloudLab to see if this relates to my hacky home office setup
  3. Refactor the experiment to use server-side reporting
  4. If this is visible with server-side metrics on CloudLab, then there is likely a bug
  5. I suspect that focusing on the state of the system during the 10ms windows in which the long-running requests were reported is a good bet. The question is why these requests were unable to propagate through the system. Did a worker spin? Did the sandboxes block and get caught in epoll? Looking at the time spent in each sandbox state for these sandboxes might help.
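For steps 3 and 4, tail latency would need to be computed from durations recorded by the server rather than by hey. A minimal sketch, assuming the server can emit a list of per-request durations in seconds (the reporting format is hypothetical), using the nearest-rank percentile definition:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up durations: 99 fast requests and one 20s outlier.
    durations = [0.002] * 99 + [20.0]
    for p in (50, 99, 100):
        print(f"p{p}: {percentile(durations, p)}s")
```

With this definition p100 is simply the maximum, so a single stuck request is enough to produce a 20s p100 even when p99 looks healthy. Comparing p99 against p100 on the server side would show whether the problem is a handful of stragglers or something broader.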

Possibly relevant issues:

  • #224
  • #219
  • #66

bushidocodes · Jun 14 '21 15:06