Remove podman overhead when rapidly recycling runners

Open bduffany opened this issue 9 months ago • 1 comments

(Context: https://buildbuddy-corp.slack.com/archives/C057TAUAQ7P/p1716579465393109?thread_ts=1716243012.902439&cid=C057TAUAQ7P)

When runner recycling is enabled, we currently have significant overhead introduced by podman due to running podman unpause, podman exec, and podman pause for each execution. pause and unpause each add about 20ms of overhead, and exec adds 60-100ms of overhead.

This PR adds 2 optimizations to eliminate the overhead from these podman commands, particularly in the case where the same runner is being reused at a high frequency:

  • Delayed pause: instead of immediately pausing after execution is done, wait until the runner has idled for a few seconds (the idle duration is configurable via flag). This avoids unnecessary pause/unpause cycles in the case where several tasks are queued up and are eligible to reuse the same runner. In practice, runners likely aren't using significant CPU while idle, so we'll likely be burning very few CPU cycles by keeping the runners alive while they are not assigned a task. However, a future improvement could be to base the pause delay on how much CPU the runner is trying to use while idle (well-behaved runners should be using ~0 CPU while idle, and so they don't really need to be paused at all).
  • Exec server: instead of using podman exec, run a gRPC server inside the container and send commands to it over a socket (we can use the vmexec server implementation which we already use to send commands to Firecracker VMs).

Results:

  • Benchmark: sequential NOP actions: With sequential NOP tasks run locally with recycle-runner=true, this change increases throughput from 7.5 actions/s to 79.3 actions/s (roughly a 10x increase).
  • Benchmark: build //server: When building //server with recycle-runner=true set on all actions and the cache bypassed, this change reduces the average build time (N=24) from 62s to 53s (~15% reduction) and the p90 from 63s to 54s. In practice, not all actions set recycle-runner=true, but some customers rely on it heavily, so we should be able to execute these actions at a higher rate, reducing the need for autoscaling and burning less CPU on podman commands.

Related issues: N/A

bduffany, May 24 '24 19:05