buildbuddy
Remove podman overhead when rapidly recycling runners
(Context: https://buildbuddy-corp.slack.com/archives/C057TAUAQ7P/p1716579465393109?thread_ts=1716243012.902439&cid=C057TAUAQ7P)
When runner recycling is enabled, we currently have significant overhead introduced by podman due to running `podman unpause`, `podman exec`, and `podman pause` for each execution. `pause` and `unpause` each add about 20ms of overhead, and `exec` adds 60-100ms of overhead.
This PR adds 2 optimizations to eliminate the overhead from these podman commands, particularly in the case where the same runner is being reused at a high frequency:
- Delayed pause: instead of immediately pausing after execution is done, wait until the runner has idled for a few seconds (the idle duration is configurable via flag). This avoids unnecessary pause/unpause cycles in the case where several tasks are queued up and are eligible to reuse the same runner. In practice, runners likely aren't using significant CPU while idle, so we'll likely be burning very few CPU cycles by keeping the runners alive while they are not assigned a task. However, a future improvement could be to base the pause delay on how much CPU the runner is trying to use while idle (well-behaved runners should be using ~0 CPU while idle, and so they don't really need to be paused at all).
- Exec server: instead of using `podman exec`, run a gRPC server inside the container and send commands to it over a socket (we can use the `vmexec` server implementation which we already use to send commands to Firecracker VMs).
Results:
- Benchmark: sequential NOP actions: With sequential NOP tasks run locally with `recycle-runner=true`, this optimization increases throughput from 7.5 actions/s => 79.3 actions/s (about 10X increase in throughput).
- Benchmark: build `//server`: When building `//server` with `recycle-runner=true` set on all actions and bypassing cache, this reduces average (N=24) build time from 62s to 53s (~15% reduction) and p90 from 63s to 54s. In practice, not all actions have `recycle-runner=true`, but some customers rely on it heavily, so we should be able to execute these actions at a higher rate, reducing the need for autoscaling and burning less CPU on `podman` commands.
Related issues: N/A