bazel-buildfarm icon indicating copy to clipboard operation
bazel-buildfarm copied to clipboard

support global grpc deadlines

Open luxe opened this issue 5 years ago • 0 comments

If desired, a global grpc deadline can now be enforced at service endpoints.
This was implemented as a short term mitigation for spontaneously hanging operations. We've added these configurations to the Execute service. They could be extended to other endpoints as needed.

Without these mitigation in place, buildfarm executions can hang indefinitely for operations and leave connections open causing bazel to also hang:

../../forks/bazel/bazel-bin/src/bazel-dev test //code/tools/bad_tests/bash_infinite_loop:main --remote_executor=grpc://localhost:8980 --remote_retries=0
Starting local Bazel server and connecting to it...
INFO: Invocation ID: 50da3628-0394-4855-a58d-4f4af49b1cee
INFO: Analyzed target //code/tools/bad_tests/bash_infinite_loop:main (28 packages loaded, 3257 targets configured).
INFO: Found 1 test target...
[4 / 5] Testing //code/tools/bad_tests/bash_infinite_loop:main; 99999999s remote

Resolving these hangs properly in buildfarm's implementation will require more fixes to the lifetime of an operation, and improving our ability to guarantee its expiration and termination back to the client. These timeouts allow us to avoid this correctness complexity and enforce termination at the service API level.

global grpc timeouts can be configured as followed:

execute_timeout: {
  enforce: true
  timeout: {
    seconds: 10
    nanos: 0
  }
}

wait_execute_timeout: {
  enforce: true
  timeout: {
    seconds: 10
    nanos: 0
  }
}

The result, is a clear DEADLINE_EXCEEDED message back to bazel:
DEADLINE_EXCEEDED: The grpc endpoint 'execute' has timed out because buildfarm enforces a deadline of 10s

../../forks/bazel/bazel-bin/src/bazel-dev test //code/tools/bad_tests/bash_infinite_loop:main --remote_executor=grpc://localhost:8980 --remote_retries=0
Starting local Bazel server and connecting to it...
INFO: Invocation ID: f03501ac-bba7-4587-90f2-9e912b54ef33
INFO: Analyzed target //code/tools/bad_tests/bash_infinite_loop:main (28 packages loaded, 3257 targets configured).
INFO: Found 1 test target...
ERROR: /home/laptop/Desktop/unilang/source/code/tools/bad_tests/bash_infinite_loop/BUILD:5:8:  failed (Exit 34): DEADLINE_EXCEEDED: The grpc endpoint 'execute' has timed out because buildfarm enforces a deadline of 10s. Note: Remote connection/protocol failed with: execution failed DEADLINE_EXCEEDED: The grpc endpoint 'execute' has timed out because buildfarm enforces a deadline of 10s
Target //code/tools/bad_tests/bash_infinite_loop:main up-to-date:
  bazel-bin/code/tools/bad_tests/bash_infinite_loop/main
INFO: Elapsed time: 16.523s, Critical Path: 10.77s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
//code/tools/bad_tests/bash_infinite_loop:main                        NO STATUS

FAILED: Build did NOT complete successfully

Note the difference in grpc endpoint when using --remote_retries. In this particular case, execute will fail and bazel will begin using "wait execution": DEADLINE_EXCEEDED: The grpc endpoint 'wait execution' has timed out because buildfarm enforces a deadline of 10s

../../forks/bazel/bazel-bin/src/bazel-dev test //code/tools/bad_tests/bash_infinite_loop:main --remote_executor=grpc://localhost:8980 --remote_retries=1                                      34
INFO: Invocation ID: 44b8bd6f-51a7-423c-922a-799fcfca996d
INFO: Analyzed target //code/tools/bad_tests/bash_infinite_loop:main (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
ERROR: /home/laptop/Desktop/unilang/source/code/tools/bad_tests/bash_infinite_loop/BUILD:5:8:  failed (Exit 34): DEADLINE_EXCEEDED: The grpc endpoint 'wait execution' has timed out because buildfarm enforces a deadline of 10s. Note: Remote connection/protocol failed with: execution failed DEADLINE_EXCEEDED: The grpc endpoint 'wait execution' has timed out because buildfarm enforces a deadline of 10s
Target //code/tools/bad_tests/bash_infinite_loop:main up-to-date:
  bazel-bin/code/tools/bad_tests/bash_infinite_loop/main
INFO: Elapsed time: 20.518s, Critical Path: 20.14s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
//code/tools/bad_tests/bash_infinite_loop:main                        NO STATUS

FAILED: Build did NOT complete successfully

luxe avatar Oct 10 '20 22:10 luxe