BES upload failures override build status exit codes, making build success/failure detection unreliable
Description
When a BES (Build Event Service) upload fails, Bazel changes its exit code to PERSISTENT_BUILD_EVENT_SERVICE_UPLOAD_ERROR regardless of whether the original build succeeded or failed. This behavior makes it impossible to programmatically determine the actual build status when a BES upload issue occurs.
Impact
This issue impacts CI/CD pipelines and tools that need to reliably determine if a build passed or failed. Currently, when a BES upload fails, these systems cannot distinguish between:
- A successful build with a BES upload failure
- A failed build with a BES upload failure
Expected Behavior
Bazel should expose a flag that lets the user differentiate between these two cases. Currently, a CI system would need to consume the BES stream to make that distinction, which is a high bar. Even then, the only BES property that might help is BuildFinished['overall_success'], which is considered deprecated; see https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/buildeventstream/proto/build_event_stream.proto#L885
Previous Workaround
Prior to the removal of --bes_upload_mode=best_effort, this issue could be worked around by setting that flag, which would cause Bazel to report BES upload failures but still exit with the original build status code. Without this option, there is no reliable way to distinguish build failures from upload failures.
Additional Context
The underlying issue is that PERSISTENT_BUILD_EVENT_SERVICE_UPLOAD_ERROR takes precedence over BUILD_OR_PARSING_FAILURE in the exit code determination, whereas our systems had assumed the opposite precedence.
Possible Solutions
- Restore the --bes_upload_mode=best_effort option
- Add a new flag targeted at CI systems which alters exit code precedence, making exit code 1 take precedence over all others. Then any non-1 exit code could be treated as a passing build
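To illustrate the second option, here is a rough sketch of how a CI wrapper might interpret Bazel's exit code with and without such a flag. Everything in it is hypothetical: the function name, the `build_failure_takes_precedence` parameter (standing in for the proposed flag, which does not exist today), and the numeric value I assume for PERSISTENT_BUILD_EVENT_SERVICE_UPLOAD_ERROR, which should be checked against the Bazel version in use.

```python
from enum import Enum

# Assumed numeric values; verify against the exit codes of your Bazel version.
BUILD_FAILURE = 1       # BUILD_OR_PARSING_FAILURE
BES_UPLOAD_ERROR = 45   # assumed value of PERSISTENT_BUILD_EVENT_SERVICE_UPLOAD_ERROR


class Verdict(Enum):
    PASSED = "passed"
    FAILED = "failed"
    UNKNOWN = "unknown"


def classify(exit_code: int, build_failure_takes_precedence: bool) -> Verdict:
    """Map a Bazel exit code to a CI verdict.

    `build_failure_takes_precedence` models the hypothetical flag proposed
    above; it is not an existing Bazel option.
    """
    if exit_code == 0:
        return Verdict.PASSED
    if exit_code == BUILD_FAILURE:
        return Verdict.FAILED
    if exit_code == BES_UPLOAD_ERROR:
        # With the proposed precedence, a failed build would have exited
        # with 1, so an upload error implies the build itself was fine.
        # Without it, the upload error masks the real status.
        if build_failure_takes_precedence:
            return Verdict.PASSED
        return Verdict.UNKNOWN
    # Any other code is treated conservatively as a failure in this sketch.
    return Verdict.FAILED
```

With today's behavior, `classify(45, False)` can only return `Verdict.UNKNOWN`, which is exactly the ambiguity described under Impact. This sketch is more conservative than the proposal as worded, since it only treats the BES upload error as passing rather than every non-1 exit code.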
This would be really useful for us. We've just had to remove a build analytics system, which provided useful information on any CI build performance regressions, because it caused CI builds to be marked as failed when there were no build issues.