BES upload failures override build status exit codes, making build success/failure detection unreliable

Open • mrmeku opened this issue 8 months ago • 1 comment

Description

When a BES (Build Event Service) upload fails, Bazel changes its exit code to PERSISTENT_BUILD_EVENT_SERVICE_UPLOAD_ERROR regardless of whether the original build succeeded or failed. This behavior makes it impossible to programmatically determine the actual build status when a BES upload issue occurs.

Impact

This issue impacts CI/CD pipelines and tools that need to reliably determine if a build passed or failed. Currently, when a BES upload fails, these systems cannot distinguish between:

  • A successful build with a BES upload failure
  • A failed build with a BES upload failure
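To make the ambiguity concrete, here is a minimal sketch of what a CI wrapper can conclude from the exit code alone. The numeric value 45 for PERSISTENT_BUILD_EVENT_SERVICE_UPLOAD_ERROR is an assumption to verify against the Bazel version in use, and `Verdict` and `classify` are hypothetical names, not part of Bazel.

```python
from enum import Enum

# Assumption: numeric value of PERSISTENT_BUILD_EVENT_SERVICE_UPLOAD_ERROR.
# Check it against the exit code list of your Bazel version.
BES_UPLOAD_ERROR = 45


class Verdict(Enum):
    PASSED = "passed"
    FAILED = "failed"
    UNKNOWN = "unknown"  # the build outcome was masked by the upload error


def classify(exit_code: int, bes_upload_error: int = BES_UPLOAD_ERROR) -> Verdict:
    """Map a Bazel exit code to the most a CI system can infer from it."""
    if exit_code == 0:
        return Verdict.PASSED
    if exit_code == bes_upload_error:
        # Both "build passed, upload failed" and "build failed, upload failed"
        # land here, so the real build status cannot be recovered.
        return Verdict.UNKNOWN
    return Verdict.FAILED
```

Both cases in the list above collapse into `Verdict.UNKNOWN`, which is the core of the problem.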

Expected Behavior

Bazel should expose a flag that lets the user differentiate between these two cases. Currently, a CI system would need to consume the BES to make that distinction, which is a high bar. Even then, the only BES property that might help differentiate between exit codes is BuildFinished['overall_success'], which is marked as deprecated; see https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/buildeventstream/proto/build_event_stream.proto#L885
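For illustration, a rough sketch of what consuming the event stream for this purpose might look like, assuming the events were also written locally as newline-delimited JSON (for example via --build_event_json_file). The JSON key names `finished` and `overallSuccess` are assumptions derived from the proto field names, and `overall_success` is a hypothetical helper.

```python
import json
from typing import Iterable, Optional


def overall_success(lines: Iterable[str]) -> Optional[bool]:
    """Return the (deprecated) overall_success value from the BuildFinished
    event, or None if no such event appears in the stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Assumption: the BuildFinished payload is serialized under "finished".
        finished = event.get("finished")
        if finished is not None:
            # proto3 JSON omits default values, so a missing key means False.
            return bool(finished.get("overallSuccess", False))
    return None
```

A CI script could call it as `overall_success(open("bep.json"))`, where `bep.json` is a placeholder path. Even this small amount of parsing shows why depending on the event stream, and on a deprecated field, is a high bar compared with checking an exit code.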

Previous Workaround

Prior to the removal of --bes_upload_mode=best_effort, this issue could be worked around by setting that flag, which would cause Bazel to report BES upload failures but still exit with the original build status code. Without this option, there is no reliable way to distinguish build failures from upload failures.

Additional Context

The underlying issue is that PERSISTENT_BUILD_EVENT_SERVICE_UPLOAD_ERROR takes precedence over BUILD_OR_PARSING_FAILURE in the exit code determination, whereas our systems had assumed the opposite precedence.

Possible Solutions

  1. Restore the --bes_upload_mode=best_effort option
  2. Add a new flag aimed at CI systems that alters exit code precedence so that exit code 1 takes precedence over all others. Any exit code other than 1 could then be treated as a passing build.
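The second option can be modeled roughly as follows. This is only a sketch of the proposed precedence rule, not Bazel's actual exit code logic; no such flag exists today, and `ci_exit_code` and `build_failed` are hypothetical names.

```python
from typing import Iterable

BUILD_FAILURE = 1  # exit code 1, as referenced in the proposal


def ci_exit_code(codes: Iterable[int], prefer: int = BUILD_FAILURE) -> int:
    """Combine the exit codes raised during one invocation, letting `prefer`
    win over every other non-zero code."""
    nonzero = [c for c in codes if c != 0]
    if not nonzero:
        return 0
    if prefer in nonzero:
        return prefer
    # Otherwise report the first non-zero code that was raised.
    return nonzero[0]


def build_failed(exit_code: int) -> bool:
    """Under the proposed precedence, only exit code 1 signals a failed build."""
    return exit_code == BUILD_FAILURE
```

With this rule, a failed build whose BES upload also failed would still exit with 1, while a successful build with a failed upload would exit with the upload error code, and a CI system could choose to treat the latter as passing.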

mrmeku, Apr 02 '25 19:04

This would be really useful for us. We've just had to remove a build analytics system, which provided useful information on any CI build performance regressions, because it caused CI builds to be marked as failed when there were no build issues.

alsutton, Dec 08 '25 12:12