bazel icon indicating copy to clipboard operation
bazel copied to clipboard

`bes_upload_mode=fully_async` can lead to incomplete profile uploads

Open iamricard opened this issue 1 year ago • 7 comments

Description of the bug:

When streaming BEP with fully_async as the upload mode, the uploaded profiles can sometimes be incomplete. This is more likely to happen if the profile upload is slow (for any reason).

We think this happens because:

  • In BuildEventServiceModule#beforeCommand we see a call to waitForPreviousInvocation.
  • Waiting for the previous invocation includes waiting for BES and file uploads to finish.
  • But the profiler starts before the wait which leads to bazel to create an new output stream with append=false on command.profile.gz.

Which category does this issue belong to?

Core, Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Apologies ahead of time because the code is kind of a monster, but it was hard to write a test case for this. I wrote a test case for this bug in https://github.com/iamricard/bazel/commit/f16b0119f53296590d01cfe79038e6b615e789b7#diff-b9f568e8bd3518443c59fae8aae1ebdbeba1662bb8157f23dc70c77ffe16dd65R894-R937.

Which operating system are you running Bazel on?

MacOS, Linux

What is the output of bazel info release?

8.0.0-pre.20240730.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

[email protected]:iamricard/bazel.git
f16b0119f53296590d01cfe79038e6b615e789b7

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

  • This can probably be mitigated by users who use bazelisk's tools/bazel if they generate a unique profile name for every invocation.
  • It's probably not an issue for users using --experimental_stream_log_file_uploads.

iamricard avatar Aug 12 '24 12:08 iamricard

I dug through the code a bit and it looks to me like a FileOutputStream with append=false is created for the file command.profile.gz (assuming no specific profile file name is set via --profile). So indeed it looks like the existing file is used to overwrite existing data, without deleting the file before (or writing it to a "unique" path). I wonder whether the problem would at least be alleviated by first deleting the old file, and then creating it anew.

See https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/runtime/BlazeRuntime.java;l=334;drc=914db36648ef734b9b534d2a37907b9505534399 which ends up in https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/vfs/AbstractFileSystem.java;l=137;drc=914db36648ef734b9b534d2a37907b9505534399

saraadams avatar Aug 19 '24 07:08 saraadams

It seems like it could be solved by https://github.com/bazelbuild/bazel/issues/18006.

iamricard avatar Aug 22 '24 14:08 iamricard

@meisterT Adding you for awareness - do one of the two options to potentially address this problem sound good to you?

saraadams avatar Aug 22 '24 17:08 saraadams

Unique profiles, keeping the last N of them (configurable), and symlinking the last one as command.profile.gz sounds reasonable to me.

Side note: should we flip the default of --experimental_stream_log_file_uploads to true?

meisterT avatar Aug 27 '24 08:08 meisterT

Unique profiles, keeping the last N of them (configurable), and symlinking the last one as command.profile.gz sounds reasonable to me.

Do you think work on this might started/completed soon, or do you have a rough timeline?

saraadams avatar Aug 27 '24 09:08 saraadams

#18006 is currently unassigned and marked as help wanted. Let me know if you want to take a stab at it!

meisterT avatar Aug 27 '24 11:08 meisterT

Side note: should we flip the default of --experimental_stream_log_file_uploads to true?

Unfortunately, there's no public protocol for streaming uploads and consequently no implementation of this flag in Bazel.

benjaminp avatar Aug 27 '24 21:08 benjaminp