bazel icon indicating copy to clipboard operation
bazel copied to clipboard

`bazel build //X:all` records target completion events in the BEP stream with `file://` URIs, but sometimes doesn't write files to disk

Open froydnj opened this issue 1 year ago • 19 comments

Description of the bug:

We use --build_event_json_file to watch for target completion events and run some post-processing on the produced files in tandem with Bazel running. With Bazel 6.3.2, this worked fine, though we did run into #17798, and therefore had to add some polling to wait for the files being written to disk. This polling has a built-in (per-target) timeout of 3 minutes, which we ~never see hit (post-processing generally completes within seconds).

When attempting to upgrade to Bazel 7.0.0, we are seeing this timeout regularly getting hit, despite remote cache hits and an otherwise quiescent system. That is, we are seeing target completion events in the BEP stream, but for a small number of those targets, the relevant files never appear on disk (at least within the 3-minute timeout; bazel build finishes within ~40s, so the machine is not particularly overloaded). There does not seem to be any commonality between the files that are not present between different bazel build runs; a given target might not be downloaded in one run and be downloaded in another. This behavior occurs with both fully-cached and mostly-cached builds.

The same behavior occurs when building with //X:....

We do occasionally see successful builds with Bazel 7, but they are rare (~1%, maybe less). It is possible, of course, that raising the timeout would work, but given that we ~never hit the timeout with Bazel 6 and ~always hit the timeout with Bazel 7, increasing the timeout would represent a significant performance hit.

We have checked the BEP streams between Bazel 6 and Bazel 7 (fully cached build for both) and there are only trivial differences. We took GRPC logs from our remote cache (fully cached build for both) and again there are only trivial differences.

We are using --remote_download_toplevel with both versions. This issue happens with remote caching enabled and remote execution disabled, which is our CI configuration (we know...); we have a separate configuration with remote caching and remote execution enabled that does not appear to exhibit the problem.

We do not see this behavior if we manually expand //X:all on the command line, i.e. listing out all the relevant targets that //X:all would have built.

cc @sushain97 @ngroman @xieyuxi-stripe @coeuvre

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

We haven't been able to come up with a simple example that reproduces this outside of our codebase.

Which operating system are you running Bazel on?

x86-64 Linux

What is the output of bazel info release?

release 7.0.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

Yes, this is a regression. We have not tried to isolate the relevant commit(s).

Have you found anything relevant by searching the web?

I have looked through the last several months of Github issues, but do not see anything relevant.

bazel-discuss does not appear to have anything related.

Any other information, logs, or outputs that you want to share?

No response

froydnj avatar Jan 03 '24 21:01 froydnj

I can't reproduce this issue with simple genrule example neither. Maybe this requires some complexities in the rules (e.g. macro, alias, etc.) to trigger. cc @tjgq

coeuvre avatar Jan 04 '24 04:01 coeuvre

This is a bit of a shot in the dark, but does setting --noexperimental_merged_skyframe_analysis_execution make the issue go away?

tjgq avatar Jan 04 '24 12:01 tjgq

This is a bit of a shot in the dark, but does setting --noexperimental_merged_skyframe_analysis_execution make the issue go away?

That's a pretty good shot in the dark! It does indeed make the issue go away (multiple successful builds with Bazel 7, with no failures).

froydnj avatar Jan 04 '24 17:01 froydnj

@bazel-io fork 7.0.1

meteorcloudy avatar Jan 08 '24 15:01 meteorcloudy

I added this issue as a soft release blocker for 7.0.1, since the fix is still unclear and there is a workaround. We can maybe have a 7.0.2 if this doesn't get fix in time.

meteorcloudy avatar Jan 08 '24 15:01 meteorcloudy

@bazel-io fork 7.1.0

iancha1992 avatar Jan 09 '24 18:01 iancha1992

@froydnj We looked into this issue today but still cannot reproduce it. Can you try to work out a minimal repro, or provide a full set of flags you are using?

On the other hand, were you building incrementally or cleanly?

coeuvre avatar Jan 11 '24 13:01 coeuvre

ChromeOS Bazel build might be seeing this issue. I recently added a postprocessor that consumes BEP JSON and archives relevant log files generated in bazel-out into a tarball, and observed that it sometimes fails to find output files mentioned in BEP.

In our case, skymeld is enabled, remote caching is enabled, we always start build from a clean output directory.

  • Revert CL: https://chromium-review.googlesource.com/c/chromiumos/chromite/+/5414033
  • Processor code: https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/bazel/portage/tools/process_artifacts/src/archive_logs.rs
  • Example failures (visible to Googlers only):
    • https://luci-milo.appspot.com/ui/p/chromeos/builders/cq/amd64-generic-bazel-lite-cq/b8751782561084881873/infra
    • https://luci-milo.appspot.com/ui/p/chromeos/builders/cq/amd64-generic-bazel-lite-cq/b8751753277727161265/infra
    • https://luci-milo.appspot.com/ui/p/chromeos/builders/cq/amd64-generic-bazel-lite-cq/b8751760071954968657/infra
    • https://luci-milo.appspot.com/ui/p/chromeos/builders/cq/amd64-generic-bazel-lite-cq/b8751764328730997585/infra
    • ...

nya3jp avatar Apr 03 '24 07:04 nya3jp

@nya3jp does this always happen for *_host_deps.profile.json and *_host_deps.json? If so, is there something special about how these files are generated? Maybe from an aspect, a top-level aspect, or some other commonality?

lberki avatar Apr 04 '24 08:04 lberki

The *_host_deps.profile.json and *_host_deps.json are standard output files from the build_packages action. What is special is we are using our collect_logs aspect and attaching it to the top level targets via the --aspects flag.

ismell avatar Apr 04 '24 15:04 ismell

That might actually be relevant! The code path to fetch the outputs of aspects is slightly different depending on the value of --experimental_merged_skyframe_analysis_execution.

@joeleba mind taking a look at this to see if anything stands out? In particular, I'm worried that ConcurrentPathTrie might have a race condition somewhere.

Alternatively, I could imagine that if the analysis of the aspect that promotes an artifact to a top-level artifact finishes after the artifact is built, the artifact will not get downloaded.

lberki avatar Apr 04 '24 15:04 lberki

The second alternative it is. Here is a reproduction. As expected, it requires --experimental_merged_skyframe_analysis_execution and --remote_download_outputs=toplevel .

cat > BUILD <<'EOF'
filegroup(name = "top", srcs = [":actual", "//slow"])
genrule(name = "proxy",  srcs = [":actual"], outs = ["proxy_file"], cmd = "cp $< $@")
genrule(name = "actual", srcs = [], outs = ["actual_file"], cmd = "echo ACTUAL > $@")
EOF

mkdir -p slow
mkfifo slow/BUILD

bazel build :proxy --disk_cache=/tmp/dc  # Prime the disk cache
bazel clean  # Clean the output tree so that the disk cache is hit again and actual_file is deleted
bazel build :top :proxy --disk_cache=/tmp/dc &
# Wait until :proxy builds; most principled is reading the BEP
# (I just used manual waiting for a few seconds, which is too brittle for an actual test)
echo "filegroup(name = 'slow', visibility = ['//visibility:public'])" > slow/BUILD  # Unblock analyzing :top
wait %1  # Wait for the second Bazel invocation

# Behold the absence of bazel-bin/actual_file

lberki avatar Apr 05 '24 06:04 lberki

From tracing the code with the debugger, it seems like your theory is correct, Lukacs. actual_file was first generated via :proxy and then only recognized later as a top level output by :top.

Each time a top level target is analyzed, we add its output artifacts to a Trie of pathsToDownload. This Trie is later on used to decide whether to download an output artifact or not, upon the completion of an action.

Naive question: is it feasible to just attempt to download every artifact in this Trie at the end of the build? That would take away the importance of when an artifact is registered as "top-level".

joeleba avatar Apr 05 '24 09:04 joeleba

That'll be for @coeuvre to decide -- that's definitely the simplest solution although in incurs a performance hit, since the download would start potentially much later than the moment Bazel realizes that a file needs to be downloaded.

lberki avatar Apr 05 '24 10:04 lberki

I think the idea here is that we still download action outputs during action execution if we can, but we need a final pass at the end for anything that was only marked for download after the respective action finished.

We could also consider storing top-level outputs that came from an aspect separately, so we only have to iterate those at the end, instead of the entire pathsToDownload (I'm not sure if that matters much; we can probably start without it).

tjgq avatar Apr 05 '24 11:04 tjgq

Naive question: is it feasible to just attempt to download every artifact in this Trie at the end of the build? That would take away the importance of when an artifact is registered as "top-level".

We tried a similar idea before and encountered https://github.com/bazelbuild/bazel/issues/17798.

coeuvre avatar Apr 05 '24 11:04 coeuvre

Another idea is: attempt downloading toplevel outputs immediately when we receive analyzed event from skymeld.

No, this could still trigger #17798.

coeuvre avatar Apr 05 '24 12:04 coeuvre

Are we still targeting this for 7.2.0? (first RC next Monday, the 13th)

Wyverald avatar May 06 '24 23:05 Wyverald

Yes, I am working on the fix now.

coeuvre avatar May 07 '24 13:05 coeuvre

A fix for this issue has been included in Bazel 7.2.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.0rc1. Thanks!

iancha1992 avatar May 16 '24 00:05 iancha1992