bazel icon indicating copy to clipboard operation
bazel copied to clipboard

remote_coverage_tools/Main cannot locate runfiles directory when building with `--nobuild_runfile_links`

Open UebelAndre opened this issue 1 year ago • 24 comments

Description of the bug:

Whenever --nobuild_runfiles_links is enabled I encounter a failure where the remote_coverage_tools/Main tool errors and fails to find runfiles. The logging is confusing as searching appears as though it should be inhibited.

https://buildkite.com/bazel/rules-rust-rustlang/builds/10120#018c7929-4564-4947-9fa6-ea73aa7e0e26

+ JAVA_RUNFILES=
+ exec bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/remote_coverage_tools/Main --coverage_dir=/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/rust_test_suite/tests_suite_tests/integrated_test_b/_coverage --output_file=/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/rust_test_suite/tests_suite_tests/integrated_test_b/coverage.dat --filter_sources=/usr/bin/.+ --filter_sources=/usr/lib/.+ --filter_sources=/usr/include.+ --filter_sources=/Applications/.+ --source_file_manifest=/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-fastbuild/bin/test/rust_test_suite/tests_suite_tests/integrated_test_b.instrumented_files
/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/remote_coverage_tools/Main: Cannot locate runfiles directory. (Set $JAVA_RUNFILES to inhibit searching.)

This issue is reproduced on https://github.com/bazelbuild/rules_rust/pull/2340

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

This issue is reproduced on https://github.com/bazelbuild/rules_rust/pull/2340

Which operating system are you running Bazel on?

Linux, MacOS

What is the output of bazel info release?

7.0.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

UebelAndre avatar Dec 17 '23 19:12 UebelAndre

cc @scentini @krasimirgg

UebelAndre avatar Dec 17 '23 19:12 UebelAndre

Might relate to https://github.com/bazelbuild/bazel/issues/4685

UebelAndre avatar Dec 17 '23 19:12 UebelAndre

@sgowroji this bug is reproduced without remote execution. Is that tag correct?

UebelAndre avatar Dec 18 '23 15:12 UebelAndre

Does the bug go away if you use --noenable_runfiles instead of --nobuild_runfile_links?

This is admittedly very confusing, but --noenable_runfiles means "no runfile symlinks are created ever", while --nobuild_runfilke_links means "no runfile symlinks are created when the binary is built, but they'll be created just in time before the binary is executed (as a build action, bazel test or bazel run)". However, here you're invoking the binary directly outside of Bazel, so Bazel can't create them just in time.

Alternatively, you might be able to hack around it by editing the Java binary wrapper script to say RUNFILES_MANIFEST_ONLY=1 (see here - the placeholder is replaced at build time depending on the state of --enable_runfiles).

tjgq avatar Dec 18 '23 16:12 tjgq

(Note: assigning to the core team because they're the best owner for runfiles-related issues. I'm not actually convinced there's a bug, this might just be an unfortunate consequence of how things work.)

tjgq avatar Dec 18 '23 16:12 tjgq

However, here you're invoking the binary directly outside of Bazel

On a second look, I'm not sure this statement is true. But I'm still interested in the result of running with --noenable_runfiles, as well as the value of RUNFILES_MANIFEST_ONLY in the wrapper script.

tjgq avatar Dec 18 '23 16:12 tjgq

Does the bug go away if you use --noenable_runfiles instead of --nobuild_runfile_links?

This is fairly different behavior but I can add it to my PR and show you the logs

--nobuild_runfilke_links means "no runfile symlinks are created when the binary is built, but they'll be created just in time before the binary is executed (as a build action, bazel test or bazel run)". However, here you're invoking the binary directly outside of Bazel, so Bazel can't create them just in time.

I don't believe I'm doing such a thing. As far as I can tell everything that's running with the exception of the statically linked coverage binary from _collect_cc_coverage.

UebelAndre avatar Dec 18 '23 16:12 UebelAndre

Does the bug go away if you use --noenable_runfiles instead of --nobuild_runfile_links?

This is fairly different behavior but I can add it to my PR and show you the logs

It seems like --noenable_runfiles makes this issue go away but many other tests (expectedly) fail https://buildkite.com/bazel/rules-rust-rustlang/builds/10135#018c7dd1-2133-4a46-adab-babcdde9cf43. However, --noenable_runfiles is quite a different flag and not the desired behavior here. I'm not sure what you mean by:

here you're invoking the binary directly outside of Bazel

but it seems like the way coverage is setup should be to take this java tool as a binary so runfiles are kept.

UebelAndre avatar Dec 18 '23 16:12 UebelAndre

I'm not actually convinced there's a bug, this might just be an unfortunate consequence of how things work.

Sounds like "It's not a bug, it's a feature" 😄. Arguable though lol

UebelAndre avatar Dec 18 '23 16:12 UebelAndre

It seems like --noenable_runfiles makes this issue go away but many other tests (expectedly) fail

Ok, that validates the theory that the problem is that the Main binary is under the assumption that the runfiles tree will exist at the time it runs, because it was built with --nobuild_runfile_links. If you build it with --noenable_runfiles, it makes no such assumption. (I understand that setting --noenable_runfiles is not a viable workaround.)

I'm not sure what you mean by here you're invoking the binary directly outside of Bazel but it seems like the way coverage is setup should be to take this java tool as a binary so runfiles are kept.

Yeah, I'm now convinced that there's an actual bug here. Judging from the logs, it looks like the following is happening:

  • bazel coverage invokes collect_coverage.sh
  • collect_coverage.sh assembles an LCOV_MERGER_CMD, then invokes it
  • LCOV_MERGER_CMD runs the (previously built?) .../remote_coverage_tools/Main binary
  • The binary is unable to find its runfiles symlink tree, because it was neither created at build time nor just-in-time

So I expect there's some missing logic in the coverage command to ensure that the runfiles symlink tree for .../remote_coverage_tools/Main has been created before the command runs, if --nobuild_runfile_links is set. (For comparison, the local spawn runner does this by calling RunfilesTreeUpdater#updateRunfiles before executing the spawn; here it's a bit more involved because the runfiles aren't for the binary that is directly invoked by Bazel, but for a binary that it transitively invokes.)

tjgq avatar Dec 18 '23 17:12 tjgq

This would be a great one to get fixed since --nobuild_runfiles_links is a common flag in all large python code bases I'm familiar with but now to introduce rust either breaks coverage or explodes build times. Neither of which are acceptable trades.

UebelAndre avatar Dec 24 '23 17:12 UebelAndre

I also see this when I attempt to generate coverage reports using llvm-cov from a clang cc toolchain :(

UebelAndre avatar Jan 09 '24 14:01 UebelAndre

A fairly simple solution for this would be to replace the LCOV merger with a Graal native image, which as far as I know is the setup Google is using internally. This is already done for Turbine now, it would "just" require splitting remote_coverage_tools into one repo per architecture.

cc @hvadehra

fmeum avatar Jan 09 '24 15:01 fmeum

Is that something that can be done for Bazel 7.1.0? 🙏

UebelAndre avatar Jan 09 '24 16:01 UebelAndre

Any updates here? This is still impacting rules_rust.

UebelAndre avatar Mar 31 '24 13:03 UebelAndre

Ping here @tjgq @scentini @hvadehra

UebelAndre avatar Apr 24 '24 15:04 UebelAndre

@c-mita WDYT about the idea in https://github.com/bazelbuild/bazel/issues/20577#issuecomment-1883211431?

hvadehra avatar May 07 '24 13:05 hvadehra

Friendly ping @c-mita

UebelAndre avatar May 17 '24 12:05 UebelAndre

also interested. friendly ping @c-mita :)

Ryang20718 avatar Jun 07 '24 20:06 Ryang20718

We currently are trying to upgrade to bazel 7 with a rust + python codebase. Allowing the building of runfile links significantly slows down build time. is there a workaround in the meantime to allow --nobuild_runfile_links but still generate coverage files?

Ryang20718 avatar Jun 10 '24 01:06 Ryang20718

I just discovered that this has been fixed by https://github.com/bazelbuild/bazel/commit/404d8d97fafedb7c43a3c164a7446d2e1bf2e436. I sent https://github.com/bazelbuild/bazel/pull/22676 to verify this in tests.

fmeum avatar Jun 10 '24 13:06 fmeum

Is this fix gonna be in Bazel 7.3.0?

UebelAndre avatar Jun 12 '24 05:06 UebelAndre

This commit is part of a series of cleanups that look challenging to cherry-pick. @lberki Can you assess how much work that would be?

fmeum avatar Jun 12 '24 06:06 fmeum

really hoping that this can go in to 7.3.0! we're still stuck on bazel 6.5 because of this. friendly ping @lberki

Ryang20718 avatar Jun 22 '24 20:06 Ryang20718

Friendly ping @lberki @sgowroji wondering whether this https://github.com/bazelbuild/bazel/commit/c8006084e70f7faa942572808b6329b044f0f7ab fix will be cherrypicked into the next release?

Ryang20718 avatar Jul 22 '24 17:07 Ryang20718

I've also come across this issue and wondering if it's possible to backport to 7.X?

tomrenn avatar Aug 12 '24 16:08 tomrenn

@tomrenn we can include this in 7.4.0

iancha1992 avatar Aug 12 '24 21:08 iancha1992

@bazel-io fork 7.4.0

iancha1992 avatar Aug 12 '24 21:08 iancha1992