bazel
bazel copied to clipboard
remote_coverage_tools/Main cannot locate runfiles directory when building with `--nobuild_runfile_links`
Description of the bug:
Whenever --nobuild_runfiles_links is enabled I encounter a failure where the remote_coverage_tools/Main tool errors and fails to find runfiles. The logging is confusing as searching appears as though it should be inhibited.
https://buildkite.com/bazel/rules-rust-rustlang/builds/10120#018c7929-4564-4947-9fa6-ea73aa7e0e26
+ JAVA_RUNFILES=
+ exec bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/remote_coverage_tools/Main --coverage_dir=/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/rust_test_suite/tests_suite_tests/integrated_test_b/_coverage --output_file=/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-fastbuild/testlogs/test/rust_test_suite/tests_suite_tests/integrated_test_b/coverage.dat --filter_sources=/usr/bin/.+ --filter_sources=/usr/lib/.+ --filter_sources=/usr/include.+ --filter_sources=/Applications/.+ --source_file_manifest=/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-fastbuild/bin/test/rust_test_suite/tests_suite_tests/integrated_test_b.instrumented_files
/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3234/execroot/rules_rust/bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/remote_coverage_tools/Main: Cannot locate runfiles directory. (Set $JAVA_RUNFILES to inhibit searching.)
This issue is reproduced on https://github.com/bazelbuild/rules_rust/pull/2340
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
This issue is reproduced on https://github.com/bazelbuild/rules_rust/pull/2340
Which operating system are you running Bazel on?
Linux, MacOS
What is the output of bazel info release?
7.0.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response
cc @scentini @krasimirgg
Might relate to https://github.com/bazelbuild/bazel/issues/4685
@sgowroji this bug is reproduced without remote execution. Is that tag correct?
Does the bug go away if you use --noenable_runfiles instead of --nobuild_runfile_links?
This is admittedly very confusing, but --noenable_runfiles means "no runfile symlinks are created ever", while --nobuild_runfilke_links means "no runfile symlinks are created when the binary is built, but they'll be created just in time before the binary is executed (as a build action, bazel test or bazel run)". However, here you're invoking the binary directly outside of Bazel, so Bazel can't create them just in time.
Alternatively, you might be able to hack around it by editing the Java binary wrapper script to say RUNFILES_MANIFEST_ONLY=1 (see here - the placeholder is replaced at build time depending on the state of --enable_runfiles).
(Note: assigning to the core team because they're the best owner for runfiles-related issues. I'm not actually convinced there's a bug, this might just be an unfortunate consequence of how things work.)
However, here you're invoking the binary directly outside of Bazel
On a second look, I'm not sure this statement is true. But I'm still interested in the result of running with --noenable_runfiles, as well as the value of RUNFILES_MANIFEST_ONLY in the wrapper script.
Does the bug go away if you use
--noenable_runfilesinstead of--nobuild_runfile_links?
This is fairly different behavior but I can add it to my PR and show you the logs
--nobuild_runfilke_linksmeans "no runfile symlinks are created when the binary is built, but they'll be created just in time before the binary is executed (as a build action, bazel test or bazel run)". However, here you're invoking the binary directly outside of Bazel, so Bazel can't create them just in time.
I don't believe I'm doing such a thing. As far as I can tell everything that's running with the exception of the statically linked coverage binary from _collect_cc_coverage.
Does the bug go away if you use
--noenable_runfilesinstead of--nobuild_runfile_links?This is fairly different behavior but I can add it to my PR and show you the logs
It seems like --noenable_runfiles makes this issue go away but many other tests (expectedly) fail https://buildkite.com/bazel/rules-rust-rustlang/builds/10135#018c7dd1-2133-4a46-adab-babcdde9cf43. However, --noenable_runfiles is quite a different flag and not the desired behavior here. I'm not sure what you mean by:
here you're invoking the binary directly outside of Bazel
but it seems like the way coverage is setup should be to take this java tool as a binary so runfiles are kept.
I'm not actually convinced there's a bug, this might just be an unfortunate consequence of how things work.
Sounds like "It's not a bug, it's a feature" 😄. Arguable though lol
It seems like --noenable_runfiles makes this issue go away but many other tests (expectedly) fail
Ok, that validates the theory that the problem is that the Main binary is under the assumption that the runfiles tree will exist at the time it runs, because it was built with --nobuild_runfile_links. If you build it with --noenable_runfiles, it makes no such assumption. (I understand that setting --noenable_runfiles is not a viable workaround.)
I'm not sure what you mean by here you're invoking the binary directly outside of Bazel but it seems like the way coverage is setup should be to take this java tool as a binary so runfiles are kept.
Yeah, I'm now convinced that there's an actual bug here. Judging from the logs, it looks like the following is happening:
bazel coverageinvokescollect_coverage.shcollect_coverage.shassembles anLCOV_MERGER_CMD, then invokes itLCOV_MERGER_CMDruns the (previously built?).../remote_coverage_tools/Mainbinary- The binary is unable to find its runfiles symlink tree, because it was neither created at build time nor just-in-time
So I expect there's some missing logic in the coverage command to ensure that the runfiles symlink tree for .../remote_coverage_tools/Main has been created before the command runs, if --nobuild_runfile_links is set. (For comparison, the local spawn runner does this by calling RunfilesTreeUpdater#updateRunfiles before executing the spawn; here it's a bit more involved because the runfiles aren't for the binary that is directly invoked by Bazel, but for a binary that it transitively invokes.)
This would be a great one to get fixed since --nobuild_runfiles_links is a common flag in all large python code bases I'm familiar with but now to introduce rust either breaks coverage or explodes build times. Neither of which are acceptable trades.
I also see this when I attempt to generate coverage reports using llvm-cov from a clang cc toolchain :(
A fairly simple solution for this would be to replace the LCOV merger with a Graal native image, which as far as I know is the setup Google is using internally. This is already done for Turbine now, it would "just" require splitting remote_coverage_tools into one repo per architecture.
cc @hvadehra
Is that something that can be done for Bazel 7.1.0? 🙏
Any updates here? This is still impacting rules_rust.
Ping here @tjgq @scentini @hvadehra
@c-mita WDYT about the idea in https://github.com/bazelbuild/bazel/issues/20577#issuecomment-1883211431?
Friendly ping @c-mita
also interested. friendly ping @c-mita :)
We currently are trying to upgrade to bazel 7 with a rust + python codebase. Allowing the building of runfile links significantly slows down build time. is there a workaround in the meantime to allow --nobuild_runfile_links but still generate coverage files?
I just discovered that this has been fixed by https://github.com/bazelbuild/bazel/commit/404d8d97fafedb7c43a3c164a7446d2e1bf2e436. I sent https://github.com/bazelbuild/bazel/pull/22676 to verify this in tests.
Is this fix gonna be in Bazel 7.3.0?
This commit is part of a series of cleanups that look challenging to cherry-pick. @lberki Can you assess how much work that would be?
really hoping that this can go in to 7.3.0! we're still stuck on bazel 6.5 because of this. friendly ping @lberki
Friendly ping @lberki @sgowroji wondering whether this https://github.com/bazelbuild/bazel/commit/c8006084e70f7faa942572808b6329b044f0f7ab fix will be cherrypicked into the next release?
I've also come across this issue and wondering if it's possible to backport to 7.X?
@tomrenn we can include this in 7.4.0
@bazel-io fork 7.4.0