failed: I/O exception during sandboxed execution: No such file or directory
Description of the bug:
Periodically, the following error would occur when running tests
Testing <blah> failed: I/O exception during sandboxed execution: /dev/shm/bazel-sandbox.34e9fe25bb0c6624a8ba8f5a00a18c3243dfc943119e415e09934c77b955441f/linux-sandbox/148/stats.out (No such file or directory)
We're on Bazel 6.5.0 with spawn strategy linux-sandbox, jobs set to 1:1 with vCPUs, and the sandbox mounted at /dev/shm.
Whenever this occurs, we see system memory usage at 82-83% with cpu maxed at 100%.
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I don't have a reliable repro; it happens sporadically.
Which operating system are you running Bazel on?
ubuntu 20.04
What is the output of bazel info release?
release 6.5.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
This has been occurring more frequently since we switched from 6.3.2 to 6.5.0.
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response
@Ryang20718 Could you please provide sample code and complete steps to reproduce this issue? Also, please try updating your Bazel to one of our latest releases (See https://github.com/bazelbuild/bazel/releases). Thank you.
I don't have a reliable repro; it just happens periodically when running a large number of tests. We're in the process of upgrading to Bazel 7, but still need to upgrade some dependencies to get there.
To provide another data point: I have hit a similar error message in the rules_git log. That does the git checkout in the action into a declared directory.
It is reproducible when the same cache is used but not consistent across executions. For example, the remote execution passed fine.
That run used Bazel 7.1.1.
https://github.com/bazelbuild/bazel/issues/20976
Does everyone affected by this NOT have dynamic execution enabled?
No dynamic execution was enabled for rules_git
no dynamic execution (this is with a local execution with remote cache)
Aha, the remote cache bit is interesting too. @mattyclarkson did you have remote cache enabled too?
The "remote" build passed which was using remote execution and remote cache.
The "local" build failed, which was running locally on the GitLab runner instance and was using a disk cache (which is stored/restored from the GitLab runner S3 bucket).
adding some details here:
We've seen this same error in the following situations:
- System Mem is close to OOM
- System Mem is only at 50% usage
Originally I had thought it was a system error, but the 2nd bullet point indicates otherwise. (also had plenty of inodes + disk storage)
@oquenchil While I don't understand why exactly this is happening, it looks like linux-sandbox.cc has a number of ways to exit abnormally (e.g. via DIE on syscall failures) that would result in the spawn failing without the stats.out file having been written. What do you think of making the error here recoverable, perhaps showing or logging a warning:
https://github.com/bazelbuild/bazel/blob/23e1c5d8d267e5825552ce5b05ddfb8ae8972688/src/main/java/com/google/devtools/build/lib/shell/ExecutionStatistics.java#L34
Agree with Fabian's diagnosis here. The most consistent theme from the reports in Slack was stats.out missing, most likely due to the sandbox process being killed abnormally (via OOM) or running into some disk issue and failing to write out the stats file.
Currently, on the Java side, we are catching IOException here.
However, if the sandbox subprocess was executed normally and exited with a non-zero code, we will unconditionally look for the execution stats file at the end of this method. This calls into ExecutionStatistics.java, as Fabian linked above, and throws an exception for the file not being available.
Since stats collection should be a non-critical feature, it should be done on a best-effort basis. The fix should be for ExecutionStatistics.getResourceUsage() to catch the error, log a warning, and just return Optional.empty().
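For illustration, the best-effort approach proposed above could look roughly like the following. This is a minimal standalone sketch, not Bazel's actual code: the class name BestEffortStats, the nested ResourceUsage holder, and the idea of just recording the file size are all assumptions made up for this example. The real ExecutionStatistics parses the stats file's contents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.logging.Logger;

// Hypothetical stand-in for ExecutionStatistics; not the real Bazel class.
final class BestEffortStats {
  private static final Logger logger = Logger.getLogger(BestEffortStats.class.getName());

  private BestEffortStats() {}

  // Hypothetical holder; the real resource usage type carries parsed rusage data.
  static final class ResourceUsage {
    final int sizeBytes;

    ResourceUsage(int sizeBytes) {
      this.sizeBytes = sizeBytes;
    }
  }

  // Reads the stats file if it exists. A missing or unreadable file is logged
  // as a warning and reported as "no statistics" instead of failing the caller.
  static Optional<ResourceUsage> getResourceUsage(Path statsFile) {
    try {
      byte[] raw = Files.readAllBytes(statsFile);
      return Optional.of(new ResourceUsage(raw.length));
    } catch (IOException e) {
      // NoSuchFileException is an IOException, so a missing stats.out lands here.
      logger.warning("Could not read execution statistics from " + statsFile + ": " + e);
      return Optional.empty();
    }
  }
}
```

With this shape, a caller that sees the sandboxed process exit with a non-zero code can still report the original failure, since BestEffortStats.getResourceUsage(path) returns an empty Optional rather than throwing when stats.out was never written.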
A fix for this issue has been included in Bazel 7.2.1 RC2. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.1rc2. Thanks!
I've pinned rules_git and am using the RC in the BCR presubmit in bazelbuild/bazel-central-registry#1868. Seeing no issues.
Hit an issue on a CI run of rules_git when I rebased and the CI re-ran:
ERROR: github-mozilla-deepspeech/BUILD.bazel:3:20: Testing //github-mozilla-deepspeech:checkout failed: I/O exception during sandboxed execution: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/a1208da49aaa9451b147b4d0696a68a7/execroot/_main/bazel-out/k8-fastbuild/bin/external/_main~_repo_rules~github-mozilla-deepspeech-0.9.3/checkout/tensorflow/native_client/ctcdecode/third_party/openfst-1.6.7/src/include/fst/extensions/pdt (No such file or directory)
@mattyclarkson That's a different type of bug as it's not about stats.out. Could you file a separate issue for this?