failed: I/O exception during sandboxed execution: No such file or directory

Open Ryang20718 opened this issue 9 months ago • 10 comments

Description of the bug:

Periodically, the following error would occur when running tests

 Testing <blah> failed: I/O exception during sandboxed execution: /dev/shm/bazel-sandbox.34e9fe25bb0c6624a8ba8f5a00a18c3243dfc943119e415e09934c77b955441f/linux-sandbox/148/stats.out (No such file or directory)

we're on bazel 6.5.0 with spawn strategy linux-sandbox, jobs set to 1:1 with vcpus, sandbox mounted at /dev/shm

Whenever this occurs, we see system memory usage at 82-83% with cpu maxed at 100%.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I don't have a reliable repro. It happens sporadically.

Which operating system are you running Bazel on?

ubuntu 20.04

What is the output of bazel info release?

release 6.5.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

This has been occurring more frequently since we switched to 6.5.0 from 6.3.2.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

Ryang20718 avatar Apr 26 '24 15:04 Ryang20718

@Ryang20718 Could you please provide sample code and complete steps to reproduce this issue? Also, please try updating your Bazel to one of our latest releases (See https://github.com/bazelbuild/bazel/releases). Thank you.

iancha1992 avatar Apr 26 '24 23:04 iancha1992

I don't have a reliable repro; it just periodically happens when running large amounts of tests. we're in the process of upgrading to bazel 7, but still need to upgrade some dependencies to get there

Ryang20718 avatar Apr 27 '24 16:04 Ryang20718

To provide another data point: I have hit a similar error message in the rules_git log. rules_git performs the git checkout inside an action, into a declared directory.

It is reproducible when the same cache is used but not consistent across executions. For example, the remote execution passed fine.

That run used Bazel 7.1.1.

mattyclarkson avatar May 03 '24 08:05 mattyclarkson

https://github.com/bazelbuild/bazel/issues/20976

nikhilkalige avatar May 11 '24 20:05 nikhilkalige

Does everyone affected by this NOT have dynamic execution enabled?

oquenchil avatar May 14 '24 10:05 oquenchil

No dynamic execution was enabled for rules_git

mattyclarkson avatar May 14 '24 20:05 mattyclarkson

no dynamic execution (this is with a local execution with remote cache)

Ryang20718 avatar May 14 '24 21:05 Ryang20718

Aha, the remote cache bit is interesting too. @mattyclarkson did you have remote cache enabled too?

oquenchil avatar May 15 '24 11:05 oquenchil

The "remote" build passed which was using remote execution and remote cache.

The "local" build failed, which was running locally on the GitLab runner instance and was using a disk cache (which is stored/restored from the GitLab runner S3 bucket).

mattyclarkson avatar May 15 '24 12:05 mattyclarkson

adding some details here:

We've seen this same error in the following situations:

  1. System Mem is close to OOM
  2. System Mem is only at 50% usage

Originally I had thought it was a system error, but the 2nd bullet point indicates otherwise. (also had plenty of inodes + disk storage)

Ryang20718 avatar May 15 '24 16:05 Ryang20718

@oquenchil While I don't understand why exactly this is happening, it looks like linux-sandbox.cc has a number of ways to exit abnormally (e.g. via DIE on syscall failures) that would result in the spawn failing without the stats.out file having been written. What do you think of making the error here recoverable, perhaps showing or logging a warning: https://github.com/bazelbuild/bazel/blob/23e1c5d8d267e5825552ce5b05ddfb8ae8972688/src/main/java/com/google/devtools/build/lib/shell/ExecutionStatistics.java#L34

fmeum avatar Jun 18 '24 07:06 fmeum

Agree with Fabian's diagnosis here. The most consistent theme from the reports in Slack was stats.out missing, most likely due to the sandbox process being killed abnormally (via OOM) or running into some disk issue and failing to write out the stats file.

Currently, on the Java side, we are catching IOException here. However, if the sandbox subprocess was executed normally and exited with a non-zero code, we will unconditionally look for the execution stats file at the end of this method. This calls into ExecutionStatistics.java, as Fabian linked above, which throws an exception because the file is not available.

Since stats collection should be a non-critical feature, it should be done on a best-effort basis. The fix should be for ExecutionStatistics.getResourceUsage() to catch the error, log a warning, and just return Optional.empty().
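
A minimal sketch of that best-effort pattern, to make the idea concrete. This is not Bazel's actual code: the class BestEffortStats, the method readStats, and the choice to return the raw bytes are all hypothetical stand-ins, and the real ExecutionStatistics.getResourceUsage() parses the file into a resource-usage object rather than returning bytes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical illustration only; names and return type are assumptions, not Bazel's API. */
final class BestEffortStats {
  private static final Logger logger = Logger.getLogger(BestEffortStats.class.getName());

  private BestEffortStats() {}

  /**
   * Reads the stats file if it exists and is readable. A missing or unreadable
   * file is treated as "no stats available" rather than as a failure of the
   * action that was supposed to produce it.
   */
  static Optional<byte[]> readStats(Path statsFile) {
    try {
      return Optional.of(Files.readAllBytes(statsFile));
    } catch (IOException e) {
      // Covers NoSuchFileException (for example, the sandbox died before
      // writing stats.out) as well as other I/O errors. Log and carry on.
      logger.log(Level.WARNING, "Could not read execution statistics from " + statsFile, e);
      return Optional.empty();
    }
  }
}
```

With this shape, a caller that only wants to attach statistics to a result can use something like readStats(path).ifPresent(...) and still report the spawn's real exit code when stats.out was never written.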

sluongng avatar Jun 18 '24 08:06 sluongng

A fix for this issue has been included in Bazel 7.2.1 RC2. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.1rc2. Thanks!

iancha1992 avatar Jun 21 '24 21:06 iancha1992

I've pinned rules_git and am using the RC in the BCR presubmit in bazelbuild/bazel-central-registry#1868. Seeing no issues.

mattyclarkson avatar Jun 24 '24 09:06 mattyclarkson

Hit an issue on a CI run of rules_git when I rebased and the CI re-ran:

ERROR: github-mozilla-deepspeech/BUILD.bazel:3:20: Testing //github-mozilla-deepspeech:checkout failed: I/O exception during sandboxed execution: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/a1208da49aaa9451b147b4d0696a68a7/execroot/_main/bazel-out/k8-fastbuild/bin/external/_main~_repo_rules~github-mozilla-deepspeech-0.9.3/checkout/tensorflow/native_client/ctcdecode/third_party/openfst-1.6.7/src/include/fst/extensions/pdt (No such file or directory)

link

mattyclarkson avatar Jun 25 '24 14:06 mattyclarkson

@mattyclarkson That's a different type of bug as it's not about stats.out. Could you file a separate issue for this?

fmeum avatar Jun 25 '24 15:06 fmeum