No local fallback after cache timeout

Open miscott2 opened this issue 2 years ago • 12 comments

Description of the bug:

While running a build, our Artifactory HTTP cache timed out on a request. We're looking into why it timed out, but we expected Bazel to fall back to running the action locally. Instead, it failed the build:

ERROR: /<workspace path>/BUILD:1802:10: Compiling <source file>.c failed: unable to finalize action: Download of '/<artifactory repo path>/cas/b8f31e5fda95495273a86cc5c7395298eb321490395ca90815a9184f2a9ec980' timed out. Received 0 bytes.

The documentation suggests --remote_local_fallback only applies to remote execution, but some comments on the bug tracker suggested it might also apply to remote caching. We tried the flag and still saw the issue.

I can try to recreate the issue, but it's not totally trivial as I'll need to set up an HTTP server that can deliberately time out. So I thought I'd check whether this is expected behavior, or whether there is a bug that is obvious to someone who knows the Bazel source.

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Not trivial to reproduce! I can work on that if it would be useful.

Which operating system are you running Bazel on?

Linux - RHEL 8

What is the output of bazel info release?

release 7.0.0-pre.20231011.2- (@non-git)

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

Built from the 7.0.0-pre.20231011.2 release tag.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No

Any other information, logs, or outputs that you want to share?

No response

miscott2 avatar Nov 09 '23 16:11 miscott2

The intention behind --remote_local_fallback was indeed for it to only apply to remote execution. So this should be considered a feature request to add a similar feature for remote caching.

One possibility is to fold this work into #19904 (but that would be a fairly large project, so we might still consider implementing this differently in the interim).

tjgq avatar Nov 14 '23 10:11 tjgq

We used --remote_local_fallback with --remote_cache=<url> and it works: in case of any outage on the remote cache server side, the build proceeded without caching. It still works in Bazel 6.3.2.

JSGette avatar Nov 16 '23 15:11 JSGette

@tjgq: @JSGette's comment made me re-check our logs. All the examples I can find relate to actions that download .d files as part of cc_common.compile(). Our builds are about two-thirds compile actions, but even so the number of examples is starting to look suspicious. There are also examples of timeouts in actions that aren't part of cc_common.compile(); those correctly show up as warnings and the action is run locally.

Could there be something special about these .d files? I know that cc_common.compile() does a special end of action step to trim dependencies which I assume is using the .d files. Could there be something about that step which is making cache timeouts behave differently in this case?

I'll try to do more investigation on our end as well. I did set up my own HTTP cache that would time out for a specific CAS entry corresponding to a .d file, but so far I haven't reproduced the issue.
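
For anyone attempting a similar repro, here is a minimal sketch of such a server in Python. It assumes Bazel's HTTP cache client only needs plain GET and PUT on /ac/ and /cas/ paths and that uploads carry a Content-Length header. The names make_handler and start_cache, and the in-memory store, are hypothetical and not part of Bazel or Artifactory. A GET for any path in stall_paths is held open without a response, which is one way to provoke a client-side timeout.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(store, stall_paths, stall_seconds):
    """Build a handler for a simple GET/PUT blob cache (hypothetical helper)."""

    class CacheHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path in stall_paths:
                # Simulate a hung backend: hold the connection, send nothing.
                time.sleep(stall_seconds)
                self.close_connection = True
                return
            blob = store.get(self.path)
            if blob is None:
                self.send_error(404)  # cache miss
                return
            self.send_response(200)
            self.send_header("Content-Length", str(len(blob)))
            self.end_headers()
            self.wfile.write(blob)

        def do_PUT(self):
            # Assumption: the client sends Content-Length (no chunked uploads).
            length = int(self.headers.get("Content-Length", 0))
            store[self.path] = self.rfile.read(length)
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # keep the console quiet

    return CacheHandler


def start_cache(port, stall_paths, stall_seconds=120.0):
    """Start the cache on 127.0.0.1 in a daemon thread; returns (server, store)."""
    store = {}
    handler = make_handler(store, set(stall_paths), stall_seconds)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, store
```

To try it, call start_cache(8080, {"/cas/<digest of the .d file>"}) from a script that stays alive, and point Bazel at it with --remote_cache=http://127.0.0.1:8080. Note that a setup along these lines had not reproduced the failure at the time of this comment.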

miscott2 avatar Nov 20 '23 16:11 miscott2

I'm also seeing something similar to this in Bazel 7 without BwtB:

11:01:10 ERROR: Foo/BUILD.bazel:11:15: Compiling Foo.c failed: unable to finalize action: Missing digest: <number>/<number> for bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-<sha>/bin/path/to/Foo.d

Interestingly, we're only seeing this for the .d files as well.

luispadron avatar May 13 '24 02:05 luispadron

@tjgq is this actually a feature request? It feels like a bug since this works just fine for us in Bazel 6

luispadron avatar May 13 '24 15:05 luispadron

There are two distinct issues here.

  1. The fact that --remote_local_fallback doesn't cause a fallback to occur for local execution with a remote cache is a feature request (because the flag is only supposed to have an effect for remote execution).
  2. The fact that action execution fails with unable to finalize action may be a bug. Note that the error message in the original report is a timeout, while yours is a missing digest (so it's unclear that we're looking at the same root cause).

A missing digest means that Bazel was previously made aware of the existence of a digest in the remote cache, but it's no longer there by the time it tries to download it. The "build without the bytes" default has changed between Bazel 6 and 7, which widens the window between these two events (during which the blob can be evicted, i.e., deleted from the cache). It's completely up to the remote cache to decide for how long to keep an entry around; Bazel does not set an explicit lifetime nor ask for entries to be deleted.
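
The timing argument can be illustrated with a toy model. This sketch assumes a cache that evicts blobs a fixed number of seconds after upload, which is only one possible policy (real caches may evict by size or recency). EvictingCache and download_succeeds are hypothetical names and do not correspond to any Bazel API.

```python
class EvictingCache:
    """Toy remote cache that drops a blob `ttl` seconds after it was stored."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._stored_at = {}

    def put(self, digest, now):
        self._stored_at[digest] = now

    def has(self, digest, now):
        stored = self._stored_at.get(digest)
        return stored is not None and now - stored < self.ttl


def download_succeeds(cache, digest, learned_at, delay):
    """The client learns that `digest` exists at `learned_at` and only
    fetches it `delay` seconds later. Returns False on a 'missing digest'."""
    if not cache.has(digest, learned_at):
        raise ValueError("client was never told about this digest")
    return cache.has(digest, learned_at + delay)
```

In this model, a short delay between learning about a digest and downloading it stays inside the eviction window, while a long delay (as can happen when downloads are deferred) falls outside it and the download fails.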

Does the remote cache implementation you're using provide any sort of log that could be used to determine whether the missing digest used to be there, and if so, the reason why it was evicted?

The fact that this only happens with .d files is suspicious (they are, in fact, something of an edge case in Bazel), but to be frank, without a repro I'm not really sure where I should be looking for a bug.

tjgq avatar May 13 '24 16:05 tjgq

Thanks for the reply. Yeah, the errors are slightly different, but related because of the .d files.

FWIW we're using --remote_download_outputs=all, so I was expecting nothing to change here for us. Any ideas what to check next besides the remote cache logs? I can open a separate issue for this if you think that makes sense too.

luispadron avatar May 13 '24 16:05 luispadron

I have a hunch: does setting --noexperimental_inmemory_dotd_files make the issue go away?

Otherwise, capturing a --experimental_remote_grpc_log (log of all of the interactions between Bazel and the remote cache) should make it possible to check whether Bazel was indeed told by the remote cache that the digest was present (and how much time elapsed until it tried to download it).

tjgq avatar May 13 '24 16:05 tjgq

Thanks for the suggestion, we're testing out --noexperimental_inmemory_dotd_files now.

luispadron avatar May 13 '24 16:05 luispadron

@tjgq So --noexperimental_inmemory_dotd_files does seem to work, at least we haven't hit this issue in a few iterations. Should I open a separate issue for that or is this known?

luispadron avatar May 15 '24 19:05 luispadron

Thanks for confirming my suspicion; that gives me a hint as to where the problem might be. Do you mind filing a fresh issue so we can track it separately?

tjgq avatar May 15 '24 19:05 tjgq

I filed https://github.com/bazelbuild/bazel/issues/22387, thanks!

luispadron avatar May 15 '24 21:05 luispadron