
[7.6.1] `build --features=thin_lto` may fail with `error reading imports file .o.imports: Missing digest:`

Open gdh1995 opened this issue 3 months ago • 19 comments

Description of the bug:

Hello, I recently upgraded Bazel from 6.5.0 to 7.6.1 and ran into a remote cache error. I tried --experimental_remote_cache_eviction_retries=5, but it doesn't help.

LTO Backend Compile bazel-out/arm-opt/bin/path/to/libbar.so.lto/bazel-out/arm-opt/bin/path/to/_objs/another/another.pic.o failed: \
error reading imports file /workdir/workspace/repo/.cache/execroot/my_workspace/bazel-out/arm-opt/bin/path/to/libbar.so.lto/bazel-out/arm-opt/bin/path/to/_objs/another/another.pic.o.imports: \
Missing digest: HASH/LENGTH for bazel-out/arm-opt/bin/path/to/libbar.so.lto/bazel-out/arm-opt/bin/path/to/_objs/another/another.pic.o

Details

  • --remote_download_outputs is the default value toplevel
  • and I also enabled --experimental_remote_cache_eviction_retries=5, so I expected Bazel to retry automatically several times.
  • --remote_local_fallback=true
  • --incompatible_allow_tags_propagation is the default value true
cc_binary(
  name = "bar",
  linkshared = True,
  # I want to use "no-cache" to disable uploading libbar.so (which is too big: about 1.5GB),
  # however it seems `LTO indexing` inherits all the tags (though `LTO Backend Compile` doesn't inherit them)
  tags = ["no-cache", "no-remote"],
  deps = [...],
)

And the command line was bazel build --remote_cache=http://my_cache //path/to:some_pkg

But the build log contains nothing like ERROR: Build did NOT complete successfully\nFound transient remote cache error, retrying the build...

After reading the source code of version 7.6.1, I think it's a bug in Bazel:

  • experimental_remote_cache_eviction_retries is only checked by
    • RemoteSpawnRunner::exec -> execLocallyAndUploadOrFail -> handleError
    • LocalSpawnRunner::exec -> prefetchInputsAndWait -> prefetchInputs
    • BlazeCommandDispatcher::exec -> execExclusively
  • but, error reading imports file is in LtoBackendAction::discoverInputs,
    • called by ActionExecutionFunction::checkCacheAndExecuteIfNeeded
    • called by ActionExecutionFunction::compute
    • called by AbstractParallelEvaluator::run
    • (there are too many .run() methods; not sure which one calls it)
  • Then the LtoBackendAction::discoverInputs is not protected by remoteRetryOnTransientCacheError

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Ubuntu 20.04, x86_64

What is the output of bazel info release?

release 7.6.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

(A private repo)

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

I did not find any existing LTO + remote cache issue.

Any other information, logs, or outputs that you want to share?

No response

gdh1995 avatar Sep 11 '25 05:09 gdh1995

Does this repro at a later Bazel version, like 8 or HEAD?

jin avatar Sep 16 '25 13:09 jin

Hi, I think we're observing this, too.

ERROR: <path>/BUILD.bazel:54:10: LTO Backend Compile <path>/init.pic.o failed: error reading imports file <path>/init.pic.o.imports Missing digest: 79a19f0ec3e5fc00ad50f7236e4b2c2cc13ed24e5ad5b3d9c05858b7c8e91c7c/257 for init.pic.o.imports

Does this repro at a later Bazel version, like 8 or HEAD?

We're simply trying to upgrade from bazel 6.4.0 -> 7.6.1 (which has been painful). We'd like to complete the move to 7 first, before attempting to upgrade to bazel 8.

Is there more info I can provide?


This didn't reproduce when I rekicked this particular build. I'm not sure yet how intermittent this issue may be.

nickdesaulniers avatar Sep 23 '25 20:09 nickdesaulniers

cc @coeuvre @tjgq @fmeum

This issue smells like:

  • https://github.com/bazelbuild/bazel/issues/22854
  • https://github.com/bazelbuild/bazel/issues/20161
  • https://github.com/bazelbuild/bazel/issues/22387
  • https://github.com/bazelbuild/bazel/issues/18696

but based on @gdh1995 's analysis, seems specific to LTO?

Then the LtoBackendAction::discoverInputs is not protected by remoteRetryOnTransientCacheError

nickdesaulniers avatar Sep 23 '25 20:09 nickdesaulniers

My recent work on aligning build rewinding with action rewinding should apply to LTO as well due to this check after input discovery fails: https://cs.opensource.google/bazel/bazel/+/cc051e320207419090be6532b5f89a1518f6e500:src/main/java/com/google/devtools/build/lib/skyframe/SkyframeActionExecutor.java;l=958

This would only trigger if the remote cache actually lost the file (not on a temporary connection error) and may require Bazel 8.

fmeum avatar Sep 24 '25 06:09 fmeum

and may require Bazel 8.

😬 was hoping to transition to bazel 7 first, then work through transitioning to bazel 8. Afraid the bazel 6 -> bazel 7 upgrade could get rolled back for us due to this.

nickdesaulniers avatar Sep 24 '25 22:09 nickdesaulniers

Hi @fmeum I'm seeing this failure pretty regularly. Any other tips or workarounds (perhaps from the issues I link to above)? Perhaps a minimum version of bazel 8 to try upgrading to?

nickdesaulniers avatar Oct 06 '25 22:10 nickdesaulniers

What is even implied by the error message? Why would a digest be missing?

nickdesaulniers avatar Oct 10 '25 17:10 nickdesaulniers

It's missing from the remote cache. Bazel only has the digest and now needs to read the file, which leads to a request to the remote cache for the content.
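
A rough way to check this from the command line, assuming the cache speaks Bazel's HTTP caching protocol and serves SHA-256 blobs under /cas/<hash> (bazel-remote does this over HTTP). The helper names cas_url and blob_exists are made up for illustration, and http://my_cache stands in for your cache URL. Only the hash part of HASH/LENGTH goes into the URL, and the path may differ with a non-default --digest_function.

```shell
#!/bin/bash
# Sketch: ask an HTTP remote cache whether a CAS blob still exists.
# Assumption: blobs are served at <cache_url>/cas/<sha256 hash>.

# cas_url CACHE_URL HASH: print the URL of the blob (hypothetical helper).
cas_url() {
  local base="${1%/}"   # drop one trailing slash, if present
  printf '%s/cas/%s\n' "$base" "$2"
}

# blob_exists CACHE_URL HASH: exit 0 only if the cache answers HTTP 200.
# The blob body is downloaded and discarded. A connection failure is
# also reported as "not found", so check connectivity separately.
blob_exists() {
  local code
  code="$(curl -s -o /dev/null -w '%{http_code}' "$(cas_url "$1" "$2")")"
  [ "$code" = "200" ]
}
```

For example: blob_exists http://my_cache 79a19f0ec3e5fc00ad50f7236e4b2c2cc13ed24e5ad5b3d9c05858b7c8e91c7c && echo present || echo missing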

Could you test with 8.4.2?

fmeum avatar Oct 11 '25 06:10 fmeum

Are you also adding no-cache and no-remote tags? It is possible that Bazel gets confused about what's available from where.

fmeum avatar Oct 11 '25 06:10 fmeum

Um, in my project, bazel aquery shows LTO Backend Compile actions are not affected by "no-cache", "no-remote". And, all upstream cc_library have remote cache enabled.

gdh1995 avatar Oct 11 '25 07:10 gdh1995

Would it be possible to create a standalone reproducer? I can run it against a remote cache.

fmeum avatar Oct 11 '25 08:10 fmeum

You may start a "remote cache server" locally using the bazel-remote software, and then:

  1. bazel clean; bazel build --remote_cache=http://localhost:port/ --features=thin_lto
  2. delete some files under <cache_dir>/cas.v2/...
  3. delete the build output and the generated bazel-bin/xxxxx.lto/ folder
  4. bazel build ... again to let bazel try downloading those files
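
The steps above could be scripted roughly as follows. This is only a sketch: it assumes a bazel-remote instance is already running and storing its data in CACHE_DIR, and the function names, the blob count of 50, and all arguments are placeholders. Deleting a handful of blobs may not hit the .imports file, so deleting everything under cas.v2 is the blunt way to force the miss.

```shell
#!/bin/bash
# Sketch of the reproduction steps; nothing runs until repro is called.

# evict_blobs CACHE_DIR COUNT: delete up to COUNT files under
# CACHE_DIR/cas.v2 to simulate eviction, printing each deleted path.
evict_blobs() {
  local dir="$1/cas.v2" count="$2"
  find "$dir" -type f | sort | head -n "$count" | while IFS= read -r f; do
    rm -f -- "$f" && printf '%s\n' "$f"
  done
}

# repro CACHE_URL CACHE_DIR TARGET PATH_TO_DELETE...
# PATH_TO_DELETE lists the build output and the bazel-bin/....lto/ folder.
repro() {
  local url="$1" dir="$2" target="$3"
  shift 3
  bazel clean                                                          # step 1
  bazel build --remote_cache="$url" --features=thin_lto "$target" || return 1
  evict_blobs "$dir" 50 >/dev/null                                     # step 2
  rm -rf -- "$@"                                                       # step 3
  bazel build --remote_cache="$url" --features=thin_lto "$target"      # step 4
}
```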

gdh1995 avatar Oct 11 '25 08:10 gdh1995

I haven't been able to reproduce on 8.4.2 - could you give that a try?

fmeum avatar Oct 13 '25 13:10 fmeum

It's missing from the remote cache. Bazel only has the digest and now needs to read the file, which leads to a request to the remote cache for the content.

How does bazel know the digest for a file it does not have?

I noticed that bazel-7 seems to support startup --digest_function=blake3. I wonder if perhaps the server is taking too long to hash the requested file with the default sha256? Will give that a try to see if it helps.

EDIT: startup --digest_function=blake3 did not help. Still observing this failure with it set.

How can I check my cache to see if the requested file does indeed exist, or not?

Could you test with 8.4.2? I haven't been able to reproduce on 8.4.2 - could you give that a try?

Thank you for attempting to reproduce.

I'm able to use bazelisk and bump my .bazelversion to 8.0.0 (or higher) to test, but it looks like numerous third party dependencies of mine are broken in bazel-8. We're also not yet on bzlmod, so upgrading those is going to be painful. I cannot test with 8.4.2 quickly due to those issues.

Any chance some retry logic could get backported to a 7.6.X release?

Are you also adding no-cache and no-remote tags?

Looks like our .bazelrc has:

build --modify_execution_info='CppLink.*=+no-remote'

Looking at the commit history on that line:

Linking is a super fast action when ran locally, and downloading
linked libraries / binaries from the remote cache can take
much longer.

Not sure if that's perhaps the problem? Should I try removing it?

nickdesaulniers avatar Oct 13 '25 20:10 nickdesaulniers

I'm even seeing this on non-LTO objects:

ERROR: <path>/BUILD.bazel:120:20: Compiling <path>.cpp failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 429cbc53f3bbdbb54f4314cd781530b01495e91c5d677499391a20fff9013058/13868 for <path>.h

I guess we probably won't be able to upgrade to bazel-7. Hopefully all of the work required to upgrade to bazel-8 pays off.

Is there anything I can do serverside to check for the existence of that file? I think the server only contains filenames that are shas.

nickdesaulniers avatar Oct 21 '25 17:10 nickdesaulniers

Is there any bazel command line flag that doesn't make this a hard error? As in, if there's a failure, just build whatever locally?

nickdesaulniers avatar Oct 24 '25 17:10 nickdesaulniers

Heh, I just pumped this thread into claude haiku 4.5. It seems to agree with this analysis after reading through the code (on release-7.6.2); I'm not a sophisticated enough bazel developer to know if this is true or just AI slop.

Is there any bazel command line flag that doesn't make this a hard error? As in, if there's a failure, just build whatever locally?

(Sounds like --remote_local_fallback=true is what I was looking for; I will give that a shot. I bet the risk is that it hides the case where the remote cache fails 100% of the time, or a large fraction of it, rather than just these small transient failures.)

FWIW:


Analysis: LTO Backend Imports File Remote Cache Issue

Report Validation: TRUE

The report is accurate. The user has correctly identified a real bug in Bazel 7.6.1 where LtoBackendAction::discoverInputs errors are not protected by the --experimental_remote_cache_eviction_retries retry mechanism.


Issue Summary

When building with LTO and remote caching enabled, LtoBackendAction::discoverInputs can fail with:

error reading imports file /workdir/workspace/repo/.cache/execroot/my_workspace/bazel-out/arm-opt/bin/path/to/libbar.so.lto/...: Missing digest: HASH/LENGTH for ...

This error is transient (caused by remote cache eviction) but is not retried by Bazel, even when --experimental_remote_cache_eviction_retries is set.


Root Cause Analysis

Error Flow

  1. LtoBackendAction reads an imports file (line 820 in LtoBackendAction.java):

    lines = FileSystemUtils.readLinesAsLatin1(importsFilePath);
    
  2. If the imports file's content references artifacts that have been evicted from remote cache, the discoverInputs method throws ActionExecutionException:

    throw new ActionExecutionException(message, e, this, false, code);
    
  3. This exception is caught in SkyframeActionExecutor::discoverInputs (line 890 in SkyframeActionExecutor.java):

    } catch (ActionExecutionException e) {
      // Input discovery failures may be caused by lost inputs...
      if (!(e instanceof LostInputsActionExecutionException)) {
        try {
          checkActionFileSystemForLostInputs(actionFileSystem, action, outputService);
        } catch (LostInputsActionExecutionException lostInputsException) {
          e = lostInputsException;
        }
      }
      // ...
      throw finalException;
    }
    

Why Retry Doesn't Work

The --experimental_remote_cache_eviction_retries flag is only handled at three specific locations:

  1. RemoteSpawnRunner (line 619):

    • When downloading remote execution outputs fails with CacheNotFoundException
    • Sets exit code to REMOTE_CACHE_EVICTED, which triggers command-level retry
  2. AbstractSpawnStrategy (line 295):

    • When prefetching inputs before local execution fails with BulkTransferException containing CacheNotFoundException
    • Sets exit code to REMOTE_CACHE_EVICTED, which triggers command-level retry
  3. BlazeCommandDispatcher (line 692):

    • Catches exit code REMOTE_CACHE_EVICTED and re-executes the entire build command

The Missing Link

LtoBackendAction::discoverInputs is called from ActionExecutionFunction::checkCacheAndExecuteIfNeeded (line 823):

try (SilentCloseable c = Profiler.instance().profile(ProfilerTask.INFO, "discoverInputs")) {
  state.discoveredInputs =
      skyframeActionExecutor.discoverInputs(
          action,
          actionLookupData,
          inputMetadataProvider,
          outputMetadataStore,
          env,
          state.actionFileSystem);
}

There is NO exception handling wrapping this call. If discoverInputs throws an ActionExecutionException that isn't a LostInputsActionExecutionException, it propagates directly as a fatal error.

The exception path is:

  • LtoBackendAction::discoverInputs → ActionExecutionException
  • SkyframeActionExecutor::discoverInputs → ActionExecutionException (possibly converted to LostInputsActionExecutionException)
  • ActionExecutionFunction::checkCacheAndExecuteIfNeeded → catches LostInputsActionExecutionException only (line 343)
  • Other ActionExecutionException → propagates as AlreadyReportedActionExecutionException (line 355)

Critical Issue: The error occurs when reading the imports file itself, not when processing lost inputs. The I/O error happens during file reading:

try {
  lines = FileSystemUtils.readLinesAsLatin1(importsFilePath);
} catch (IOException e) {
  String message = String.format("error reading imports file %s: %s", ...);
  DetailedExitCode code = createDetailedExitCode(message, Code.IMPORTS_READ_IO_EXCEPTION);
  throw new ActionExecutionException(message, e, this, false, code);  // ← IOException wrapped
}

This exception is wrapped as a plain ActionExecutionException with code IMPORTS_READ_IO_EXCEPTION, not detected as a lost input exception by the rewinding machinery.


Error Classification

The error "Missing digest" comes from CacheNotFoundException (file: remote/common/CacheNotFoundException.java line 50):

"Missing digest: " + missingDigest.getHash() + "/" + missingDigest.getSizeBytes();

This is wrapped in IOException when reading the imports file, but:

  • The original CacheNotFoundException is not re-thrown as LostInputsActionExecutionException
  • The Skyframe action executor doesn't recognize it as a lost input
  • The error doesn't trigger REMOTE_CACHE_EVICTED exit code
  • The build fails immediately without retry
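
Given that message format, the affected digests can be pulled out of a build log with a small filter. This is a sketch: extract_missing_digests is a made-up helper name, and it assumes lowercase hex hashes as in the error lines quoted in this thread.

```shell
#!/bin/bash
# extract_missing_digests: read Bazel output on stdin and print one
# "HASH SIZE" pair per line for every "Missing digest: HASH/SIZE" found,
# with duplicates removed. Lines without a match are dropped. If a line
# holds several digests, only the last one on that line is reported.
extract_missing_digests() {
  sed -n 's|.*Missing digest: \([0-9a-f][0-9a-f]*\)/\([0-9][0-9]*\).*|\1 \2|p' | sort -u
}
```

For example: bazel build //path/to:some_pkg 2>&1 | extract_missing_digests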

Workarounds

Workaround 1: Use --incompatible_remote_use_new_exit_code_for_lost_inputs (PARTIAL)

bazel build --remote_cache=http://my_cache \
  --incompatible_remote_use_new_exit_code_for_lost_inputs \
  //path/to:some_pkg

Effectiveness: ❌ Limited - This only applies to some spawn execution paths, not to input discovery.

Workaround 2: Reduce Remote Cache Evictions

Ensure the remote cache has sufficient capacity or retention policy:

  • Check cache size limits
  • Adjust cache retention policies
  • Ensure cache server is not evicting artifacts too aggressively

Effectiveness: ✓ Best practical workaround - Prevents the underlying issue

Workaround 3: Disable Remote Caching for LTO Outputs

Use --remote_upload_local_results=false to avoid uploading large LTO-generated artifacts:

bazel build --remote_cache=http://my_cache \
  --remote_upload_local_results=false \
  //path/to:some_pkg

Effectiveness: ⚠️ Partial - Reduces cache misses but still vulnerable to other artifacts being evicted.

Workaround 4: Use Local Fallback

bazel build --remote_cache=http://my_cache \
  --remote_local_fallback=true \
  //path/to:some_pkg

Effectiveness: ✓ Good - Falls back to local execution if remote cache fails, though may be slower.

Workaround 5: Retry at Script Level

Since the error should be transient, wrap the build in a retry loop:

#!/bin/bash
MAX_RETRIES=5
for i in $(seq 1 $MAX_RETRIES); do
  bazel build --remote_cache=http://my_cache //path/to:some_pkg && exit 0
  echo "Build attempt $i failed, retrying..."
  sleep 2
done
exit 1

Effectiveness: ✓ Good - Works around the bug, though not elegant.
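
A narrower variant retries only when the output actually contains "Missing digest:", so genuine compile errors still fail on the first attempt. This is a sketch: retry_on_missing_digest is a hypothetical helper, and it buffers each attempt's output in a temporary file, so the log is printed only after the attempt finishes. Whether a plain rerun reliably clears the error is not established here.

```shell
#!/bin/bash
# retry_on_missing_digest MAX_ATTEMPTS CMD [ARGS...]
# Runs CMD up to MAX_ATTEMPTS times. Retries only if CMD failed AND its
# combined stdout/stderr contains "Missing digest:". Returns CMD's last
# exit status.
retry_on_missing_digest() {
  local max="$1" attempt=1 status=1 log
  shift
  log="$(mktemp)"
  while [ "$attempt" -le "$max" ]; do
    status=0
    "$@" >"$log" 2>&1 || status=$?
    cat "$log"
    # Stop on success, or on any failure that is not a missing digest.
    if [ "$status" -eq 0 ] || ! grep -q 'Missing digest:' "$log"; then
      break
    fi
    echo "Attempt $attempt hit a missing digest, retrying..." >&2
    attempt=$((attempt + 1))
  done
  rm -f "$log"
  return "$status"
}
```

For example: retry_on_missing_digest 5 bazel build --remote_cache=http://my_cache //path/to:some_pkg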


Technical Details: Code Path

Current (Broken) Flow:

LtoBackendAction::discoverInputs()
  ├─ FileSystemUtils.readLinesAsLatin1(importsFilePath)
  │  └─ IOException (caused by CacheNotFoundException from remote fetch)
  └─ throw ActionExecutionException(IMPORTS_READ_IO_EXCEPTION)
     ↓
SkyframeActionExecutor::discoverInputs()
  └─ catch (ActionExecutionException)
     └─ if (!(e instanceof LostInputsActionExecutionException)) → NOT caught as lost input
        └─ throw ActionExecutionException
           ↓
ActionExecutionFunction::checkCacheAndExecuteIfNeeded()
  └─ catch (LostInputsActionExecutionException) → NOT matched
     └─ catch (ActionExecutionException)
        └─ throw AlreadyReportedActionExecutionException
           ↓
ActionExecutionFunction::compute()
  └─ throw as ActionExecutionFunctionException
     ↓
BUILD FAILS (no retry)

Expected (Fixed) Flow:

The error should either:

  1. Be caught at input discovery level and trigger LostInputsActionExecutionException handling
  2. Or set exit code to REMOTE_CACHE_EVICTED before reaching Skyframe evaluation
  3. This would trigger command-level retry in BlazeCommandDispatcher

Why This is a Bug

  1. Inconsistent retry behavior: Remote cache errors during prefetchInputs() are retried, but those during discoverInputs() are not

  2. Lost inputs detection doesn't apply: The LostInputsActionExecutionException mechanism is designed to handle this but doesn't recognize input discovery failures as lost input errors

  3. User expectation mismatch: Setting --experimental_remote_cache_eviction_retries=5 explicitly does NOT protect input discovery, contradicting the flag's documentation

  4. Possibly not limited to LTO: other actions that discover inputs, such as C++ compile actions (CppCompileAction::discoverInputs), might also be affected


Fix Strategy (for Bazel team)

The fix should be in SkyframeActionExecutor::discoverInputs() to properly detect and handle CacheNotFoundException wrapped in IOException:

// Pseudo-code fix
} catch (ActionExecutionException e) {
  if (e.getCause() instanceof BulkTransferException bulkError) {
    if (BulkTransferException.allCausedByCacheNotFoundException(bulkError)) {
      // Treat as lost input for retry purposes
      // Convert to LostInputsActionExecutionException or set appropriate exit code
    }
  }
  // ... existing handling
}

Conclusion

Report is Valid: This is a confirmed bug in Bazel 7.6.1

No Direct Flag Workaround: --experimental_remote_cache_eviction_retries does not help with input discovery errors

Best Practical Workaround:

  • Ensure remote cache stability and capacity
  • Use --remote_local_fallback=true for resilience
  • Wrap builds in retry scripts at the process level

Error is Transient: Can be successfully retried, Bazel just doesn't do it automatically for input discovery

nickdesaulniers avatar Oct 24 '25 18:10 nickdesaulniers

The AI assessment is pretty accurate, but only applies to the Bazel 7 branch. Bazel 8 and higher should not be affected.

fmeum avatar Oct 24 '25 18:10 fmeum

Oh no, we observe this failure even with --remote_local_fallback=true set.

nickdesaulniers avatar Oct 27 '25 15:10 nickdesaulniers