
Tree artifact up-to-dateness check can be very slow for large tree artifacts

lberki opened this issue 2 years ago

Description of the bug:

When checking whether a local action cache entry is up-to-date, Bazel takes a long time on actions that have large tree artifacts among their inputs. The stack trace while Bazel is busy with this check is:

   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(Unknown Source)
        at com.google.common.io.ByteStreams.copy(ByteStreams.java:114)
        at com.google.common.io.ByteSource.copyTo(ByteSource.java:257)
        at com.google.common.io.ByteSource.hash(ByteSource.java:340)
        at com.google.devtools.build.lib.vfs.FileSystem.getDigest(FileSystem.java:339)
        at com.google.devtools.build.lib.unix.UnixFileSystem.getDigest(UnixFileSystem.java:452)
        at com.google.devtools.build.lib.vfs.Path.getDigest(Path.java:690)
        at com.google.devtools.build.lib.vfs.DigestUtils.manuallyComputeDigest(DigestUtils.java:194)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.constructFileArtifactValue(ActionMetadataHandler.java:564)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.constructFileArtifactValueFromFilesystem(ActionMetadataHandler.java:496)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.lambda$constructTreeArtifactValueFromFilesystem$0(ActionMetadataHandler.java:354)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler$$Lambda$1121/0x0000000800857040.visit(Unknown Source)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:411)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:393)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.constructTreeArtifactValueFromFilesystem(ActionMetadataHandler.java:342)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.getTreeArtifactValue(ActionMetadataHandler.java:317)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.getMetadata(ActionMetadataHandler.java:265)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.getMetadataOrConstant(ActionCacheChecker.java:566)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.getMetadataMaybe(ActionCacheChecker.java:579)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.validateArtifacts(ActionCacheChecker.java:207)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.mustExecute(ActionCacheChecker.java:541)

My theory is that this is because the visitation happens on a single thread in TreeArtifactValue.visitTree() when called from ActionMetadataHandler.constructTreeArtifactValueFromFilesystem().
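
For illustration, here is a minimal sketch of the pattern the stack trace suggests (not Bazel's actual code; class and method names are invented, and SHA-256 stands in for whatever digest function is configured): one thread recursively walks the tree and digests each file in turn, so total latency is the sum of every file's read time with no I/O overlap.

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SequentialTreeDigest {
  // Recursively visit the tree on the calling thread, mirroring the
  // repeated visitTree frames in the stack trace above.
  static void visitTree(Path dir) throws IOException {
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        if (Files.isDirectory(entry)) {
          visitTree(entry);
        } else {
          digest(entry); // blocks the lone thread for the whole read
        }
      }
    }
  }

  private static byte[] digest(Path file) throws IOException {
    try {
      return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }
}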

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run the following script, which sets up a minimal repro (a WORKSPACE file, a BUILD file, a rule definition, and a generator script) and then builds it:

touch WORKSPACE
mkdir -p r
cat > r/BUILD <<'EOF'
load(":r.bzl", "r")

r(name = "ta")

genrule(
    name = "c",
    srcs = [":ta"],
    outs = ["co"],
    cmd = "find $(location :ta) > $@",
)

sh_binary(
    name = "gen",
    srcs = ["gen.sh"],
)
EOF

cat > r/r.bzl << 'EOF'
def _r_impl(ctx):
    ta = ctx.actions.declare_directory("d")
    ctx.actions.run(
        outputs = [ta],
        inputs = [],
        executable = ctx.executable._gen,
        arguments = [ta.path],
    )
    return [DefaultInfo(files = depset([ta]))]

r = rule(
    implementation = _r_impl,
    attrs = {
        "_gen": attr.label(default = "//r:gen", executable = True, cfg = "exec"),
    },
)
EOF

cat > r/gen.sh <<'EOF'
#!/bin/bash

OUT="$1"
mkdir -p "$OUT"

for i in $(seq 1 10); do
  for j in $(seq 1 10); do
    mkdir -p "$OUT/$i/$j"
    for k in $(seq 1 100); do
      #echo "$i $j $k" > "$OUT/$i/$j/$k"
      dd if=/dev/random of="$OUT/$i/$j/$k" bs=1024 count=1024
    done
  done
done
echo hello > "$OUT/hello"
EOF

chmod +x r/gen.sh

bazel build //r:c
bazel shutdown
bazel build //r:c  # This is slow

Which operating system are you running Bazel on?

Linux @ Google

What is the output of bazel info release?

development version

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

From git commit de4746da2394e49d028b5327ee5e65c6b60011cc.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

lberki avatar Dec 13 '22 15:12 lberki

@coeuvre Might this be a possible application of Loom?

zhengwei143 avatar Dec 20 '22 10:12 zhengwei143

Perhaps io_uring might also help here?

zhengwei143 avatar Dec 20 '22 10:12 zhengwei143

> @coeuvre Might this be a possible application of Loom?

Probably, by visiting the tree concurrently using virtual threads. It's not easy to do today because we have already created many platform threads.
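
As a rough illustration of that idea, assuming Java 21+ where virtual threads are final, the visitation could give each file its own virtual thread so blocking reads overlap instead of serializing. This is a sketch of the concept with invented names, not a proposed patch to TreeArtifactValue:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

final class VirtualThreadTreeDigest {
  static Map<Path, byte[]> digestTree(Path root) throws IOException {
    Map<Path, byte[]> digests = new ConcurrentHashMap<>();
    // Virtual threads are cheap enough to spawn one per file; close() on
    // the executor implicitly waits for all submitted tasks to finish.
    try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor();
        Stream<Path> files = Files.walk(root)) {
      files.filter(Files::isRegularFile)
          .forEach(f -> pool.submit(() -> digests.put(f, digest(f))));
    }
    return digests;
  }

  private static byte[] digest(Path file) {
    try {
      return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
    } catch (IOException | NoSuchAlgorithmException e) {
      throw new RuntimeException(e);
    }
  }
}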

> Perhaps io_uring might also help here?

Definitely, but integrating io_uring into Bazel is another big project.

coeuvre avatar Jan 02 '23 14:01 coeuvre

Created https://github.com/bazelbuild/bazel/pull/17943 as a prototype to fix this.

lberki avatar Mar 31 '23 09:03 lberki

@bazel-io fork 6.2.0

keertk avatar Apr 25 '23 14:04 keertk

We ran into this again. This time, the stack traces show:

"tree-artifact-visitor-46" #1800 [144416] daemon prio=5 os_prio=0 cpu=28027.44ms elapsed=12463.90s tid=0x00007f
b740001e70 nid=144416 runnable  [0x00007fbb95efe000]
   java.lang.Thread.State: RUNNABLE
        at com.google.devtools.build.lib.vfs.inmemoryfs.InMemoryFileSystem.stat(InMemoryFileSystem.java:360)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.statUnchecked(RemoteActionFileSystem.java:612)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.resolveOneLink(RemoteActionFileSystem.java:533)
        at com.google.devtools.build.lib.vfs.FileSystem.appendSegment(FileSystem.java:377)
        at com.google.devtools.build.lib.vfs.FileSystem.resolveSymbolicLinks(FileSystem.java:445)
<the same frame repeated about ten times>
        at com.google.devtools.build.lib.vfs.FileSystem.resolveSymbolicLinks(FileSystem.java:445)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.statUnchecked(RemoteActionFileSystem.java:594)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.stat(RemoteActionFileSystem.java:582)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.statIfFound(RemoteActionFileSystem.java:564)
        at com.google.devtools.build.lib.vfs.Path.statIfFound(Path.java:322)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore.fileArtifactValueFromArtifact(ActionOutputMetadataStore.java:599)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore.constructFileArtifactValue(ActionOutputMetadataStore.java:461)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore.constructFileArtifactValueFromFilesystem(ActionOutputMetadataStore.java:449)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore.lambda$constructTreeArtifactValueFromFilesystem$0(ActionOutputMetadataStore.java:298)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore$$Lambda/0x00000008009c39f0.visit(Unknown Source)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue$Visitor.visitTree(TreeArtifactValue.java:548)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue$Visitor.lambda$visitTree$1(TreeArtifactValue.java:551)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue$Visitor$$Lambda/0x00000008009c4508.run(Unknown Source)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)

This indicates that RemoteActionFileSystem is slow even when called from multiple threads.

lberki avatar Feb 05 '24 12:02 lberki

I believe there are two distinct problems here:

  1. RemoteActionFileSystem#resolveSymbolicLinks is slow because it stats paths on the underlying filesystem one component at a time and there is no cache. So if you have a tree artifact with N files where every file's path has M components, collecting the TreeArtifactValue incurs O(N*M) stats. Adding a cache is possible (a sketch follows this list), but the invalidation logic is nontrivial: we cannot assume that a path statted in the past won't change in the future, because files can be deleted or moved around within an action execution.
  2. TreeArtifactValue.visitTree parallelizes over subdirectories, not individual files. So if the tree artifact has a very large subdirectory with N files (or is a single flat directory, in the worst case), collecting the TreeArtifactValue is still as slow as O(N) sequential stats. A sketch contrasting the two granularities follows the measurements below.
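
To make point 1 concrete, here is a hypothetical sketch of such a cache, with all names invented for illustration: it memoizes the resolved form of every path prefix, so resolving N paths of M components stats each distinct directory entry roughly once instead of O(N*M) times. The hard parts called out above, invalidation when files are deleted or moved mid-execution and symlink cycle detection, are deliberately omitted.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

final class DentryCache {
  // Not thread-safe; a real cache would also need concurrency control
  // and an invalidation story.
  private final Map<Path, Path> resolvedPrefixes = new HashMap<>();

  // Returns `path` with symlinks resolved, reusing cached resolutions for
  // every prefix shared with previously seen paths.
  Path resolve(Path path) throws IOException {
    Path parent = path.getParent();
    if (parent == null) {
      return path; // the filesystem root resolves to itself
    }
    Path resolvedParent = resolvedPrefixes.get(parent);
    if (resolvedParent == null) {
      resolvedParent = resolve(parent); // at most one miss per distinct prefix
      resolvedPrefixes.put(parent, resolvedParent);
    }
    return resolveOneComponent(resolvedParent, path.getFileName());
  }

  // One stat + symlink chase for a single path component.
  private Path resolveOneComponent(Path dir, Path name) throws IOException {
    Path child = dir.resolve(name);
    if (Files.isSymbolicLink(child)) {
      // Resolve the link target relative to its directory, then resolve any
      // symlinks inside the target itself (cycle detection omitted).
      return resolve(dir.resolve(Files.readSymbolicLink(child)).normalize());
    }
    return child;
  }
}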

I recently did some measurements on a large tree artifact containing 240k files totaling 12GB and large subdirectories with ~4k files (internal b/323077002 has the details) and I determined the following:

  • Currently, TreeArtifactValue.visitTree takes ~40s (~160us/file)
  • Adding a "dentry cache" for resolveSymbolicLinks brings it down to ~9s (disclaimer: prototype quality, doing it properly with correct invalidation might be slower)
  • Parallelizing across files instead of subdirectories (in addition to the dentry cache) brings it down to ~6s (~25us/file)
  • However, parallelizing across files does nothing on its own (it only compounds with the other optimization)

After applying both optimizations, the trace profile shows that visitTree is able to use all of the available cores for most of its execution, which suggests the two optimizations are enough to avoid blocking on I/O; further gains might be possible on the CPU side, but I haven't looked into that closely.

This is on a 32-core machine with an SSD, so numbers may vary depending on available parallelism and I/O performance.
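
For illustration, here is a sketch of the granularity difference behind those numbers, using a generic thread pool rather than Bazel's actual visitor (names invented): with per-subdirectory tasks, one huge flat directory serializes onto a single thread, while per-file tasks divide the work evenly regardless of the tree's shape.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.ExecutorService;
import java.util.stream.Stream;

final class VisitGranularity {
  // One task per immediate child: a flat directory of N files becomes a
  // single task that stats all N files sequentially.
  static void perSubdirectory(Path root, ExecutorService pool) throws IOException {
    try (Stream<Path> children = Files.list(root)) {
      children.forEach(child -> pool.submit(() -> statAll(child)));
    }
  }

  // One task per regular file: the work divides evenly no matter how the
  // tree is shaped.
  static void perFile(Path root, ExecutorService pool) throws IOException {
    try (Stream<Path> files = Files.walk(root)) {
      files.filter(Files::isRegularFile).forEach(f -> pool.submit(() -> stat(f)));
    }
  }

  private static void statAll(Path subtree) {
    try (Stream<Path> files = Files.walk(subtree)) {
      files.filter(Files::isRegularFile).forEach(VisitGranularity::stat);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private static void stat(Path file) {
    try {
      Files.readAttributes(file, BasicFileAttributes.class); // stand-in for stat + digest
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}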

tjgq avatar Feb 05 '24 14:02 tjgq

I think this has now been optimized about as much as it reasonably can be.

tjgq avatar Feb 14 '24 23:02 tjgq