
Materializer chokes on uploading action cache artifacts that exceed default 4MB gRPC limit

Open thoughtpolice opened this issue 2 years ago • 5 comments

If you follow along with the saga in https://github.com/thoughtpolice/buck2-nix/issues/12, I have almost got remote execution working on buildbarn with Nix! Except that the tarball I download seems to be a tad too large (error formatted for legibility in the GitHub UI):

Action failed: prelude//toolchains:nixpkgs-overlay-rust (download_tarball https://github.com/oxalica/rust-overlay/archive/1373567ffd13719f6b7522737b010bfc514d49b4.tar.gz)

Internal error (stage: download): action_digest=c8b86439a7e637e7c7d98cf2addf516a771ed82f:93:
Failed to declare in materializer: status: InvalidArgument, message:

"Attempted to read a total of at least 23508808 bytes, while a maximum of 16777216 bytes is permitted", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }

I can see that the action itself completed successfully in the buildbarn action cache UI, and that the file downloaded (given in the error above, from rust-overlay) is 23508808 bytes in size:

[screenshot: buildbarn action cache UI showing the completed action]

I'm not sure how to approach this. Maybe it's a gRPC limit? The CacheCapabilities don't exactly specify anything about this:

austin@GANON:~/src/reapi-server$ grpcurl -plaintext '127.0.0.1:8980' build.bazel.remote.execution.v2.Capabilities/GetCapabilities
{
  "cacheCapabilities": {
    "digestFunctions": [
      "MD5",
      "SHA1",
      "SHA256",
      "SHA256TREE",
      "SHA384",
      "SHA512"
    ],
    "actionCacheUpdateCapabilities": {
      "updateEnabled": true
    },
    "symlinkAbsolutePathStrategy": "ALLOWED"
  },
  "executionCapabilities": {
    "digestFunction": "SHA256",
    "execEnabled": true,
    "executionPriorityCapabilities": {
      "priorities": [
        {
          "minPriority": -2147483648,
          "maxPriority": 2147483647
        }
      ]
    },
    "digestFunctions": [
      "MD5",
      "SHA1",
      "SHA256",
      "SHA256TREE",
      "SHA384",
      "SHA512"
    ]
  },
  "lowApiVersion": {
    "major": 2
  },
  "highApiVersion": {
    "major": 2
  }
}
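(Worth noting: the spec's way of advertising a batch limit is `CacheCapabilities.max_batch_total_size_bytes`, which this server leaves unset, i.e. 0, meaning a client can only assume the transport's own message cap applies. A hypothetical sketch of how a client might pick an upload path from that field — the names here are illustrative, not buck2's actual code; the 4 MiB figure is gRPC's default receive limit, not anything buildbarn-specific:)

```python
# Hypothetical sketch: choose an upload strategy from advertised
# CacheCapabilities. max_batch_total_size_bytes == 0 means "no limit
# advertised", in which case only the transport's message cap applies.
GRPC_DEFAULT_MAX_MESSAGE = 4 * 1024 * 1024  # gRPC's default receive limit

def pick_upload_path(blob_size: int, max_batch_total_size_bytes: int = 0) -> str:
    """Return 'batch' for BatchUpdateBlobs, 'bytestream' for ByteStream.Write."""
    limit = max_batch_total_size_bytes or GRPC_DEFAULT_MAX_MESSAGE
    # Leave a little headroom for the request envelope around the payload.
    return "batch" if blob_size <= limit - 1024 else "bytestream"

print(pick_upload_path(23508808))  # the rust-overlay tarball from the error above
```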

thoughtpolice avatar Apr 15 '23 16:04 thoughtpolice

The suspiciously round 16 MiB limit was curious, so I poked around, and I was able to work around this error by increasing the maximum gRPC message size in buildbarn:

diff --git a/buck/nix/bb/docker/config/common.libsonnet b/buck/nix/bb/docker/config/common.libsonnet
--- a/buck/nix/bb/docker/config/common.libsonnet
+++ b/buck/nix/bb/docker/config/common.libsonnet
@@ -38,7 +38,7 @@
   },
   browserUrl: 'http://localhost:7984',
   httpListenAddress: ':80',
-  maximumMessageSizeBytes: 16 * 1024 * 1024,
+  maximumMessageSizeBytes: 64 * 1024 * 1024,
   global: {
     diagnosticsHttpServer: {
       listenAddress: ':9980',

Surely there's a better way in the reapi to download large action cache blobs incrementally without bloating the individual gRPC message limit, though...

thoughtpolice avatar Apr 15 '23 17:04 thoughtpolice

Hmm, I will double check but I don’t recall the CAS GRPC API exposing range reads so it’s possible that raising this limit is indeed the right approach.


krallin avatar Apr 15 '23 18:04 krallin

I tripped up on something else now too: the message size decoding limit. ~~Looks like it's trying to download the entirety of the output artifact, which is about 160MB, while the default gRPC limit in tonic is 4MB(!) A bump to tonic 0.9.0+ is needed to fix that, I think~~:

Action failed: prelude//toolchains/rust:rust-stable (nix_build rust-stable-rust-stable.nix)

Internal error (stage: remote_upload_error): Remote Execution Error (GRPC-SESSION-ID):
RE: upload: status: ResourceExhausted, message: "grpc: received message larger than max (165245018 vs. 4194304)", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }

Testing a patch for that now, possibly...

thoughtpolice avatar Apr 15 '23 18:04 thoughtpolice

Ah, it's actually a BuildBarn error, I think. The problem isn't downloading (which is what a decoding-size fix would address); it's uploading: https://grep.app/search?q=grpc%3A%20received%20message%20larger%20than%20max

So I think the problem is actually writing smaller blobs to the store, not reading things out of it...
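(A sketch of what "writing smaller blobs" could look like client-side: greedily packing blobs into batched requests that each stay under the server's message cap. This is a hypothetical helper for illustration, not buck2's actual code, and the 16 MiB default just mirrors buildbarn's limit from the earlier error:)

```python
# Hypothetical sketch: greedily group small blobs into BatchUpdateBlobs
# requests so no single request exceeds the server's message cap.
# Blobs larger than the cap can't be batched at all and would have to
# go through ByteStream instead.
def group_into_batches(blob_sizes, max_batch_bytes=16 * 1024 * 1024):
    batches, current, current_size = [], [], 0
    for i, size in enumerate(blob_sizes):
        if size > max_batch_bytes:
            raise ValueError(f"blob {i} ({size} bytes) needs ByteStream")
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        batches.append(current)
    return batches
```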

thoughtpolice avatar Apr 15 '23 18:04 thoughtpolice

Ah, right. So the spec says https://github.com/thoughtpolice/reapi-server/blob/eae2e3a51bb8053ae7fe290fb1e9fedb462b651a/protos/build/bazel/remote/execution/v2/remote_execution.proto#L205-L209

// For small file uploads the client should group them together and call
// [BatchUpdateBlobs][build.bazel.remote.execution.v2.ContentAddressableStorage.BatchUpdateBlobs].
//
// For large uploads, the client must use the
// [Write method][google.bytestream.ByteStream.Write] of the ByteStream API.

So it instead needs to write content to google.bytestream.ByteStream, which you can verify exists on BuildBarn:

austin@GANON:~/src/reapi-server$ grpcurl -plaintext '127.0.0.1:8980' describe | grep 'is a service'
build.bazel.remote.execution.v2.ActionCache is a service:
build.bazel.remote.execution.v2.Capabilities is a service:
build.bazel.remote.execution.v2.ContentAddressableStorage is a service:
build.bazel.remote.execution.v2.Execution is a service:
google.bytestream.ByteStream is a service:
grpc.health.v1.Health is a service:
grpc.reflection.v1alpha.ServerReflection is a service:
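(For the large-blob path, the REAPI spec describes ByteStream.Write uploads addressed by a resource name of the form `{instance}/uploads/{uuid}/blobs/{hash}/{size}`, with the payload split into `WriteRequest`s whose `write_offset` advances through the blob and `finish_write` set on the last one. A rough sketch of that framing — the dicts stand in for the real protobuf messages, and the 64 KiB chunk size is an arbitrary illustrative choice:)

```python
import hashlib
import uuid

CHUNK_SIZE = 64 * 1024  # arbitrary; well under any sane gRPC message cap

def bytestream_write_requests(data: bytes, instance_name: str = ""):
    """Yield-ish sketch of the WriteRequest stream for one blob."""
    digest = hashlib.sha256(data).hexdigest()
    prefix = f"{instance_name}/" if instance_name else ""
    resource = f"{prefix}uploads/{uuid.uuid4()}/blobs/{digest}/{len(data)}"
    requests = []
    for offset in range(0, max(len(data), 1), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        requests.append({
            # per the spec, resource_name need only be set on the first request
            "resource_name": resource if offset == 0 else "",
            "write_offset": offset,
            "data": chunk,
            "finish_write": offset + CHUNK_SIZE >= len(data),
        })
    return requests
```

Each dict would become one message on the client-streaming `Write` RPC, so no single message ever approaches the server's receive limit.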

thoughtpolice avatar Apr 15 '23 19:04 thoughtpolice

It does look like we should use this API, yeah (for reads & writes). The RE bits are arguably MVP-level right now since we do not use them internally.

Is that something you'd want to do yourself, since you've been debugging it? Otherwise I'll try to see if I can find the time.

The code for all this is in remote_execution/oss/re_grpc/src/client.rs. This will also require adding the ByteStream service protobuf definitions at remote_execution/oss/re_grpc_proto/proto/google/bytestream/bytestream.proto

krallin avatar Apr 17 '23 17:04 krallin

Yeah, I'll try giving it a shot, since the code looks fairly isolated.

thoughtpolice avatar Apr 17 '23 22:04 thoughtpolice

I just "solved" this on the buildbarn side by setting maximumReceivedMessageSizeBytes under the grpc server section (note: not the same as maximumMessageSizeBytes).

It's rather slow, so not a total fix, but it's something.

benbrittain avatar Apr 21 '23 19:04 benbrittain

#193 has been merged, which should address this.

benbrittain avatar May 02 '23 19:05 benbrittain