Materializer chokes on uploading action cache artifacts that exceed default 4MB gRPC limit
If you follow along with the saga in https://github.com/thoughtpolice/buck2-nix/issues/12, I have almost got remote execution working on buildbarn with Nix! Except that the tarball I download seems to be a tad too large (formatted for legibility in the GitHub UI):
Action failed: prelude//toolchains:nixpkgs-overlay-rust (download_tarball https://github.com/oxalica/rust-overlay/archive/1373567ffd13719f6b7522737b010bfc514d49b4.tar.gz)
Internal error (stage: download): action_digest=c8b86439a7e637e7c7d98cf2addf516a771ed82f:93:
Failed to declare in materializer: status: InvalidArgument, message:
"Attempted to read a total of at least 23508808 bytes, while a maximum of 16777216 bytes is permitted", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }
I can see that the action itself completed successfully in the buildbarn action cache UI, and that the file downloaded (given in the error above, from rust-overlay) is 23508808 bytes in size:
I'm not sure how to approach this. Maybe it's a gRPC limit? The CacheCapabilities don't exactly specify anything about this:
austin@GANON:~/src/reapi-server$ grpcurl -plaintext '127.0.0.1:8980' build.bazel.remote.execution.v2.Capabilities/GetCapabilities
{
"cacheCapabilities": {
"digestFunctions": [
"MD5",
"SHA1",
"SHA256",
"SHA256TREE",
"SHA384",
"SHA512"
],
"actionCacheUpdateCapabilities": {
"updateEnabled": true
},
"symlinkAbsolutePathStrategy": "ALLOWED"
},
"executionCapabilities": {
"digestFunction": "SHA256",
"execEnabled": true,
"executionPriorityCapabilities": {
"priorities": [
{
"minPriority": -2147483648,
"maxPriority": 2147483647
}
]
},
"digestFunctions": [
"MD5",
"SHA1",
"SHA256",
"SHA256TREE",
"SHA384",
"SHA512"
]
},
"lowApiVersion": {
"major": 2
},
"highApiVersion": {
"major": 2
}
}
The suspiciously round 16 MiB limit made me poke around, and I was able to work around this error by increasing the gRPC message size limit in buildbarn, it looks like:
diff --git a/buck/nix/bb/docker/config/common.libsonnet b/buck/nix/bb/docker/config/common.libsonnet
--- a/buck/nix/bb/docker/config/common.libsonnet
+++ b/buck/nix/bb/docker/config/common.libsonnet
@@ -38,7 +38,7 @@
},
browserUrl: 'http://localhost:7984',
httpListenAddress: ':80',
- maximumMessageSizeBytes: 16 * 1024 * 1024,
+ maximumMessageSizeBytes: 64 * 1024 * 1024,
global: {
diagnosticsHttpServer: {
listenAddress: ':9980',
Surely there's a better way in the reapi to download large action cache blobs incrementally without bloating the individual gRPC message limit, though...
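For what it's worth, REAPI clients typically choose between the batched CAS RPCs and the ByteStream API based on blob size relative to the server's message cap. A minimal sketch of that decision (not buck2's actual code; `pick_transfer_strategy` and the `OVERHEAD` headroom are hypothetical names I've made up for illustration):

```rust
// Sketch: decide whether a blob fits in a single batched gRPC message
// or needs chunked streaming via google.bytestream.ByteStream.

#[derive(Debug, PartialEq)]
enum TransferStrategy {
    /// BatchReadBlobs / BatchUpdateBlobs: whole blob in one gRPC message.
    Batch,
    /// ByteStream Read/Write: blob split across many streamed messages.
    ByteStream,
}

fn pick_transfer_strategy(blob_len: u64, max_message_bytes: u64) -> TransferStrategy {
    // Leave some headroom for the enclosing request/response's own fields.
    const OVERHEAD: u64 = 1024;
    if blob_len.saturating_add(OVERHEAD) <= max_message_bytes {
        TransferStrategy::Batch
    } else {
        TransferStrategy::ByteStream
    }
}

fn main() {
    // The blob from the error above: 23508808 bytes vs buildbarn's 16 MiB cap.
    let s = pick_transfer_strategy(23_508_808, 16 * 1024 * 1024);
    assert_eq!(s, TransferStrategy::ByteStream);
}
```

With a strategy like this, raising `maximumMessageSizeBytes` only moves the threshold; any blob can still exceed it, which is why the streaming path exists.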
Hmm, I will double check but I don’t recall the CAS GRPC API exposing range reads so it’s possible that raising this limit is indeed the right approach.
I tripped up on something else now too: the message size decoding limit. ~~Looks like it's trying to download the entirety of the output artifact, which is about 160MB, while the gRPC default limit in tonic is 4MB(!) A bump of tonic to 0.9.0+ is needed to fix that, I think~~:
Action failed: prelude//toolchains/rust:rust-stable (nix_build rust-stable-rust-stable.nix)
Internal error (stage: remote_upload_error): Remote Execution Error (GRPC-SESSION-ID):
RE: upload: status: ResourceExhausted, message: "grpc: received message larger than max (165245018 vs. 4194304)", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }
Testing a patch for that now, possibly...
Ah, it's actually a BuildBarn error, I think. The problem isn't downloading (which the decoding-size fix would address), it's uploading: https://grep.app/search?q=grpc%3A%20received%20message%20larger%20than%20max
So I think the problem there is actually writing smaller blobs to the store, not reading things out of it...
Ah, right. So the spec says https://github.com/thoughtpolice/reapi-server/blob/eae2e3a51bb8053ae7fe290fb1e9fedb462b651a/protos/build/bazel/remote/execution/v2/remote_execution.proto#L205-L209
// For small file uploads the client should group them together and call
// [BatchUpdateBlobs][build.bazel.remote.execution.v2.ContentAddressableStorage.BatchUpdateBlobs].
//
// For large uploads, the client must use the
// [Write method][google.bytestream.ByteStream.Write] of the ByteStream API.
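To make the streaming path concrete, here's a minimal sketch of how a client might split a large blob into ByteStream `Write` messages. The `WriteRequest` struct below is a plain stand-in mirroring the fields of `google.bytestream.WriteRequest`; a real client would use the tonic-generated proto type and stream these over gRPC:

```rust
// Sketch: chunk a blob into ByteStream-style Write requests.
#[derive(Debug)]
struct WriteRequest {
    resource_name: String, // per the spec, required on the first message only
    write_offset: i64,
    finish_write: bool,
    data: Vec<u8>,
}

fn chunk_blob(resource_name: &str, blob: &[u8], chunk_size: usize) -> Vec<WriteRequest> {
    // An empty blob still needs one request carrying finish_write = true.
    let chunks: Vec<&[u8]> = if blob.is_empty() {
        vec![&blob[..]]
    } else {
        blob.chunks(chunk_size).collect()
    };
    let last = chunks.len() - 1;
    let mut offset = 0usize;
    let mut requests = Vec::new();
    for (i, chunk) in chunks.iter().enumerate() {
        requests.push(WriteRequest {
            resource_name: if i == 0 { resource_name.to_string() } else { String::new() },
            write_offset: offset as i64,
            finish_write: i == last,
            data: chunk.to_vec(),
        });
        offset += chunk.len();
    }
    requests
}

fn main() {
    let blob = vec![0u8; 10];
    let reqs = chunk_blob("main/uploads/some-uuid/blobs/abc/10", &blob, 4);
    assert_eq!(reqs.len(), 3);
    assert!(reqs[2].finish_write);
}
```

Each message stays well under the gRPC cap regardless of how large the blob is, which is exactly the property the batched RPCs lack.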
So it instead needs to write content to google.bytestream.ByteStream, which you can verify exists on BuildBarn:
austin@GANON:~/src/reapi-server$ grpcurl -plaintext '127.0.0.1:8980' describe | grep 'is a service'
build.bazel.remote.execution.v2.ActionCache is a service:
build.bazel.remote.execution.v2.Capabilities is a service:
build.bazel.remote.execution.v2.ContentAddressableStorage is a service:
build.bazel.remote.execution.v2.Execution is a service:
google.bytestream.ByteStream is a service:
grpc.health.v1.Health is a service:
grpc.reflection.v1alpha.ServerReflection is a service:
It does look like we should use this API, yeah (for reads & writes). The RE bits are arguably MVP-level right now since we do not use them internally.
Is that something you'd want to do yourself, since you've been debugging it? Otherwise I'll try to see if I can find the time.
The code for all this is in remote_execution/oss/re_grpc/src/client.rs. This will also require adding the ByteStream service protobuf definitions at remote_execution/oss/re_grpc_proto/proto/google/bytestream/bytestream.proto
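One fiddly detail for whoever picks this up: the REAPI spec prescribes specific ByteStream resource-name shapes (uploads look like `{instance}/uploads/{uuid}/blobs/{hash}/{size}`, reads like `{instance}/blobs/{hash}/{size}`, with the instance prefix omitted when the instance name is empty). A sketch of helpers for that, under those assumptions:

```rust
// Sketch: build ByteStream resource names per the REAPI spec.
// The client picks the upload uuid so interrupted uploads can be resumed.

fn upload_resource_name(instance: &str, uuid: &str, hash: &str, size: u64) -> String {
    if instance.is_empty() {
        format!("uploads/{uuid}/blobs/{hash}/{size}")
    } else {
        format!("{instance}/uploads/{uuid}/blobs/{hash}/{size}")
    }
}

fn read_resource_name(instance: &str, hash: &str, size: u64) -> String {
    if instance.is_empty() {
        format!("blobs/{hash}/{size}")
    } else {
        format!("{instance}/blobs/{hash}/{size}")
    }
}

fn main() {
    // Digest from the download_tarball error earlier in the thread.
    let name = read_resource_name("", "c8b86439a7e637e7c7d98cf2addf516a771ed82f", 93);
    println!("{name}");
}
```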
Yeah, I'll try giving it a shot, since the code looks fairly isolated.
I just "solved" this on the buildbarn side by setting maximumReceivedMessageSizeBytes under the grpc server section (not the same as maximumMessageSizeBytes).
It's rather slow, so not a total fix, but it's something.
#193 has been merged, which should address this.