"Big blob support" proposals summary
Currently, we have two open proposals that both aim to improve support for big blobs in Remote APIs.
Here is a quick summary of the two proposals; please give both of them a read.
Split and Splice RPCs
A new set of RPCs is introduced as an extension of the existing CAS service. Servers advertise whether they support the extension via new boolean fields in the `CacheCapabilities` message.
```proto
service ContentAddressableStorage {
  ...
  rpc SplitBlob(SplitBlobRequest) returns (SplitBlobResponse) {}
  rpc SpliceBlob(SpliceBlobRequest) returns (SpliceBlobResponse) {}
}

message CacheCapabilities {
  ...
  repeated ChunkingAlgorithm.Value supported_chunking_algorithms = 8;
  bool blob_split_support = 9;
  bool blob_splice_support = 10;
}
```
If the client has a big blob digest, it can call Split() to get back a list of chunks from the server.
```proto
message SplitBlobRequest {
  ...
  Digest blob_digest = 2;
  ...
}

message SplitBlobResponse {
  repeated Digest chunk_digests = 1;
  ...
}
```
If the client has split a big blob by itself into chunks, it can call Splice() to tell the server to put those chunks together into a big blob.
```proto
message SpliceBlobRequest {
  ...
  // Expected digest of the spliced blob.
  Digest blob_digest = 2;
  // The ordered list of digests of the chunks which need to be concatenated to
  // assemble the original blob.
  repeated Digest chunk_digests = 3;
  ...
}

message SpliceBlobResponse {
  // Computed digest of the spliced blob.
  Digest blob_digest = 1;
  ...
}
```
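Server-side, the splice operation reduces to ordered concatenation plus an integrity check against the expected digest. A minimal sketch of that check (hypothetical helper name, assuming SHA-256 digests):

```python
import hashlib

def splice_and_verify(chunks: list, expected_hex: str) -> bytes:
    """Concatenate chunks in order and verify the result's SHA-256 digest.

    Raises ValueError if the spliced blob does not match the expected digest,
    mirroring the failure a server would return for a bad SpliceBlobRequest.
    """
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)  # streaming update: no need to hold the whole blob
    if h.hexdigest() != expected_hex:
        raise ValueError("spliced blob digest mismatch")
    return b"".join(chunks)

# The digest of the whole blob equals the digest of its ordered concatenation.
blob = b"hello " + b"big " + b"blob"
expected = hashlib.sha256(blob).hexdigest()
assert splice_and_verify([b"hello ", b"big ", b"blob"], expected) == blob
```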
The current proposal also included a definition of supported chunking algorithms, advertised through `CacheCapabilities`, but based on recent conversations that part will likely be removed in the final version.
Remote Execution Manifest Blobs
A new SHA256Encoded Digest function is introduced.
When using SHA256Encoded, each blob is expected to carry a small fixed-size header chunk that identifies whether it is a "normal" blob or a "manifest" blob.
A manifest blob includes a list of digests inside, each pointing to the chunk that can be used to put the large blob back together.
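The proposal does not fix a wire format here, but the header-plus-manifest idea can be illustrated with a toy encoding (all header values and the digest-list format below are made up for the sketch):

```python
import hashlib

HEADER_SIZE = 8
NORMAL = b"BLOB\x00\x00\x00\x00"  # toy header values; the proposal's
MANIFEST = b"MANIFST\x00"         # actual encoding is not specified here

def encode_manifest(chunk_digests: list) -> bytes:
    # A manifest blob is just a header plus the ordered chunk digest list.
    return MANIFEST + "\n".join(chunk_digests).encode()

def decode(blob: bytes, cas: dict) -> bytes:
    """Return the logical content: either the blob's own body, or the
    concatenation of the chunks a manifest points at."""
    header, body = blob[:HEADER_SIZE], blob[HEADER_SIZE:]
    if header == NORMAL:
        return body
    assert header == MANIFEST
    return b"".join(cas[d] for d in body.decode().split("\n"))

# Store two chunks in a toy CAS, then reassemble via the manifest.
cas = {}
for chunk in (b"hello ", b"world"):
    cas[hashlib.sha256(chunk).hexdigest()] = chunk
manifest = encode_manifest(list(cas.keys()))  # dict preserves insertion order
assert decode(manifest, cas) == b"hello world"
```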
Key takeaways
- Overall: Both proposals are sound and sufficient to support large blobs in the Remote APIs. However, both avoid specifying how blobs should be chunked. The only assumption made by both proposals is that implementations must be able to reassemble the big blob by concatenating the small chunks in order.
In practice, we noted that both proposals still work even if the build client/server/worker do not agree on a chunking algorithm. However, agreement on the chunking algorithm is expected to improve cache hit rates and reduce data transfer over the network.
- Adoption cost: There was concern about adoption costs for these proposals back at BazelCon. Worth noting that both proposals require client/server implementations to write some extra code to support them, so there will be some cost to adoption.
In particular, the `SHA256Encoded` proposal uses a new digest function, which means that both the client and the server need to implement it. Existing remote cache entries cannot be reused with this proposal, so there could be a brief increase in cache storage requirements for the existing system (or a shorter cache entry TTL during the migration to the new digest function).

Meanwhile, the Split/Splice RPCs are compatible with existing SHA256/BLAKE3 cache entries. It is also worth noting that the Remote APIs do not dictate how Worker APIs should be designed. However, for setups in which workers use the Remote APIs to communicate with the remote cache, there can be partial benefits for older clients doing remote builds against a server+worker that uses the new RPCs to speed up remote actions.
- Performance vs. verification: During a recent conversation, we noted that the current `SHA256Encoded` does not require implementations to hash the large blob to compute its digest. Instead, the large blob is only ever accessed through the manifest and its digest. Not having to hash the larger blobs could save some compute and therefore speed up builds with many large blobs. However, because the big blob's digest is not stored anywhere in the system, there is no way to verify the concatenation of the smaller chunks when writing the big blob to disk. This implies a certain level of trust in client/worker implementations to concatenate these chunks correctly.
In contrast, the Split and Splice RPCs require implementations to hash the larger blob to obtain its digest. The hashing operation can be slow but provides a way to verify the concatenation if needed. Because both the large blob digest and the manifest can be used, a build may require additional RPC calls to translate between digests and manifests. The performance cost of these additional network round trips versus downloading the big blob as-is can be non-trivial to predict.
Note: the `SHA256Encoded` proposal could add an optional digest field to the manifest to support verifying the concatenation.
Here are my personal opinions:
I think that between the two proposals, `SHA256Encoded` is simpler to implement and, in most cases, faster to operate, mainly thanks to not having to hash the large blob to compute its digest. If verification is needed, perhaps a `BLAKE3Encoded` variant with an optional digest chunk would be very attractive to us: BLAKE3 does not demand a special CPU SHA extension to hash large blobs efficiently.
With that said, I think the Split/Splice RPCs would have a higher impact on our users today. Many Bazel users are still on Bazel 6/7, and some are still on Bazel 5 even though the LTS period for that version has already expired. Migrating our users to newer Bazel versions, or to any alternative build tool/wrapper around Bazel (i.e. our internal CLI), would be expensive and slow. If the Split/Splice RPCs proposal is not accepted, most likely some server/worker implementations would still adopt it (or part of it) as their worker protocol to improve their users' experience immediately.
If either or both proposals are accepted, I think it's critical for clients and servers (and potentially workers) to agree on a chunking algorithm. As mentioned earlier, we have limited control over our users' client upgrades, so rolling out a newer client version is not a viable way to adjust the chunking algorithm. We also have a wide range of users with different kinds of big blobs. Some examples mentioned in various meetings and issues:
- Container Images (mostly tar.gz balls)
- Mobile app packages
- Binary executables (tests, deployables)
- OS images, VM snapshots
- LLM models, data files
- etc...
so I expect that some chunking algorithms will suit some customers better than others. Therefore, it's important that we let the server "hint" at the best chunking algorithm to improve the overall cache hit rate and reduce data transfer. I am OK with the client ignoring the "hinted" information at the cost of lower cache hit rates and degraded performance.
I would appreciate some additional documentation on chunking requirements. In particular, we should be explicit about the "concatenation" assumption to rule out chunking algorithms such as Reed-Solomon (i.e. erasure coding). I would also appreciate some discussion/documentation around how chunking should be applied when compression is supported:
- Should we chunk the big blob before compressing it or after?
- Should we compress the chunks? etc...
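Whatever algorithm is chosen, the only hard requirement both proposals impose is that concatenating the chunks in order reproduces the original bytes. Content-defined chunking (e.g. FastCDC-style rolling hashes) is a common family of choices because boundaries stay stable under insertions. A toy sketch below illustrates the invariant; the boundary rule and parameters are invented for illustration and are not FastCDC:

```python
import os

def chunk(data: bytes, min_size=2048, avg_mask=0x3FF, max_size=8192):
    """Toy content-defined chunker: cut when a rolling value hits a mask.

    Not a real CDC algorithm; it only demonstrates the one property both
    proposals rely on: ordered chunks always concatenate back to the input.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if length >= max_size or (length >= min_size and rolling & avg_mask == 0):
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder becomes the last chunk
    return chunks

data = os.urandom(100_000)
chunks = chunk(data)
assert b"".join(chunks) == data            # the concatenation invariant
assert all(len(c) <= 8192 for c in chunks)
```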
cc: @bergsieker @roloffs @ulfjack
I will be sending out a Google Meet invitation later so that the Working Group can discuss and decide on this. The invitation will target Feb 11, in the 1-hour slot after the monthly Remote Execution API meeting. If that time slot does not work, please propose 2-3 alternative time slots.
If anyone else is interested in joining the call, please comment below your email (or let me know on BazelBuild Slack) and I will forward the invite.
Please add me.
cc: @aehlig
Thanks for the good summary of the proposals. I agree that we probably want both approaches.
I'd like to just add a remark about the agreement on the chunking algorithm. In my opinion it plays quite a different role in the two approaches.
- For the split/splice proposal, the agreement on a splitting algorithm is a question of performance; the more the involved parties agree, the better the deduplication and hence the traffic savings. But even if some client splits differently, there is at least some saving and, more importantly, everything is correct.
- For the manifest proposal, on the other hand, the splitting algorithm is at the core of the definition. Since we hash the manifest (and not the denoted sequence of bytes of the underlying blob), the hash depends on the splitting algorithm used. So, essentially, the splitting algorithm is part of the definition of the manifest hash. (Recall that it is an essential property of a CAS that every blob has precisely one hash for a fixed hashing algorithm, i.e. that the key addressing an object is a function of its content.)
Thanks @sluongng for your nice summary of the proposals, please add me as well.
I sent out the meeting invites. Also added @moroten as he seems interested in one of the proposals.
Some notes:
- Downloads seem more important than uploads; on one of our larger single-customer clusters, the ratio of downloads to uploads (with bwotb enabled) is roughly 700 TB to 10 TB.
- Split/splice could be implemented today via `Execution/Execute` (similar to the Asset API; I think this is an example: https://github.com/buildbarn/bb-remote-asset/blob/master/pkg/fetch/remote_execution_fetcher.go).
- Eager vs. lazy splitting: I think initially we'll see very few split requests, as clients don't support this yet. That makes eager splitting costly, as most of the result is never used.
- Note that a client can download chunks of files today via `ByteStream/Read`; it is unclear if the current split API proposal requires having the chunks in the CAS individually (available via the API, not necessarily separately stored in the backend).
During today's meeting, some discussion was around max blob size. To put some concrete numbers to this: today I'm (remotely) building a single test target (fastbuild). The runfiles tree is 1.7G and the largest single file is 270M. (Largest 5 files: 270,770,176 (data file), 181,396,232 (binary), 100,462,464 (binary), 62,559,952 (.so), 40,726,976 (.so).)
At the moment I'm having headaches from this when the downloads are interrupted and I have to start over (on bazel 6.5.0 for a few more weeks) so resumable downloads would be helpful.
Also generic handwaving: I've seen compiling with -g causing a 10x binary size explosion.
A better example is that docker images can be ginormous, so I would support 100GB support "today" in anticipation that people will still be using early implementations in 5 years' time.
https://stackoverflow.com/questions/79432504/bazel-digest-calculation-of-big-sparse-files-is-slow/ this question seems to be relevant to this thread.
@ulfjack
> Split/splice could be implemented today via `Execution/Execute` (similar to the Asset API; I think this is an example: https://github.com/buildbarn/bb-remote-asset/blob/master/pkg/fetch/remote_execution_fetcher.go)
This is true. But that's just one way of implementing it, right?
I don't think it's worth dropping Split/Splice API and forcing folks to use Execute API.
> Note that a client can download chunks of files today via `ByteStream/Read`; it is unclear if the current split API proposal requires having the chunks in the CAS individually (available via API, not necessarily separately stored in the backend).
Yes, it does. Each chunk will be an individual blob that can be referred to by its digest. Having all the chunks available in the CAS will help us deduplicate chunks between multiple versions of the large blob.
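That dedup win can be made concrete: once chunks are individually addressable, only the chunks the CAS does not already hold need to move. A small sketch (hypothetical helper, assuming SHA-256 chunk digests):

```python
import hashlib

def transfer_needed(new_chunks, have_digests):
    """Bytes that must actually be uploaded for a new blob version, given the
    set of chunk digests the CAS already holds."""
    return sum(len(c) for c in new_chunks
               if hashlib.sha256(c).hexdigest() not in have_digests)

v1 = [b"A" * 100, b"B" * 100, b"C" * 100]   # chunks of the old version
v2 = [b"A" * 100, b"B" * 100, b"D" * 100]   # one chunk changed
have = {hashlib.sha256(c).hexdigest() for c in v1}
assert transfer_needed(v2, have) == 100      # only the changed chunk moves
```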
@ulfjack
> - Split/splice could be implemented today via `Execution/Execute` (similar to the Asset API; I think this is an example: https://github.com/buildbarn/bb-remote-asset/blob/master/pkg/fetch/remote_execution_fetcher.go)
As @sluongng already mentioned, this is one implementation, but not necessarily a desirable one.
> - Eager vs. lazy splitting; I think initially we'll see very few split requests as clients don't support this yet. That makes eager splitting costly as most of the result is never used.
If clients do not make use of split results, then why do any splitting on the server side at all? The intention of our API extension was to split only on request.
> - Note that a client can download chunks of files today via `ByteStream/Read`; it is unclear if the current split API proposal requires having the chunks in the CAS individually (available via API, not necessarily separately stored in the backend).
Exactly as @sluongng explained. How else would you check which chunks are available locally (in the split case) or remotely (in the splice case), so as to download/upload only the missing ones?
@leftsock
> At the moment I'm having headaches from this when the downloads are interrupted and I have to start over (on bazel 6.5.0 for a few more weeks) so resumable downloads would be helpful.
With our API proposal #282, you could request to split your GBs-large file. This would return a list of chunks (smaller blobs, probably a couple of KBs or MBs in size) which you could download individually. If the download gets interrupted, you can request the list of chunks again with a split request, check which chunks are already available locally, and download only the missing chunks from remote.
In this sense, our API proposal allows partial downloading.
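The resume flow described above can be sketched as: re-request the chunk list, diff it against what is already on disk, and fetch only the gap (hypothetical `fetch_chunk` callable standing in for a CAS read RPC):

```python
import hashlib

def resume_download(chunk_digests, local_store, fetch_chunk):
    """Download only the chunks not already present locally, then splice.

    local_store: dict digest -> bytes of chunks that survived the interruption.
    fetch_chunk: callable digest -> bytes (stands in for a CAS read RPC).
    Returns the reassembled blob and how many chunks were actually fetched.
    """
    fetched, parts = 0, []
    for digest in chunk_digests:
        if digest not in local_store:
            data = fetch_chunk(digest)
            # verify each chunk on arrival so corruption is caught early
            assert hashlib.sha256(data).hexdigest() == digest
            local_store[digest] = data
            fetched += 1
        parts.append(local_store[digest])
    return b"".join(parts), fetched

chunks = [b"part1-", b"part2-", b"part3"]
digests = [hashlib.sha256(c).hexdigest() for c in chunks]
local = {digests[0]: chunks[0]}                 # only the first chunk survived
remote = dict(zip(digests, chunks))
blob, fetched = resume_download(digests, local, remote.__getitem__)
assert blob == b"part1-part2-part3" and fetched == 2
```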
(sorry for dropping in out of the blue - I work on CI infrastructure in/around the Chrome/Android world at Google and am painfully familiar with the CAS system here, even though I haven't piped up on GH here before :). I also worked with Steven a fair amount over the past couple years and we had chatted about this a bit while he was drafting his proposal.)
Re: the manifest approach, I think one of the biggest downsides of the manifest proposal as currently written is that it requires a new hash type. This puts a burden on clients and servers, and it also doesn't scale well to other digests (e.g. what if you wanted this with MURMUR3 hashes?).
Ideally, I think, we could make a change here that concentrates the majority of the implementation into clients (uploaders/downloaders) without any big tweaks to the service implementations.
I was wondering if there had been consideration for extending the `FileNode` object to represent a chunked file by:
- Add a new field `repeated Digest chunks = X` to `FileNode`, and another new field `bool chunked_only = X`.
- If `chunked_only` is false, reading clients can choose to pull chunk blobs or `FileNode.digest` directly via existing APIs. If it's true, then clients may only expect to be able to pull the chunks (i.e. the server may incidentally have the full blob, but the client didn't push it).
- Concatenating the chunks in order of `FileNode.chunks` must yield the `FileNode.digest` value.
- Uploaders may pick whatever chunking mechanism they want (e.g. FastCDC, or some content-aware chunking for zip or other structured archive formats, etc.).
- The simplest implementation of upload would be to have the client just directly push the chunks and/or full content (if `chunked_only=false`) as needed. The expectation would be that if `chunks` is set, then the client pushed those chunks.
- Clients could fall back from chunks to the main digest in case a chunk is missing from the server for some reason. Given the `FileNode` and the whole file, the client could also precisely regenerate the chunks from just the whole-file content without needing the original chunking function (because the `chunks` digests already encode each chunk's size).
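The last point can be made concrete: since each `chunks` entry carries a size, a client holding only the whole file can re-slice it deterministically and check every slice's digest, with no chunking algorithm in sight. A sketch (hypothetical helper, representing each chunk reference as a (sha256-hex, size) pair):

```python
import hashlib

def regenerate_chunks(whole: bytes, chunk_refs):
    """Re-slice a whole file into the chunks a FileNode's `chunks` field
    describes, using only the recorded (hash, size) pairs.

    chunk_refs: ordered list of (sha256_hex, size) tuples.
    """
    out, offset = [], 0
    for hex_digest, size in chunk_refs:
        piece = whole[offset:offset + size]
        # each slice must hash to the recorded chunk digest
        assert hashlib.sha256(piece).hexdigest() == hex_digest
        out.append(piece)
        offset += size
    assert offset == len(whole)  # the sizes must cover the file exactly
    return out

whole = b"abcdef" + b"ghij"
refs = [(hashlib.sha256(b"abcdef").hexdigest(), 6),
        (hashlib.sha256(b"ghij").hexdigest(), 4)]
assert regenerate_chunks(whole, refs) == [b"abcdef", b"ghij"]
```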
The benefit of this would be that very little server support is needed (just two field additions; and if the server chases CAS trees for some reason, it would have to chase through this new field too), and it would work out of the box with all existing digest functions.
Example (FileNode representing a chunked file where the client did not upload "abce...def0"):
```
name: "name_of_concatenated_file"
digest { hash: "abce...def0" size: 777777 }
chunks { hash: "469c...e65d" size: 123456 }
chunks { hash: "a191...1dfc" size: 654321 }
chunked_only: true
```
There is an alternate version here (which I think would be a bit awkward, but may be workable, and would require zero server support) that adds a new `node_property` to a `Directory` which effectively says "I represent a chunked file", and then requires that the `Directory`'s files list be the alphanumerically sorted chunks of the target file. However, this is a bit weird (Directories can now sometimes be files), and it runs into the issue that `Directory` doesn't have `is_executable`, which would probably still be needed for chunked files (though this is maybe solvable with `node_properties`, possibly with additional awkwardness).
> If `chunked_only` is false, reading clients can choose to pull chunk blobs or `FileNode.digest` directly via existing APIs. If it's true, then clients may only expect to be able to pull the chunks (i.e. the server may incidentally have the full blob, but the client didn't push it).
Does that mean only newer clients can use the proposed change? What should an older client do when they get an action cache hit with `ActionResult -> Directory -> FileNode` with `chunked_only = true`?
Worth noting that it's not true that the server would not have to change: it would still need to validate that all the chunks are available in the CAS before returning the `ActionResult`. So some server-side code change is still needed.
So I am not very clear on the motivation here.
> Does that mean only newer clients can use the proposed change? What should an older client do when they get an action cache hit with ActionResult -> Directory -> FileNode with chunked_only = true?
If `chunked_only` is false, then it would be compatible with old clients. My assumption would be that to roll this out, you could start with all clients in `chunked_only=false` mode until all clients are updated, then switch to `chunked_only=true`.
> Worth noting that it's not true that the server will not have to change: they would still need to validate for all the chunks to be available in CAS before returning the Action Result. So some server-side code change is still needed.
Right, this is what I meant by "if the server chases CAS trees for some reason, it would have to chase through this new field, too". If the server is validating the blob set, or wants to validate that `cat{chunks} == digest`, it has to be aware of this field.
The motivation is that it doesn't introduce a new digest type, and as other digest types are supported by the server/client, they will be supported here as well (without having to double every digest type into XXX and XXXEncoded variants).
(what I suggested also permanently retains the big blob digest to help with verification across the board)