remote-apis
Add blob split and splice API
This is a proposal for a conservative extension to the ContentAddressableStorage service that reduces traffic when blobs are fetched from the remote CAS to the host for local use or inspection. With this extension, a client can request that a remote-execution endpoint split a specified blob into chunks of a certain average size. These chunks are stored in the CAS as blobs, and the ordered list of chunk digests is returned. The client can then check which blob chunks are already available locally from earlier fetches and fetch only the missing ones. Using the digest list, the client can splice the requested blob from the locally available chunk data.
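The download flow described above can be sketched as follows. This is an illustration only; the function name, the dict-based local cache, and the `fetch_chunk` callback are assumptions of the sketch, not part of the proposed API:

```python
def fetch_blob_via_split(chunk_digests, local_cache, fetch_chunk):
    """Reassemble a blob from the ordered chunk-digest list returned
    by a hypothetical SplitBlob call.

    chunk_digests: ordered list of chunk digests from the server
    local_cache:   dict mapping digest -> chunk bytes from earlier fetches
    fetch_chunk:   callback that downloads one chunk from the remote CAS
    """
    for d in chunk_digests:
        if d not in local_cache:  # only the missing chunks travel
            local_cache[d] = fetch_chunk(d)
    # Splice: concatenate the chunk data in digest-list order.
    return b"".join(local_cache[d] for d in chunk_digests)
```

If most chunks are already cached from an earlier fetch of a similar blob, only the few changed chunks are transferred.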
This extension could especially help to reduce traffic when large binary files are created on the remote side and needed locally, such as executables with debug information, comprehensive libraries, or even whole file-system images. It is a conservative extension, so no client is forced to use it. In our build-system project justbuild, we have implemented this protocol extension on both the server and client side.
Though this proposal may solve certain annoyances, it does not fix some other issues:
- The inability to upload files by uploading missing chunks. Given the fact that uploading is often slower than downloading (using typical internet connections provided by ISPs), I would argue that this is more important than downloading.
- The inability for clients to access files at random, without sacrificing the ability to check file integrity.
I would rather see that we try to solve these issues as part of REv3, by working towards eliminating the existence of large CAS objects in their entirety. See this section in the proposal I've been working on:
https://docs.google.com/document/d/1FxpdOzOhzOCTjjn2loppMlBzjqjU9WpYF4E1K6opxVI/edit#heading=h.3ix2ngn3uabo
This is quite similar to https://github.com/bazelbuild/remote-apis/pull/233/files. @roloffs have you had a chance to review the PR prior and related discussions?
I think both PRs approach this by adding a separate RPC, which is good and V2-compatible. The V3 support for this could be discussed separately and should not block V2 improvements as long as it's backward compatible?
@EdSchouten, thanks for sharing the link to your REv3 discussion document and your comments about this proposal. Sorry for not being aware of this document. After going through it, I agree with you that this API extension would not make much sense in REv3, given your proposal in the "Elimination of large CAS objects" section. However, as @sluongng also stated, this extension is conservative and backwards compatible with REv2, and the release of REv3 is very uncertain right now. It would not harm people who do not use it, but would already provide advantages for people who do, and it could also yield insights for your content-defined chunking ideas for REv3, since we also used such an algorithm to split blobs.
I also agree with your concern that uploading large files is not covered by this proposal, while relevant use cases exist for it. However, I can think of a symmetric SpliceBlob rpc that allows splitting a large blob on the client side, uploading only those parts of the blob that are missing on the server side, and then splicing them there. This could be added in this PR as well.
@sluongng, thanks for pointing out this PR. Although the two look very similar, they actually target complementary goals. Let me explain why. While the PR from @EdSchouten introduces split and combine rpcs for blobs, its goal is not to save traffic but to introduce a blob-splitting scheme that allows verifying the integrity of a blob by validating the digests of its chunks, without reading the whole chunk data. To achieve this, he introduced a new digest function, SHA256TREE, which allows recursive digest calculation. I hope I have not completely misunderstood your intention, @EdSchouten. In contrast, the splitting scheme presented here aims at maximizing reuse, with the final goal of reducing traffic between client and server. For example, if a large binary in the remote CAS was modified only slightly and you want to use it locally, you would otherwise have to download it completely. Using the presented extension, only the binary differences between the two versions, as determined by content-defined chunking, have to be downloaded, which is typically much less than the whole data. As I said, the two splitting schemes are complementary and follow different goals.
I think what's missing in this PR is a specification of what the splitting algorithm would look like, and the ability to choose different algorithms for the job.
In #233, the chunking algorithm was coupled with the digest algorithm, which I think is a good start as it's customizable. But I can definitely see cases where the digest algorithm and the chunking algorithm are chosen separately in different combinations (e.g., Reed-Solomon + BLAKE3, FastCDC + SHA256, delta compression + GITSHA1, etc.). Each combination could serve a different purpose (deduplication, download parallelization, etc.).
It would be nice if you could provide a bit more detail regarding your splitting algorithm of choice as an option here.
While the actual choice of the splitting algorithm is mainly an implementation detail of the remote-execution endpoint (which, of course, affects the quality of the split result), the essential property of a server is to provide certain guarantees to a client if it successfully answers a SplitBlob request:
- The blob chunks are stored in CAS.
- Concatenating the blob chunks in the order of the digest list returned by the server results in the original blob.
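As a sketch, a client can check both guarantees locally after a split. SHA-256 stands in here for whatever digest function the instance actually uses, and `chunk_store` models the client's local copy of the fetched chunk blobs:

```python
import hashlib

def splice_and_verify(original_digest, chunk_store, chunk_digests):
    """Splice a blob from its chunks and check both server guarantees:
    every chunk is present (here: in the local chunk store), and the
    concatenation in digest-list order reproduces the original blob."""
    parts = []
    for d in chunk_digests:
        data = chunk_store[d]                          # guarantee 1: chunk exists
        assert hashlib.sha256(data).hexdigest() == d   # chunk integrity
        parts.append(data)
    blob = b"".join(parts)                             # guarantee 2: splice
    assert hashlib.sha256(blob).hexdigest() == original_digest
    return blob
```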
Besides these guarantees, in order to increase the reuse factor between different versions of a blob as much as possible, it makes sense to implement a content-defined chunking algorithm. Such algorithms typically produce chunks of variable size and are insensitive to the data-shifting problem of fixed-size chunking.
Such content-defined chunking algorithms typically rely on a rolling-hash function to efficiently compute hash values of consecutive bytes at every byte position in the data stream in order to determine the chunk boundaries. Popular algorithms for content-defined chunking are:
- Polynomial rolling hash
- Rabin fingerprint
- Cyclic polynomial rolling hash (also called Buzhash)
- Fast rolling Gear hash algorithm (also called FastCDC)
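A toy comparison illustrates the data-shifting problem mentioned above. The content-defined chunker below cuts wherever a hash of a small sliding window is zero modulo a target size; it is an illustration only, not one of the production algorithms listed:

```python
import hashlib

def fixed_chunks(data, size=64):
    """Fixed-size chunking: an insertion shifts every later boundary."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data, window=16, modulus=64):
    """Toy content-defined chunking: boundaries depend only on the local
    window contents, so they re-synchronize after an insertion."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        h = int.from_bytes(hashlib.sha256(data[i - window:i]).digest()[:4], "big")
        if h % modulus == 0 and i - start >= window:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks
```

Inserting a few bytes near the start of a buffer leaves most content-defined chunks identical, while almost all fixed-size chunks change.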
I have selected FastCDC as the chunking algorithm for the endpoint implementation in our build system, since it has been shown to be very compute-efficient and faster than the other rolling-hash algorithms while achieving deduplication ratios similar to the Rabin fingerprint. We have already observed reuse factors of 96-98% for small changes when working with big file-system images (around 800 MB), and of 75% for a 300 MB executable with debug information.
You may want to have a look at our internal design document for more information about this blob-splitting API extension.
Ah, I think I have realized what's missing here. Your design seems to focus on splitting the blob on the server side so that the client can download large blobs, while I was thinking that blob splitting could happen on both the client side and the server side.
For example: a game designer may work on some graphic assets, say a really large picture. Subsequent versions of the picture may get chunked on the client side. Then the client can compare the chunk list with the chunks that are already available on the server side, and only upload the parts that are missing.
So in cases where both client and server have to split big blobs for efficient download AND upload, it's beneficial for the two sides to agree on how to split (and put back together) big blobs.
Yes, you are right, this design currently focuses on splitting on the server side and downloading large blobs, but as mentioned in a comment above, I am willing to extend this design proposal with a SpliceBlob rpc to handle chunked uploads. It allows splitting a blob on the client side, uploading only the chunks that are missing on the server, and then splicing the original blob there.
It may be worth mentioning that in this case, it is not strictly required for client and server to agree on the same splitting algorithm, since after the first round-trip overhead, the chunking algorithm for each direction ensures efficient reuse anyway.
I will update this proposal to handle uploads for you to review. Thank you very much for your interest and nice suggestions.
Do keep in mind that there could be mixed usage of clients (a) with chunking support and clients (b) without chunking support.
So I do believe a negotiation via the initial GetCapabilities RPC, similar to the current digest and compressor negotiation, is very desirable, as the server would need to know how to put a split blob uploaded by (a) back together in order to serve it to (b).
I would recommend throwing the design ideas into https://github.com/bazelbuild/remote-apis/issues/178. It's not yet settled whether chunking support needs to be a V3 exclusive feature, or we could do it as part of V2. Discussion to help nudge the issue forward would be much appreciated.
@sluongng I have updated the PR with a sharper description of what is meant by this blob-splitting approach and what its goal is, along with a proposal for the chunked upload of large blobs.
Some thoughts about your hints regarding the capabilities negotiation between client and server:
- If a client does not support blob splitting, it would not call SpliceBlob at the server, but it can still call SplitBlob at the server if the server supports it (an evaluation of the blob_split_support flag is enough to find this out). Then it is up to the server how to split the blob; it just needs to give the guarantees mentioned earlier.
- If a client supports blob splitting, it could upload chunks to the server and use the SpliceBlob operation if the server supports it (an evaluation of the blob_splice_support flag is enough to find this out). Supporting the splice operation mainly means being able to concatenate chunks of data specified by the uploaded list of chunk digests in the given order. The client could even exploit domain-specific knowledge (that the server does not have) to split its blobs into chunks to improve traffic reduction for subsequent uploads of modified blobs. Furthermore, it can also use SplitBlob at the server if the server supports it.
This means each side is responsible for its own chunking approach, without the other side having to know about it; the other side just needs to be able to concatenate the chunks. Furthermore, it would be difficult to agree on, e.g., the same FastCDC algorithm, since this algorithm internally depends on an array of 256 random numbers (generated by the implementer) and thus could produce completely different chunk boundaries for two different implementations, preventing any reuse between the chunks on the server and the client.
I will also put a summary of this blob splitting and splicing concept into #178. It would be nice if this concept could find its way into REv2, since it is just an extension that is free to use and not an invasive modification.
Do give https://github.com/bazelbuild/remote-apis/issues/272 and my draft PR a read on how client/server could negotiate for a spec leveraging GetCapabilities rpc. Could be useful if you want to have a consistent splitting scheme between client and server.
Hello @sluongng, I have updated the proposal to use the capabilities service as you proposed. It is now possible for a client to determine the chunking algorithms supported on the server side and select one in a SplitBlob request. This way, the client can select one that it also uses locally, so that both communication directions benefit from the chunking data available on each side. Furthermore, I have added some comments about the lifetime of chunks. Thanks for your time reviewing this PR!
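The negotiation step can be sketched as a simple intersection of preferences. The algorithm names and the shape of the capabilities answer below are illustrative, not the actual field names of the proposal:

```python
def pick_chunking_algorithm(server_algorithms, client_preferences):
    """Select a chunking algorithm both sides implement.

    server_algorithms:  set of algorithms the server advertises, e.g.
                        via a GetCapabilities-style answer (illustrative)
    client_preferences: algorithms the client implements, best first
    """
    for algo in client_preferences:
        if algo in server_algorithms:
            return algo
    return None  # no common algorithm: fall back to whole-blob transfers
```

With no overlap, both sides can still use split and splice independently, just without cross-side chunk reuse.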
Hello all,
after spending quite some time working on this proposal and its implementation, I have finished incorporating all suggestions made by the reviewers and those that came up during the working group meeting. In summary, the following high-level features would be added to the REv2 protocol:
- Optional split operation at the CAS service to reduce traffic for the download direction.
- Optional splice operation at the CAS service to reduce traffic for the upload direction.
- Negotiation mechanism between client and server to agree about the used chunking algorithm.
This whole proposal is fully implemented in our own remote-execution implementation in justbuild and used by the just client.
From my side, this proposal is finished and ready for final review. What I would like to know from you is what now needs to be done for this proposal to finally get merged into main. I can also summarize it again at the next working group meeting and would ideally like a decision on how to proceed with this proposal. Thank you very much for your efforts.
@EdSchouten, @sluongng, @bergsieker
@mostynb, as far as I understand the protocol, no. While the bytestream API with read_offset and read_limit allows you to partially read the content of a blob, it does not allow you to create a new blob from a batch of other blobs (its chunks) in the remote CAS.
The goal of blob splicing is that if a client regularly uploads slightly different versions of a large object to the remote CAS, only the binary differences between the versions need to be uploaded, not the entire block of binary data every time. To achieve this, the client splits the large object into reusable chunks (typically via content-defined chunking) and uploads only the chunks (handled as blobs) that are missing from the remote CAS, which on the first upload is normally most of them. If the client needs to upload this large object again, but in a slightly different version (meaning only a percentage of the binary data has changed), it splits it into chunks again and checks which chunks are missing from the remote CAS. Content-defined chunking normally splits the unchanged binary data into the same set of chunks; different chunks are created only where binary differences occur. This means only a fraction of the whole set of chunks needs to be uploaded to the remote CAS in order to reconstruct the second version of the large object there. The actual reconstruction of a large blob on the remote side is done using the splice command with a description of which chunks to concatenate (a list of chunk digests available in the remote CAS).
The split operation works exactly the other way around, for when you regularly download an ever-changing large object from the remote CAS. The server splits the large object into chunks, and the client fetches only the locally missing chunks and reconstructs the large object from the locally available chunks.
Finally, to exploit chunking in both directions at the same time, it makes sense for the client and the server to agree on a chunking algorithm, so that chunks created on either side can be reused. For this, we added a negotiation mechanism to agree on the chunking algorithm used on both sides.
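The upload direction can be sketched as follows. Here `chunker`, `remote_missing`, `upload_chunk`, and `splice` stand in for the client's CDC implementation and the (hypothetical) CAS-side RPCs; none of these names are from the actual proto:

```python
import hashlib

def upload_blob_via_splice(blob, chunker, remote_missing, upload_chunk, splice):
    """Split a blob locally, upload only the chunks the server reports
    missing, then ask the server to splice the blob back together."""
    chunks = chunker(blob)
    digests = [hashlib.sha256(c).hexdigest() for c in chunks]
    by_digest = dict(zip(digests, chunks))
    for d in remote_missing(digests):  # FindMissingBlobs-style query
        upload_chunk(d, by_digest[d])
    return splice(digests)             # server concatenates in order
```

On a second upload of a slightly modified blob, only the chunks covering the changed regions are reported missing, so only those travel over the wire.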
Hi @roloffs,
Earlier today I spent some time experimenting with FastCDC. In addition to implementing Algorithm 2 "FastCDC8KB" from the paper, I also wrote something corresponding to the following:
Input: data buffer src; buffer length n
Output: chunking breakpoint bestOffset
/* The same initialization as the original algorithm. */
MinSize = ...;
MaxSize = ...;
fp = 0;
if n <= MinSize then
    return n;
if n > MaxSize then
    n = MaxSize;
/* Compute the rolling hash over the 64 bytes preceding the cutoff range. */
for i = MinSize - 64; i < MinSize; i++ do
    fp = (fp << 1) + Gear[src[i]];
/* Scan the cutoff range for the point with the highest rolling hash. */
bestFP = fp;
bestOffset = MinSize;
for ; i < n; i++ do
    fp = (fp << 1) + Gear[src[i]];
    if fp > bestFP then
        bestFP = fp;
        bestOffset = i + 1;
return bestOffset;
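For anyone who wants to experiment, here is a runnable sketch of this variant (the naive, non-amortized version). The MinSize/MaxSize values, the 64-byte warm-up window, and the seeded Gear table are arbitrary choices for illustration:

```python
import random

# Gear table: 256 pseudo-random 64-bit values. Any fixed table works;
# the seed here is an arbitrary choice for the sketch.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

MIN_SIZE = 2048   # illustrative values, not tuned
MAX_SIZE = 8192
WINDOW = 64       # warm-up window of the rolling hash

def cut_point(src: bytes) -> int:
    """Return the breakpoint: the position in [MinSize, MaxSize] with
    the highest rolling Gear hash, following the pseudocode above."""
    n = len(src)
    if n <= MIN_SIZE:
        return n
    n = min(n, MAX_SIZE)
    fp = 0
    # Warm up the rolling hash over the 64 bytes before MinSize.
    for i in range(MIN_SIZE - WINDOW, MIN_SIZE):
        fp = ((fp << 1) + GEAR[src[i]]) & MASK64
    best_fp, best_offset = fp, MIN_SIZE
    # Scan the cutoff range for the highest hash value.
    for i in range(MIN_SIZE, n):
        fp = ((fp << 1) + GEAR[src[i]]) & MASK64
        if fp > best_fp:
            best_fp, best_offset = fp, i + 1
    return best_offset

def split(data: bytes) -> list:
    """Split a buffer into chunks by repeatedly applying cut_point."""
    chunks = []
    while data:
        k = cut_point(data)
        chunks.append(data[:k])
        data = data[k:]
    return chunks
```

Note the 64-bit shift-left accumulator retains contributions from the previous 64 bytes in its top bits, matching the observation above about their value.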
The following thought has gone into this variant:
- As we saw above, the expected chunk size of plain FastCDC8KB resembles a normal distribution. I fail to see why this should be preferred over a uniform distribution. It means that if cuts start to diverge due to differing file contents, it also potentially takes longer for the cuts to converge after the file contents become equal again. This is due to the existence of a potential period/cadence in cuts.
- With FastCDC8KB, if no suitable location can be found to make a cut, the algorithm simply returns $MaxSize$. For Rabin fingerprinting the probability of hitting $MaxSize$ tends to be relatively high, but FastCDC's use of $MaskL$ acts somewhat as a safeguard. That said, if we do reach the maximum, it stops being "content defined" chunking altogether. The poor selection of the cut could potentially trickle through into subsequent chunks/cuts. By always placing the cut at the point with the highest $fp$, instead of hoping to see a point where the desired number of bits is clear, the chunking remains "content defined".
- FastCDC8KB only uses the middle 32 bits of $fp$. This is a shame, because the top 16 bits tend to be the most valuable ones (being computed from up to 64 previous bytes of data).
If implemented naively, the algorithm above will have a worse worst-case running time. Namely, we always compute $fp$ over the full $MaxSize$ bytes of input, even though the resulting chunk may be closer to $MinSize$. This means that the worst-case running time of the algorithm above is $O(n \cdot \frac{MaxSize}{MinSize})$ instead of just $O(n)$. Fortunately, this can easily be addressed by making an array of $bestFP$ and $bestOffset$ that is $\lceil \frac{MaxSize}{MinSize} \rceil$ elements in size. This array can preserve rolling hashes, to be carried over to subsequent rounds. Using this approach I managed to end up with an implementation with an amortized worst-case running time of $O(n)$. Its throughput is nearly indistinguishable from plain FastCDC8KB, due to the hot path still being the rolling hash computation, which is unaltered.
With regards to performance of the chunking performed, I downloaded some different versions of the Linux kernel source code. Because the upstream tarballs contain timestamps, I unpacked them and concatenated all of the files contained within, which gave me some ~1.4 GB files, consisting mostly of text. I cut these files into ~10 KB chunks using both FastCDC8KB and the algorithm described above, giving me ~140k chunks per kernel. Comparing Linux 6.7.10 with Linux 6.8.1, I see that:
- Using FastCDC8KB, 12.29% of the chunks change.
- Using the algorithm above, only 11.44% of the chunks change.
This means that the algorithm described above performs $1 - 11.44\% / 12.29\% = 6.92\%$ better at eliminating redundancy for this specific dataset. Also using a uniform distribution in chunk size makes it possible to reduce the maximum chunk size significantly. Tuning it to fit your needs should also be easier, as you now only need to adjust $MinSize$ and $MaxSize$. $AvgSize$ and the bit masks are no longer needed.
Would you by any chance be interested in trying to reproduce these results? I'd be interested in knowing whether these savings hold in general.
@EdSchouten what do you think would be the next step here for this PR?
To me, it seems like we have established the value of using FastCDC as one of the potential chunking algorithms. FastCDC comes with several configuration knobs that could be finetuned against the data to improve the efficiency of runtime and deduplication hits.
I think this means that we are going to need a mechanism for the server to advertise the desired FastCDC config, and for the client to comply accordingly.
As for the modified FastCDC: if we cannot expose it as a discoverable configuration knob, then we could add it as a new chunking algorithm after FastCDC is merged.
WDYT?
I have no opinion whatsoever about what should happen here. As I mentioned during the last working group meeting, I have absolutely no intent to implement any of this on the Buildbarn side as part of REv2. I don't think there is an elegant way to get this into the existing protocol without making unreasonable sacrifices. For example, I care about data integrity. So with regards to what the next steps are, that's for others within the working group to decide.
That said, I am more than willing to engage in more discussions on how we should address this as part of REv3. First and foremost, I think that the methodology used to chunk objects should not be part of the lower-level storage. Files should be Merkle trees that are stored in the CAS in literal form. Which methodology is used to chunk files should only need to be specified by clients, to ensure that workers chunk files in a way that is consistent with locally created files. Therefore, the chunking policy in REv3 should most likely be stored in its equivalent of the Command message, not in any of the capabilities.
FYI: If other people want to do some testing in this area, I have just released the source code for the algorithm described above: https://github.com/buildbarn/go-cdc
Hey @roloffs ,
We had a monthly REAPI meeting this week and the maintainers have concluded that we should push this PR forward. Based on the discussion of the current state of this PR, these are the notable blockers that need some additional work:
- Clarification on which chunking algorithms should be included and why. In the specific case of CDC, the maintainers have asked whether we could provide a "sane default" for the set of tuning parameters involved in setting it up.
- A test vector should be provided for each chunking algorithm. This should help folks verify their implementations against the spec more easily.
- Some documentation improvement is needed. Specifically, the maintainers suggested that this new API set should be marked as an "optional extension" of REAPI v2, to highlight that this is a temporary solution to the large-blobs problem for REAPI v2. I also suspect using the word "Experimental" here would help lower end-user expectations and help us merge this PR earlier. The maintainers hope that REAPI v3 will provide a more concrete solution to this problem.
With that said, please let me know if you still have the capacity to work on this PR. If not, I could give it a try in a few weeks to see if I could help drive it to the finish line.
cc: @buchgr @EdSchouten
Hello @sluongng, sorry for not being responsive for a longer time; I was on parental leave from work for three months and will catch up on everything during the next days. I am willing to finish this PR and have the capacity to do so from now on. Still, if you are willing to support it, that would be appreciated, since I have to consider and incorporate all the great comments from @EdSchouten. There is a Remote Execution API Working Group meeting today, but I won't attend since there is not much to report yet. I will do my best to finish everything by the next meeting in August.