
Deduplication Strategies

blaggacao opened this issue · 3 comments

Reference: https://github.com/NixOS/nixpkgs/issues/89380

As proposed there by @wamserma, and also recently by @wmertens on Discourse.

Edit: further references for this idea are in the next comment below.

Before casync or any generic dedup strategy

... we can likely exploit the (presumed and as-yet unmeasured) property that many rebuilds may not actually change many bytes beyond the store path references to other dependencies, thus creating an almost identical new store entry.

Possible Solution

When zeroing these store references (i.e., rewriting them to /nix/store/0000000...) produces identical artifacts, we have a potent, cheap initial deduplication strategy exclusive to store-based systems.

The true references would need to be stored via some metadata schema and substituted back cheaply on the fly when serving a given concrete store entry.
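
To make this concrete, here is a minimal sketch of the zeroing step, assuming references appear as textual /nix/store/<hash> occurrences with a 32-character hash; the function name and the metadata shape are illustrative, not an actual attic API:

```rust
// Hypothetical sketch: scan a serialized store entry for textual store
// path references, zero out the 32-character hash part, and record
// (offset, original hash) so the true references can be substituted
// back on the fly when the entry is served.
const STORE_PREFIX: &[u8] = b"/nix/store/";
const HASH_LEN: usize = 32; // length of a Nix store path hash

fn zero_references(blob: &mut [u8]) -> Vec<(usize, Vec<u8>)> {
    let mut refs = Vec::new();
    let mut i = 0;
    while i + STORE_PREFIX.len() + HASH_LEN <= blob.len() {
        if &blob[i..i + STORE_PREFIX.len()] == STORE_PREFIX {
            let start = i + STORE_PREFIX.len();
            refs.push((start, blob[start..start + HASH_LEN].to_vec()));
            // Rewrite to /nix/store/0000... as proposed above.
            blob[start..start + HASH_LEN].fill(b'0');
            i = start + HASH_LEN;
        } else {
            i += 1;
        }
    }
    refs // to be persisted via the metadata schema mentioned above
}
```

Two rebuilds that differ only in their dependency hashes would then serialize to byte-identical blobs, which any downstream content addressing deduplicates for free.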

Quantification Necessary

The purpose of this issue is to anchor this strategy in the context of attic, since the roadmap mentions the potential future use of deduplication strategies, specifically casync.

Unfortunately, I can't provide any estimates or quantification of the benefits beyond the qualitative argument that this could be an interesting and cheap approach to explore before more advanced dedup strategies are considered.

Still, I think it is worth mentioning in this context.

blaggacao avatar Jan 02 '23 01:01 blaggacao

Further references

The spongix Nix cache implementation already uses desync (which has an interesting README!), a casync implementation in Go.

For interested parties, here is the introductory blog post for casync, which explains the algorithm: https://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html

And a really deep dive into content-defined chunking is here: https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/


Stripping store references has also been discussed in the context of nix-casync.

And the introductory blog post is also a good entry point from a Nix perspective.

blaggacao avatar Jan 02 '23 01:01 blaggacao

FastCDC-based chunking has been added in e8f9f3c04b3b13ac6cde6a4fad25c9cd1ff7198e [^1]. In this new model, NARs are backed by a sequence of content-addressed chunks in the Global Chunk Store. Newly uploaded NARs will be split into chunks with FastCDC, and only new chunks will be uploaded to the storage backend. NARs that existed prior to chunking will be converted to have a single chunk.

For example, this works reasonably well for huge unfree paths that are rebuilt without a version change (e.g., Zoom and VSCode), even with a very large chunk size. I have some simple numbers here and will post more later. It also works even when there are some differences (e.g., llvm-13.0.0-lib -> llvm-13.0.1-lib). The default average chunk size is 128 KiB, and I will add atticadm test-chunking so chunk sizes can be easily fine-tuned.
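
For illustration, a minimal sketch of the splitting step using the fastcdc crate (a Rust FastCDC implementation); the 128 KiB average matches the default mentioned above, while the min/max bounds here are assumptions, not necessarily Attic's actual configuration:

```rust
// Sketch of content-defined chunking with the `fastcdc` crate
// (fastcdc::v2020, crate version 3.x). Cut points are derived from the
// content itself, so a local change only shifts nearby boundaries and
// the remaining chunks still deduplicate against the previous upload.
use fastcdc::v2020::FastCDC;

fn split_nar(nar: &[u8]) -> Vec<&[u8]> {
    // min = avg / 4 and max = avg * 4 are illustrative bounds.
    let chunker = FastCDC::new(nar, 32 * 1024, 128 * 1024, 512 * 1024);
    chunker
        .map(|c| &nar[c.offset..c.offset + c.length])
        .collect()
}
```

Each resulting chunk would then be content-addressed (e.g., by a cryptographic hash) and uploaded only if the Global Chunk Store does not already contain it.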

I'm leaving this issue open so other approaches can be explored.

Some relevant FAQs:

Why chunk NARs instead of individual files?

In the current design, chunking is applied to the entire uncompressed NAR file instead of individual constituent files in the NAR. Big NARs that benefit the most from chunk-based deduplication (e.g., VSCode, Zoom) often have hundreds or thousands of small files. During NAR reassembly, it's often uneconomical or impractical to fetch thousands of files to reconstruct the NAR in a scalable way. By chunking the entire NAR, it's possible to configure the average chunk size to a larger value, ignoring file boundaries and lumping small files together. This is also the approach casync has taken.
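
For a rough, hypothetical sense of scale: a 200 MiB NAR split at a 128 KiB average yields about 200 × 1024 / 128 = 1600 chunks to fetch on reassembly, whereas fetching per-file chunks for a path with thousands of small files would mean at least one request per file, before any chunking within larger files.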

You may have heard that the Tvix store protocol chunks individual files instead of the NAR. The design of Attic is driven by the desire to effectively utilize existing platforms with practical limitations[^2], while looking forward to the future.

Why not just erase store path references?

At first glance, erasing store path references and storing them separately seems easy, but it's actually difficult to do in practice. It makes NAR assembly harder (no longer a simple concatenation of chunks), and the immediate benefits of doing so alone are minimal. Many files differ in other ways as well, like a minor version upgrade or a changed .note.gnu.build-id. The files this approach does work for (mostly small paths like configurations) don't have much overhead to begin with. I therefore opted to start with the approach that provides considerable gains and is simple to implement.
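
To make the added difficulty concrete, here is a hypothetical sketch of what assembly would look like if references were erased at upload time; plain concatenation no longer suffices, because each recorded reference must be patched back in (all names here are illustrative):

```rust
// Hypothetical sketch: reassembling a NAR whose store references were
// zeroed at upload time. Each recorded (offset, original hash) must be
// written back over the zeroed bytes, and an offset may even straddle
// a chunk boundary.
fn assemble_with_refs(chunks: &[Vec<u8>], refs: &[(usize, Vec<u8>)]) -> Vec<u8> {
    // The simple part: concatenate the chunks in order.
    let mut nar: Vec<u8> = chunks.concat();
    // The complication: patch the true references back in.
    for (offset, hash) in refs {
        nar[*offset..*offset + hash.len()].copy_from_slice(hash);
    }
    nar
}
```

Done naively as above, this requires buffering the whole NAR; a streaming assembler would instead have to track which patches intersect the chunk currently being emitted, which is exactly the kind of complexity plain chunk concatenation avoids.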

What happens if a chunk is corrupt/missing?

When a chunk is deleted from the database, all dependent .narinfo and .nar files become unavailable (503). However, this is recovered from automatically when any NAR containing the chunk is uploaded.

At the moment, Attic cannot automatically detect when a chunk is corrupt or missing. Correctly distinguishing between transient and persistent failures is difficult. The atticadm utility will have the functionality to kill/delete bad chunks.

[^1]: Apologies for the big code dump commit - the commit history is pretty chaotic in the private WIP branch. It basically remodelled the entire storage model.

[^2]: In more concrete terms, I want to use Cloudflare Workers to reassemble NARs for the sweet, sweet free egress :smiley:

zhaofengli avatar Jan 15 '23 22:01 zhaofengli

Wouldn't it be simpler to put each store reference into its own chunk? Doing so would only add complexity to the chunker, but not to the assembler.

shimunn avatar Dec 17 '23 13:12 shimunn