
Add a delta store (similar to `DedupStore`)


Inspiration is taken from how git packs achieve high compression rates with good random-access performance. The 2 key components are (1) a clustering strategy to store similar objects together and (2) a delta algorithm to encode objects predicted by an existing object. The initially proposed algorithms for analysis are:

  1. TLSH (Debian package)
  2. BSDIFF (Debian package)

There are crates that implement these algorithms:

  1. tlsh2
  2. qbsdiff
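
For a concrete feel of the delta component, here is a minimal sketch of producing and applying a patch with the qbsdiff crate; the helper names are illustrative, and the compression wrapper (e.g. LZ4) and store integration are left out.

```rust
use std::io;

use qbsdiff::{Bsdiff, Bspatch};

/// Encode `target` as a binary delta against a similar `source` object.
fn delta_encode(source: &[u8], target: &[u8]) -> io::Result<Vec<u8>> {
    let mut patch = Vec::new();
    Bsdiff::new(source, target).compare(io::Cursor::new(&mut patch))?;
    Ok(patch)
}

/// Reconstruct `target` from `source` and the stored delta.
fn delta_decode(source: &[u8], patch: &[u8]) -> io::Result<Vec<u8>> {
    let mut target = Vec::new();
    Bspatch::new(patch)?.apply(source, io::Cursor::new(&mut target))?;
    Ok(target)
}

fn main() -> io::Result<()> {
    let base = b"fn add(a: i32, b: i32) -> i32 { a + b }".to_vec();
    let edit = b"fn add(a: i64, b: i64) -> i64 { a + b }".to_vec();
    let patch = delta_encode(&base, &edit)?;
    assert_eq!(delta_decode(&base, &patch)?, edit);
    println!("target: {} bytes, patch: {} bytes", edit.len(), patch.len());
    Ok(())
}
```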

barrbrain avatar May 07 '24 03:05 barrbrain

Hey @barrbrain, thanks for the feedback.

Have you by chance looked at DedupStore?

I believe it is extremely close to what you are suggesting.

allada avatar May 07 '24 18:05 allada

Hi @allada, DedupStore is structurally similar to how I envision a delta store. However, the behavior is somewhat different. While DedupStore is effective for large extents repeated verbatim between objects, a delta store would be effective for large blocks containing many small changes with a consistent pattern. This fits source code changes quite well and binary object changes extremely well. This can boost compression efficiency by 2x~130x for applicable content.

barrbrain avatar May 08 '24 02:05 barrbrain

This type of compression is new to me, but would it look something like this?

  1. Compute TLSH hashes of inputs, store them.
  2. Assume that "similar inputs likely produce similar outputs" and group objects accordingly.
  3. When an action runs on inputs similar to a previous action (maybe it's the same command, just a file was edited), instead of storing outputs as an entirely new file, generate the (binary) diff between the new output and the old one and only store the diff.
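
As a sketch of how step 3 could slot together (the `StoredOutput` type and `store_output` helper are hypothetical, and the byte-wise `similarity_distance` is only a stand-in for a real locality-sensitive hash such as TLSH via tlsh2):

```rust
use std::io;

use qbsdiff::Bsdiff;

/// Hypothetical representation of what gets written to the CAS: either
/// the whole object, or a patch against a previously stored base object.
enum StoredOutput {
    Full(Vec<u8>),
    Delta { base_digest: String, patch: Vec<u8> },
}

/// Stand-in for a locality-sensitive distance (steps 1 and 2). A real
/// implementation would compare TLSH digests (e.g. via the tlsh2 crate);
/// this byte-wise proxy only keeps the sketch self-contained.
fn similarity_distance(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).filter(|(x, y)| x != y).count() as u32
        + a.len().abs_diff(b.len()) as u32
}

/// Step 3: if the most similar prior output is close enough and the patch
/// is genuinely smaller, store only the delta.
fn store_output(
    new_output: &[u8],
    nearest: Option<(&str, &[u8])>, // (digest, bytes) of the closest prior output
    max_distance: u32,
) -> io::Result<StoredOutput> {
    if let Some((base_digest, base)) = nearest {
        if similarity_distance(new_output, base) <= max_distance {
            let mut patch = Vec::new();
            Bsdiff::new(base, new_output).compare(io::Cursor::new(&mut patch))?;
            if patch.len() < new_output.len() {
                return Ok(StoredOutput::Delta {
                    base_digest: base_digest.to_string(),
                    patch,
                });
            }
        }
    }
    Ok(StoredOutput::Full(new_output.to_vec()))
}
```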

This seems like an interesting problem that could potentially reduce stored data significantly, or at least reduce the time it takes to pass incremental changes between workers (if a worker already has an artifact in a local memory store, it might only need to fetch a diff instead of an entirely new file).

Am I understanding correctly that the dedup store does something very similar, but delta compression is more "specialized" towards incremental builds? If so, it seems intuitive that this kind of compression could outperform the more general dedup approach.

cc @MarcusSorealheis It might be a very far reach, but would it theoretically be possible to offload similarity computations to a vector database like https://qdrant.tech/ or am I completely looking at the wrong thing here? :sweat_smile:

aaronmondal avatar May 08 '24 22:05 aaronmondal

@aaronmondal Thank you for the insightful link to Qdrant. This highlights the missing component in my proposal, an approximate nearest neighbors (ANN) algorithm. I note that there is an active Rust crate that implements hierarchical navigable small worlds (HNSW), hnsw_rs.
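
For the candidate-selection side, a first cut could be a plain linear scan over recently stored objects, with an HNSW index (e.g. hnsw_rs) swapped in once the corpus is large enough to need it; the shapes and the `distance` parameter below are illustrative:

```rust
/// Brute-force stand-in for the ANN component: find the prior object
/// whose content is closest to the query under some similarity distance
/// (e.g. TLSH). At scale, this linear scan is what an approximate index
/// such as HNSW would replace.
fn nearest_neighbour<'a>(
    query: &[u8],
    candidates: &'a [(String, Vec<u8>)], // (digest, object bytes)
    distance: impl Fn(&[u8], &[u8]) -> u32,
) -> Option<(&'a str, u32)> {
    candidates
        .iter()
        .map(|(digest, bytes)| (digest.as_str(), distance(query, bytes.as_slice())))
        .min_by_key(|(_, d)| *d)
}
```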

barrbrain avatar May 09 '24 05:05 barrbrain

Only a library is needed in this case, @aaronmondal, not a DB. It is a good thought.

MarcusSorealheis avatar May 09 '24 10:05 MarcusSorealheis

This is interesting. I've been thinking a bit about a very similar CS problem we have, in that I want to be better at finding workers that already have most of the assets and run jobs on them instead of LRU/MRU like we use now.

I started playing around with SimHash algorithms to see if they could solve my problem, but found they weren't quite what I was looking for in this case. I then looked into KD-trees, but found them to scale horribly, and moved on to ANN (approximate nearest neighbor); these worked great, but it felt like "bringing a tank to a knife fight". I eventually settled on a Bloom filter as the best fit for this problem.
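
For context, a minimal sketch of that Bloom-filter idea: each worker advertises a small filter over the digests it already holds, and the scheduler scores candidates by probable hits. The sizes, hash count, and helper names are illustrative, not what NativeLink does today.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter over content digests. False positives are
/// possible, false negatives are not, which is fine for scoring.
struct BloomFilter {
    bits: Vec<u64>,
    num_hashes: u32,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u32) -> Self {
        Self { bits: vec![0; (num_bits + 63) / 64], num_hashes }
    }

    /// Derive the k-th bit index for an item by hashing it with a seed.
    fn index(&self, item: &str, seed: u32) -> usize {
        let mut hasher = DefaultHasher::new();
        (seed, item).hash(&mut hasher);
        (hasher.finish() as usize) % (self.bits.len() * 64)
    }

    fn insert(&mut self, digest: &str) {
        for seed in 0..self.num_hashes {
            let i = self.index(digest, seed);
            self.bits[i / 64] |= 1u64 << (i % 64);
        }
    }

    fn probably_contains(&self, digest: &str) -> bool {
        (0..self.num_hashes).all(|seed| {
            let i = self.index(digest, seed);
            (self.bits[i / 64] & (1u64 << (i % 64))) != 0
        })
    }
}

/// Score a worker by how many of a job's input digests it likely holds.
fn score_worker(worker: &BloomFilter, inputs: &[&str]) -> usize {
    inputs.iter().filter(|d| worker.probably_contains(d)).count()
}
```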

In the problem you are describing, I feel SimHash and/or ANN might do very well.

allada avatar May 12 '24 19:05 allada

Notes from some offline analysis

I created a corpus to explore this issue by collecting a snapshot of a CAS with DedupStore defaults and no other filters. I built 4 consecutive tags of Chromium for Linux, so that there was a mix of objects shared verbatim between builds, unique objects, and similar objects.

To test how well hash distance predicts delta efficiency, I computed, for each approximate nearest neighbour, the size of the compressed delta and of the compressed raw object. Separating text and binary objects gives a clearer picture, but overall they have similar characteristics. A relatively low distance threshold is sufficient to capture most of the available compression gains. For this corpus, 9.5% of objects had neighbours within a TLSH distance of 14 and with efficient delta encodings. On average, these deltas were 92% smaller than baseline compression.

Although this only represents an 11.9% improvement in storage density overall, there is a 12-times improvement for incremental changes in content. Edit: updated numbers with BSDIFF4+LZ4 deltas.

barrbrain avatar May 14 '24 15:05 barrbrain

Here is an overview of the relationship between TLSH distance and delta efficiency over baseline compression for this corpus. The y-axis is log2(delta/baseline); the x-axis is TLSH distance between object pairs. [Figure: TLSH-256-QBSDIFF-LZ4]

Edit: Using the above model, I refined the delta selection constraints. For this corpus, 17.8% of objects had neighbours within a TLSH-256-1 distance of 104 and delta encodings at least 4 times as efficient as baseline compression. On average, these deltas were 94.1% smaller than baseline compression. Although this only represents a 15.6% improvement in storage density overall, there is a 16.8-times improvement for incremental changes in content.
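
Expressed as a predicate, that refined selection rule looks roughly like the sketch below; the distance bound (104) and efficiency factor (4x) are the corpus-specific values above and would presumably be tunable.

```rust
/// Candidate pairing of an object with its approximate nearest neighbour.
struct Candidate {
    tlsh_distance: u32,       // TLSH-256-1 distance to the neighbour
    delta_compressed: u64,    // size of the BSDIFF patch after compression
    baseline_compressed: u64, // size of the raw object after compression
}

/// Keep the delta only when the neighbour is within `max_distance` and the
/// delta is at least `min_efficiency` times smaller than baseline
/// compression (104 and 4 for this corpus).
fn use_delta(c: &Candidate, max_distance: u32, min_efficiency: u64) -> bool {
    c.tlsh_distance <= max_distance
        && c.delta_compressed * min_efficiency <= c.baseline_compressed
}
```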

barrbrain avatar May 23 '24 16:05 barrbrain

This is very interesting. The RBE working group is actively discussing implementing something very similar to DedupStore as part of the protocol itself: https://github.com/bazelbuild/remote-apis/pull/282

I'll discuss with some of the internal team to see what kind of complexities and gotchas this might introduce.

allada avatar May 23 '24 16:05 allada

In the problem you are describing, I feel SimHash and/or ANN might do very well.

In prior attempts at this issue, I have seen good results with a carefully constructed MinHash.

There is a straightforward reduction of TLSH to a cluster fingerprint: take a suffix of the code part and, for each 2-bit column, reduce it to 1 bit predicated on whether the value was in the top quartile.

This yields similar recall performance to HNSW with just a range query on the composite key.

An example from my corpus:

EC408104A021-T165B48E174216BCE0C8A859FE4627C6F005B0C957ACD378FEF6341AD6275ABADE808D47:
5f6e7fcec33b16a636b829d3a4408cf2d01c0404bea9457d7046b5eb22ade7f9-524288
EC408104A021-T12FB48E174216BCE0C8A859FE4627C6F005B0C957ACD378FEF5341AD6275ABADE808E47:
0052291bdaec9ba6429cde4ccfd52f0ba9c1b844ff69f53d861007b1dc770467-524288
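
A sketch of that reduction: a 12-byte suffix of the code part would yield 48-bit prefixes like those in the keys above, though the exact suffix length and bit ordering here are assumptions of the sketch, not taken from the thread.

```rust
/// Collapse a suffix of the TLSH code part into a 1-bit-per-column
/// cluster fingerprint. Each code-part byte packs four 2-bit bucket
/// codes (quartiles 0..=3); a fingerprint bit is set only when the
/// bucket fell in the top quartile (2-bit value == 3).
fn cluster_fingerprint(code_part: &[u8]) -> u64 {
    const SUFFIX_BYTES: usize = 12; // 12 bytes x 4 columns = 48 fingerprint bits
    let start = code_part.len().saturating_sub(SUFFIX_BYTES);
    let mut fingerprint: u64 = 0;
    for &byte in &code_part[start..] {
        for shift in [6u32, 4, 2, 0] {
            let bucket = (byte >> shift) & 0b11;
            fingerprint = (fingerprint << 1) | u64::from(bucket == 0b11);
        }
    }
    fingerprint
}
```

Keying objects as `{fingerprint}-{tlsh}`, as in the two entries above, then lets a single range scan over the composite key return a candidate cluster.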

Edit: This clustering method has a significant impact on the distribution: [Figure: TLSH-128-QBSDIFF-LZ4]

Applying a similar approach within DedupStore and running the same reference builds, I found that 13.5% of objects in the content store were delta-encoded and that there was a 13.3% improvement in storage density overall.

barrbrain avatar Jun 11 '24 20:06 barrbrain

I built a prototype and tested it with daily Chromium builds. It was able to achieve an overall ratio of 1:4 on compressible content. Ratios were 1:15 at best for small objects and 1:213 at best for large objects. I will propose a set of stores that may be composed to mimic git packing. In combination with DedupStore, they can mimic bup.

barrbrain avatar Oct 23 '24 03:10 barrbrain

Thank you for all the hard work here.

MarcusSorealheis avatar Oct 23 '24 04:10 MarcusSorealheis