Support runtime chunk deduplication

Open jiangliu opened this issue 2 years ago • 4 comments

Details

This PR enhances nydusd to support runtime chunk deduplication. It works in this way:

  1. Use a sqlite database to record information about the decompressed/plaintext chunks available on the local node.
  2. When a chunk is not ready in the uncompressed data blob file, query the sqlite database for a chunk with the same chunk digest. If one exists, copy the decompressed chunk data from the source data blob file to the target data blob file by using copy_file_range().
  3. Otherwise, download the compressed chunk from the remote, decompress it, write it to the target data blob file, and add a record for the chunk to the database.
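
The three steps above can be sketched roughly as follows. This is only an illustration in Python, not the nydusd implementation: the `chunks` table schema, the `open_db` and `ensure_chunk` names, and the `download` callback are all hypothetical, and plain `pread`/`pwrite` stands in for copy_file_range() to keep the sketch portable.

```python
import os
import sqlite3


def open_db(path=":memory:"):
    # Hypothetical schema: one row per decompressed chunk available locally.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "digest TEXT PRIMARY KEY, blob_path TEXT, offset INTEGER, size INTEGER)"
    )
    return db


def _write_at(path, data, offset):
    # Write `data` into the (possibly new) blob file at `offset`.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, data, offset)
    finally:
        os.close(fd)


def ensure_chunk(db, digest, target_path, offset, size, download):
    """Fill target_path[offset:offset+size]; return "dedup" or "download"."""
    row = db.execute(
        "SELECT blob_path, offset FROM chunks WHERE digest = ? AND size = ?",
        (digest, size),
    ).fetchone()
    if row is not None:
        # Step 2: the chunk already exists locally, so copy it between blobs.
        # (The PR uses copy_file_range() here; this sketch reads and writes.)
        src_path, src_offset = row
        with open(src_path, "rb") as src:
            data = os.pread(src.fileno(), size, src_offset)
        _write_at(target_path, data, offset)
        return "dedup"
    # Step 3: fetch the chunk (assumed already decompressed by `download`),
    # store it, then record it so later lookups can reuse it.
    data = download(digest)
    _write_at(target_path, data, offset)
    db.execute(
        "INSERT OR IGNORE INTO chunks VALUES (?, ?, ?, ?)",
        (digest, target_path, offset, size),
    )
    db.commit()
    return "download"
```

In this sketch the record is inserted only after the chunk data has been written, so a lookup never points at a chunk that is not yet on disk.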

So there are two types of chunk deduplication:

  1. Saving network bandwidth: when a chunk is already available on the local node, the compressed chunk data does not need to be downloaded from the remote.
  2. Saving local disk space, if the underlying filesystem supports reference (reflink). If the filesystem storing the data blob files supports reference, copy_file_range() can share the data by reference instead of copying it, thus reducing local storage consumption.
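
For illustration, a range copy built on copy_file_range() might look like the sketch below, written in Python with `os.copy_file_range` (Linux only). The `copy_range` helper and its fallback path are assumptions made for this example, not code from the PR. Whether the kernel shares extents or copies bytes depends on the filesystem, and the caller cannot tell which happened from the return value.

```python
import os


def copy_range(src_fd, dst_fd, src_off, dst_off, length):
    """Copy up to `length` bytes between two open files; return bytes copied."""
    copied = 0
    use_cfr = hasattr(os, "copy_file_range")
    while copied < length:
        remaining = length - copied
        if use_cfr:
            try:
                # The kernel may satisfy this by reference on filesystems
                # that support it, instead of duplicating the data.
                n = os.copy_file_range(
                    src_fd, dst_fd, remaining, src_off + copied, dst_off + copied
                )
            except OSError:
                # For example unsupported or cross-device: fall back below.
                use_cfr = False
                continue
        else:
            # Portable fallback: an ordinary data copy through user space.
            buf = os.pread(src_fd, min(remaining, 1 << 20), src_off + copied)
            n = os.pwrite(dst_fd, buf, dst_off + copied) if buf else 0
        if n == 0:
            break  # the source ended before `length` bytes were available
        copied += n
    return copied
```

The loop is needed because copy_file_range() may copy fewer bytes than requested in a single call.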

Types of changes

What types of changes does your pull request introduce? Put an x in all the boxes that apply:

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation Update (if none of the other choices apply)

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • [x] I have updated the documentation accordingly.
  • [x] I have added tests to cover my changes.

jiangliu avatar Dec 07 '23 02:12 jiangliu

Codecov Report

Merging #1507 (7d287c9) into master (06755fe) will increase coverage by 0.02%. The diff coverage is 66.51%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1507      +/-   ##
==========================================
+ Coverage   62.72%   62.74%   +0.02%     
==========================================
  Files         129      129              
  Lines       44153    44360     +207     
  Branches    44153    44360     +207     
==========================================
+ Hits        27695    27834     +139     
- Misses      15087    15144      +57     
- Partials     1371     1382      +11     
| Files | Coverage Δ |
|---|---|
| storage/src/cache/dedup/db.rs | 79.09% <100.00%> (+0.08%) ↑ |
| storage/src/cache/mod.rs | 57.84% <ø> (ø) |
| utils/src/digest.rs | 91.53% <0.00%> (-0.53%) ↓ |
| storage/src/cache/filecache/mod.rs | 67.58% <66.66%> (+0.08%) ↑ |
| storage/src/cache/fscache/mod.rs | 75.92% <63.63%> (-0.47%) ↓ |
| storage/src/utils.rs | 93.59% <78.94%> (-2.30%) ↓ |
| storage/src/cache/cachedfile.rs | 33.14% <0.00%> (-0.44%) ↓ |
| src/bin/nydusd/main.rs | 0.18% <0.00%> (-0.01%) ↓ |
| storage/src/cache/dedup/mod.rs | 72.72% <82.75%> (+72.72%) ↑ |

... and 1 file with indirect coverage changes

codecov[bot] avatar Dec 07 '23 05:12 codecov[bot]

Hi all, I tried out this feature and it seems to work as expected. Is there something preventing it from being merged?

jonoirwinrsa avatar May 28 '24 18:05 jonoirwinrsa

> Hi all, I tried out this feature and it seems to work as expected. Is there something preventing it from being merged?

cc @jiangliu, any updates so we can continue? :)

imeoer avatar May 29 '24 01:05 imeoer