
Use ultracdc instead of fastcdc

Open leodziki opened this issue 1 year ago • 13 comments

UltraCDC outperforms FastCDC in speed and deduplication

Description

UltraCDC improves deduplication efficiency by choosing better chunk boundaries, which reduces storage needs, and its boundary detection is faster than FastCDC's. Its adaptability to different data types and efficient chunk-size selection further improve scalability and overall effectiveness.

Fixes https://github.com/TraceMachina/nativelink/issues/692

Type of change

  • [x] New feature (non-breaking change which adds functionality)
  • [x] This change requires a documentation update

How Has This Been Tested?

cargo test --all passes as expected; bazel test //... succeeds as well.

Checklist

  • [x] Updated documentation if needed
  • [x] Tests added/amended
  • [x] bazel test //... passes locally
  • [ ] PR is contained in a single commit, using `git commit --amend` (see docs)


leodziki avatar Aug 29 '24 08:08 leodziki

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Aug 29 '24 08:08 CLAassistant

  1. Why the Rename? The rename from fastcdc to ultracdc reflects the switch to the UltraCDC core algorithm and the speed enhancements it brings.

  2. Why change the rust library this is based on to a go library? There is no existing Rust library for UltraCDC. UltraCDC is an algorithm from some of the FastCDC authors that seemed worth investigating; a Go implementation is available at https://github.com/PlakarLabs/go-cdc-chunkers/blob/main/chunkers/ultracdc/ultracdc.go. After careful analysis, that implementation was translated into a Rust-based library for this PR.

  3. What performance is this improving? Efficiency and speed: the transition could improve processing speed, memory usage, or scalability. Optimization: ultracdc can include optimizations that weren't available or possible in fastcdc, such as better chunking algorithms, more efficient data handling, or improvements in how data is processed and transferred.

leodziki avatar Aug 29 '24 19:08 leodziki
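For background, the core of any content-defined chunking scheme is a hash-and-mask boundary test. A minimal sketch in Rust (the function name and the multiplicative hash are illustrative assumptions, not the PR's actual API):

```rust
// Hypothetical sketch of content-defined chunking: hash each byte, cut when
// the hash masked by `mask` is zero. The hash here is a simple
// multiplicative checksum for illustration only.
fn find_boundary(data: &[u8], min_size: usize, mask: u64) -> usize {
    let mut hash: u64 = 0;
    for (i, &b) in data.iter().enumerate() {
        hash = hash.wrapping_mul(31).wrapping_add(u64::from(b));
        if i >= min_size && (hash & mask) == 0 {
            return i + 1; // cut point just after byte i
        }
    }
    data.len() // no boundary found: the whole input is one chunk
}
```

A denser mask (more bits set) makes cuts rarer and chunks larger; the algorithms differ mainly in the hash function and in how the mask is chosen.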

I hope the UltraCDC implementation can be reviewed soon. cc: @allada, @MarcusSorealheis

leodziki avatar Aug 31 '24 15:08 leodziki

Removed fastcdc.rs to better compare differences with the newly implemented ultracdc algorithm in this PR. The main difference in this UltraCDC implementation is the use of a rolling hash function to determine chunk boundaries. The rolling hash value is compared against a dynamically selected mask to decide when to split the chunk. This approach allows for more flexible and potentially more accurate chunking compared to the simpler hash-based method used in FastCDC. cc: @allada, @MarcusSorealheis

leodziki avatar Sep 02 '24 12:09 leodziki
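The "dynamically selected mask" idea can be sketched as normalized chunking: a stricter mask before the target chunk size makes cuts rarer, and a looser mask after it makes them likelier. The mask constants below are made-up example values, not the PR's:

```rust
// Sketch of dynamic mask selection. More bits set means a harder test
// (fewer cuts); fewer bits set means an easier test (more cuts).
const MASK_STRICT: u64 = 0x0000_d903_5303_0000; // used before target size
const MASK_LOOSE: u64 = 0x0000_d900_0303_0000; // used after target size

fn select_mask(pos: usize, target_size: usize) -> u64 {
    if pos < target_size { MASK_STRICT } else { MASK_LOOSE }
}
```

This biases chunk sizes toward the target without a hard cutoff, which is what makes the chunking "more flexible" than a single fixed mask.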

nativelink-util/src/ultracdc.rs line 84 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: This should take &[u8] instead of BytesMut.

Done

leodziki avatar Sep 07 '24 20:09 leodziki
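The reasoning behind the `&[u8]` nit: a slice parameter keeps the function agnostic to the caller's buffer type, since `Vec<u8>`, arrays, and `bytes::BytesMut` all deref to a byte slice. A hypothetical helper (not the PR's actual function):

```rust
// Taking &[u8] instead of a concrete buffer type means callers holding a
// Vec<u8>, an array, or a BytesMut can all pass their data without
// conversion or copying.
fn window_sum(window: &[u8]) -> u64 {
    window.iter().map(|&b| u64::from(b)).sum()
}
```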

nativelink-util/src/ultracdc.rs line 88 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

it looks like this is vulnerable to overflows. Can we instead use one of the safe functions like .saturating_add, .overflowing_add or similar.

Done.

leodziki avatar Sep 07 '24 20:09 leodziki
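For context on the overflow nit: plain `+` on fixed-width integers panics on overflow in debug builds and wraps silently in release, whereas the explicit variants state the intended behavior. A small demonstration:

```rust
// The explicit arithmetic variants make overflow behavior part of the code:
fn overflow_demo() -> (u8, u8, (u8, bool)) {
    let a: u8 = 250;
    (
        a.saturating_add(10),  // clamps at u8::MAX (255)
        a.wrapping_add(10),    // wraps around modulo 256, giving 4
        a.overflowing_add(10), // wrapped value plus a did-overflow flag
    )
}
```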

nativelink-util/src/ultracdc.rs line 115 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

this appears to be unused now.

Done.

leodziki avatar Sep 07 '24 20:09 leodziki

nativelink-util/src/ultracdc.rs line 74 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Make these u64 and do the casting here so we don't pay for casting below.

Done

leodziki avatar Sep 08 '24 06:09 leodziki
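The casting nit can be illustrated: convert `usize` parameters to `u64` once at the function boundary so the hot loop compares like-typed values with no per-iteration casts. A hypothetical sketch, not the PR's code:

```rust
// Hoist usize -> u64 casts out of the hot loop. `hashes` stands in for a
// per-position hash stream; the loop body only touches u64 values.
fn first_cut(len: usize, min_size: usize, mask: u64, hashes: &[u64]) -> u64 {
    let len = len as u64;           // cast once here...
    let min_size = min_size as u64;
    let mut pos: u64 = 0;
    for &h in hashes {
        pos += 1;
        if pos >= min_size && (h & mask) == 0 {
            return pos;             // ...never inside the loop
        }
    }
    len
}
```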

nativelink-util/src/ultracdc.rs line 129 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: use a slice instead, so we don't need to copy the data.

Done

leodziki avatar Sep 08 '24 06:09 leodziki
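The slice-vs-copy distinction in that nit: borrowing a subrange is O(1) and allocation-free, while `.to_vec()` on the same range allocates and copies every byte. An illustrative helper:

```rust
// Returns a borrowed view of `len` bytes starting at `start`; no bytes are
// copied. (`data[start..start + len].to_vec()` would copy them all.)
fn window(data: &[u8], start: usize, len: usize) -> &[u8] {
    &data[start..start + len]
}
```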

nativelink-util/src/ultracdc.rs line 113 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: We don't want to copy out data, instead use the .split_* functions to create the dynamic slices and return them.

Done

leodziki avatar Sep 08 '24 07:09 leodziki
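The `.split_*` pattern avoids copying chunk data out of the buffer: `BytesMut::split_to` in the bytes crate hands back the first n bytes and leaves the remainder in place. The same idea with std types, as a sketch rather than the PR's code:

```rust
// Detach the first `n` bytes of `buf` without copying the remainder out:
// split_off leaves the first n bytes in `buf` and returns the rest, then
// mem::replace swaps them so the caller gets the chunk and `buf` keeps the
// tail. Panics if n > buf.len(), as split_off does.
fn take_chunk(buf: &mut Vec<u8>, n: usize) -> Vec<u8> {
    let rest = buf.split_off(n);
    std::mem::replace(buf, rest)
}
```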

Previously, allada (Nathan (Blaise) Bruer) wrote…

I'm concerned because this appears to be significantly slower now (but I have not measured it). We are now computing a rolling hash on every byte, making the algorithm O(n * m), where n is the number of bytes and m is window_size. This function is already quite slow, and I'm not sure it is a good idea to make it significantly slower.

If we remove the copies and make it so the rolling hash can be updated instead of recomputed, it can probably be accepted.

I understand your concern about the potential performance impact of the current implementation. You're right that recomputing the rolling hash for every byte leads to a time complexity of O(n * m), where n is the number of bytes and m is the window size.

To optimize this, we can maintain the rolling hash incrementally, adjusting it as bytes enter and leave the window rather than recomputing it from scratch at every position. Each update is then constant-time, bringing the whole pass back to O(n).

leodziki avatar Sep 08 '24 07:09 leodziki
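The constant-time update described above can be sketched with a simple additive checksum (illustrative only; this is not UltraCDC's actual hash function): each step subtracts the byte leaving the window and adds the byte entering it, so the per-byte cost no longer depends on the window size.

```rust
// O(1) rolling update over a fixed-size window, using an additive checksum
// as a stand-in for the real rolling hash.
struct RollingSum {
    sum: u64,
}

impl RollingSum {
    // O(m) one-time initialization over the first window.
    fn new(window: &[u8]) -> Self {
        Self { sum: window.iter().map(|&b| u64::from(b)).sum() }
    }

    // O(1) per byte: remove the leaving byte, add the entering one.
    fn roll(&mut self, leaving: u8, entering: u8) {
        self.sum = self.sum - u64::from(leaving) + u64::from(entering);
    }
}
```

After initialization, sliding across n bytes costs n constant-time `roll` calls instead of n window-sized recomputations.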

The revision addressing the feedback is ready for review. cc: @allada, @MarcusSorealheis

leodziki avatar Sep 08 '24 07:09 leodziki

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 9 files reviewed, and 4 discussions need to be resolved

a discussion (no related file): I'm still failing to see why we should change this. From what I can tell, this adds a significant amount of overhead to the existing algorithm only to gain a dynamically selected mask. The existing algorithm is already a rolling checksum. Can you perform some measurements showing it doing better on any benchmark, along with its speed?

Thank you for sharing your thoughts. I understand your concerns about the potential overhead introduced by the changes, especially in the context of the existing rolling checksum algorithm. I will conduct some benchmarks to compare the performance of the current implementation against the proposed changes, focusing on both speed and efficiency. This data will help us make a more informed decision moving forward.

leodziki avatar Sep 17 '24 06:09 leodziki
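A minimal harness for the promised measurements could look like this (an assumed shape, not benchmark code from the PR): time a chunker over synthetic data and return both the chunk count and the elapsed time, covering the deduplication-granularity and speed axes that the reviewer asked about.

```rust
use std::time::Instant;

// Run `chunk` over `data`; return (chunk_count, elapsed_seconds) so that
// throughput and chunking granularity can be compared across chunkers
// (e.g. a fastcdc closure vs. an ultracdc closure over identical inputs).
fn bench<F: Fn(&[u8]) -> usize>(data: &[u8], chunk: F) -> (usize, f64) {
    let start = Instant::now();
    let chunks = chunk(data);
    (chunks, start.elapsed().as_secs_f64())
}
```

Running both implementations through the same `bench` over identical corpora would give directly comparable numbers for this discussion.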

I will conduct some benchmarks to compare the performance of the current implementation against the proposed changes, focusing on both speed and efficiency. This data will help us make a more informed decision moving forward.

Were you ever able to get the benchmark data for this?

palfrey avatar Oct 07 '25 11:10 palfrey