lakeFS icon indicating copy to clipboard operation
lakeFS copied to clipboard

Merge optimization by fast forwarding ranges - Step 2

Open guy-har opened this issue 2 years ago • 2 comments

The merge operation receives two metaranges source-metarange destination-metarange and finds the base-metarange. The merge operation creates a new metarange which is done by a three-way diff.

The merge operation is currently optimized in a few ways ( showed in the example ) by writing ranges without the need to read and create new ranges

The previous optimization (#2710 ) handles only cases where all three ranges ( source, destination, and base ) have the exact same bounds.

This optimization should also support the case of same bounds with multiple ranges.

Example

Let's consider the following metaranges:

base-metarange

range_id min_key max_key count
1 path_a/ path_a/z 50
2 path_b/ path_b/z 50
3 path_c/ path_c/z 50
7 path_d/ path_d/z 50

destination-metarange

range_id min_key max_key count
1 path_a/ path_a/z 50
4 path_b/ path_b/z 50
3 path_c/ path_c/z 50
8 path_d/ path_d/e 30
9 path_d/ path_d/z 30

source-metarange

range_id min_key max_key count
1 path_a/ path_a/z 50
5 path_b/ path_b/z 55
6 path_c/ path_c/z 55
7 path_d/ path_d/z 50
10 path_f/ path_g/z 55

When creating the merge-metarange

  1. The merge operation will write range 1 to merge-metarange without the need to read and compare
  2. The merge operation will read range 2, range 4 and range 5 and create a new range (or ranges) containing the merged objects
  3. The merge operation will write range6 to merge-metarange without the need to read and compare ( fast forward optimization step 1)
  4. The merge operation will read range 7, range 8, and range9 and create new ranges containing the merged objects
  5. The merge operation will write range10 to the merge-metarange.

Step 1 ,2,3 and 5 are optimized.

Suggestion

Currently, the optimization works only on single ranges (i.e one for source, one for destination, and one for base ) This step should also handle the case of same bounds for a few consecutive ranges.

In the example, we would optimize step 4 by writing range 8 and range 9 without the need to read the ranges.

Note - this step should only optimize the case that one side (source or destination) is a single range and the other is a set of consecutive ranges with the same bounds. The case of consecutive ranges on both sides will not be handled as part of this issue.

guy-har avatar Dec 14 '21 11:12 guy-har