lakeFS
lakeFS copied to clipboard
Merge optimization by fast forwarding ranges - Step 2
The merge operation receives two metaranges source-metarange
destination-metarange
and finds the base-metarange
.
The merge operation creates a new metarange which is done by a three-way diff.
The merge operation is currently optimized in a few ways ( showed in the example ) by writing ranges without the need to read and create new ranges
The previous optimization (#2710 ) handles only cases where all three ranges ( source, destination, and base ) have the exact same bounds.
This optimization should also support the case of same bounds with multiple ranges.
Example
Let's consider the following metaranges:
base-metarange
range_id | min_key | max_key | count |
---|---|---|---|
1 | path_a/ | path_a/z | 50 |
2 | path_b/ | path_b/z | 50 |
3 | path_c/ | path_c/z | 50 |
7 | path_d/ | path_d/z | 50 |
destination-metarange
range_id | min_key | max_key | count |
---|---|---|---|
1 | path_a/ | path_a/z | 50 |
4 | path_b/ | path_b/z | 50 |
3 | path_c/ | path_c/z | 50 |
8 | path_d/ | path_d/e | 30 |
9 | path_d/ | path_d/z | 30 |
source-metarange
range_id | min_key | max_key | count |
---|---|---|---|
1 | path_a/ | path_a/z | 50 |
5 | path_b/ | path_b/z | 55 |
6 | path_c/ | path_c/z | 55 |
7 | path_d/ | path_d/z | 50 |
10 | path_f/ | path_g/z | 55 |
When creating the merge-metarange
- The merge operation will write
range 1
tomerge-metarange
without the need to read and compare - The merge operation will read
range 2
,range 4
andrange 5
and create a new range (or ranges) containing the merged objects - The merge operation will write
range6
tomerge-metarange
without the need to read and compare ( fast forward optimization step 1) - The merge operation will read
range 7
,range 8
, andrange9
and create new ranges containing the merged objects - The merge operation will write
range10
to themerge-metarange
.
Step 1 ,2,3 and 5 are optimized.
Suggestion
Currently, the optimization works only on single ranges (i.e one for source
, one for destination
, and one for base
)
This step should also handle the case of same bounds for a few consecutive ranges.
In the example, we would optimize step 4 by writing range 8
and range 9
without the need to read the ranges.
Note - this step should only optimize the case that one side (source or destination) is a single range and the other is a set of consecutive ranges with the same bounds. The case of consecutive ranges on both sides will not be handled as part of this issue.