lakeFS
lakeFS copied to clipboard
Merge optimization by fast forwarding ranges - Step 2
The merge operation receives two metaranges source-metarange destination-metarange and finds the base-metarange.
The merge operation creates a new metarange which is done by a three-way diff.
The merge operation is currently optimized in a few ways ( showed in the example ) by writing ranges without the need to read and create new ranges
The previous optimization (#2710 ) handles only cases where all three ranges ( source, destination, and base ) have the exact same bounds.
This optimization should also support the case of same bounds with multiple ranges.
Example
Let's consider the following metaranges:
base-metarange
| range_id | min_key | max_key | count |
|---|---|---|---|
| 1 | path_a/ | path_a/z | 50 |
| 2 | path_b/ | path_b/z | 50 |
| 3 | path_c/ | path_c/z | 50 |
| 7 | path_d/ | path_d/z | 50 |
destination-metarange
| range_id | min_key | max_key | count |
|---|---|---|---|
| 1 | path_a/ | path_a/z | 50 |
| 4 | path_b/ | path_b/z | 50 |
| 3 | path_c/ | path_c/z | 50 |
| 8 | path_d/ | path_d/e | 30 |
| 9 | path_d/ | path_d/z | 30 |
source-metarange
| range_id | min_key | max_key | count |
|---|---|---|---|
| 1 | path_a/ | path_a/z | 50 |
| 5 | path_b/ | path_b/z | 55 |
| 6 | path_c/ | path_c/z | 55 |
| 7 | path_d/ | path_d/z | 50 |
| 10 | path_f/ | path_g/z | 55 |
When creating the merge-metarange
- The merge operation will write
range 1tomerge-metarangewithout the need to read and compare - The merge operation will read
range 2,range 4andrange 5and create a new range (or ranges) containing the merged objects - The merge operation will write
range6tomerge-metarangewithout the need to read and compare ( fast forward optimization step 1) - The merge operation will read
range 7,range 8, andrange9and create new ranges containing the merged objects - The merge operation will write
range10to themerge-metarange.
Step 1 ,2,3 and 5 are optimized.
Suggestion
Currently, the optimization works only on single ranges (i.e one for source, one for destination, and one for base )
This step should also handle the case of same bounds for a few consecutive ranges.
In the example, we would optimize step 4 by writing range 8 and range 9 without the need to read the ranges.
Note - this step should only optimize the case that one side (source or destination) is a single range and the other is a set of consecutive ranges with the same bounds. The case of consecutive ranges on both sides will not be handled as part of this issue.
After checking on some cases, we should handle the case of consecutive ranges on both sides, we wont benefit much from adding only the first. This optimization should be done together with the case of consecutive ranges on both sides. In order to complete this issue, we should first handle #2784
This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.