Content aware merge operations
Merge operations in Nessie "only" copy the commits made since the common ancestor from one reference onto another. Nessie itself does not interpret the meaning of the contents in the commits. While the Nessie merge operation is technically correct and works as designed, it prevents multiple, "nested" merge operations.
Example:
CREATE TABLE foo...;
-- User 1
CREATE BRANCH branch_a;
INSERT INTO foo ('abc');
-- User 2
CREATE BRANCH branch_b;
INSERT INTO foo ('def');
-- User 1
MERGE branch_a INTO main;
SELECT * FROM foo ... ; -- returns 'abc'
-- User 2
MERGE branch_b INTO main; -- CONFLICT
(Note: the above behavior is true for all Nessie versions)
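The conflict in the example can be illustrated with a toy model. It assumes that conflict detection works per content key (here the table name `foo`) and only compares which keys each side changed since the common ancestor, without looking at the contents. `KeyLevelMerge` and `conflictingKeys` are hypothetical names for illustration, not Nessie APIs.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model, not Nessie code: a merge conflicts when source and target
// both changed the same content key since their common ancestor.
final class KeyLevelMerge {
    /** Returns the keys that were changed on both sides since the common ancestor. */
    static Set<String> conflictingKeys(Set<String> changedOnSource, Set<String> changedOnTarget) {
        Set<String> conflicts = new HashSet<>(changedOnSource);
        conflicts.retainAll(changedOnTarget);
        return conflicts;
    }
}
```

In this model, merging `branch_a` finds no conflict because `main` has not changed `foo` yet. Merging `branch_b` afterwards reports `foo`, because `main` now carries the change from `branch_a`, even though both inserts could logically coexist.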
I think we need a "Nessie aware merge operation" in Iceberg itself that properly
- cherry-picks the Iceberg snapshots to be merged onto the target branch
- updates the current schema in the target branch, even if the source reference does not update any data
From some investigation, Iceberg already contains the code for the building blocks:
- `SchemaUpdate.unionByNameWith()` can merge two `Schema` objects
- `SnapshotManager.cherrypick(long snapshotId)` to cherry-pick one snapshot
- `Schema.sameSchema()` can compare two schema objects (semantically equivalent)
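To make the "union by name" idea concrete, here is a minimal sketch that uses a plain map (column name to type name) as a stand-in for a schema. It is not the Iceberg implementation: `SchemaUnionSketch` is a hypothetical helper, and rejecting every type mismatch is a simplifying assumption of this sketch, not a statement about what Iceberg does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for a schema: column name -> type name, in column order.
final class SchemaUnionSketch {
    /** Adds columns that exist only in source to a copy of target; rejects type mismatches. */
    static Map<String, String> unionByName(Map<String, String> target, Map<String, String> source) {
        Map<String, String> merged = new LinkedHashMap<>(target);
        for (Map.Entry<String, String> column : source.entrySet()) {
            // putIfAbsent returns the existing type if the column is already present
            String existingType = merged.putIfAbsent(column.getKey(), column.getValue());
            if (existingType != null && !existingType.equals(column.getValue())) {
                throw new IllegalArgumentException("Incompatible types for column " + column.getKey());
            }
        }
        return merged;
    }
}
```

With this model, a branch that only added a column can have its schema folded into the target schema without touching any data, which is the second requirement listed above.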
What's missing:
- Functionality to cherry-pick snapshots from a different Iceberg `Table`
- Functionality to perform all the schema updates and cherry-picks in a single commit (like a single call to `PendingUpdate.commit()` leading to a single Nessie commit)
Unclear, whether we have to cherry-pick all Iceberg snapshots since the common ancestor or whether it's sufficient to just cherry-pick the most recent Iceberg snapshot (and the recent schema). Technically, the most recent Iceberg snapshot (and current schema) should be sufficient. But without the intermediate snapshots the change history provided by the Iceberg snapshots would be lost or become incomplete.
The "snapshot log" in `TableMetadata` must stay consistent. Not sure whether `SnapshotManager.cherrypick(long snapshotId)` already takes care of that.
I also think that the functionality to do the above is not purely related to Nessie; it does not even have to touch Nessie code in Iceberg. It is, strictly speaking, "just" Iceberg functionality that produces a new `TableMetadata`, which then gets committed via the `NessieTableOperations`.
We can probably implement it as an Iceberg procedure next to `CherrypickSnapshotProcedure` for Spark 3.x.
The same mechanism should also be implemented for Delta Lake, but better as a separate issue / PR.
Since we are talking about intelligent merge operations / content manipulation, would it make sense to support multiple parents in Nessie commits (like in git)?
With a content-aware merge, I guess the contents on the base branch may have non-trivial differences from both old base contents and the contents being merged. Therefore, it might be valuable to preserve the lineage of changes (unless the merge is fast-forward).
A few observations:
- The merge isn't handled correctly if any of the following operations happen on the target branch after the fork: (a) update, (b) delete, (c) rewrites (compaction/sorting), (d) partition spec changes. Maybe we should put guardrails in place to avoid these situations. IMO, for (d) it should be possible to extend the logic to handle partition spec changes.
- For transplant, the same conditions would apply even in the source branch.
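One possible shape for such a guardrail is sketched below. It assumes the caller has already collected the operation type of every snapshot made on the target branch since the fork, using the operation strings Iceberg records in snapshot summaries (`append`, `overwrite`, `delete`, `replace`), plus the partition spec ID at the fork and now. `MergeGuardrail` and its parameters are hypothetical, and mapping cases (a) to (c) onto "anything other than append" is an assumption of this sketch.

```java
import java.util.List;

// Hypothetical pre-merge check: only plain appends on the target branch are considered safe.
final class MergeGuardrail {
    static boolean isSafeToMerge(List<String> targetOperationsSinceFork,
                                 int specIdAtFork,
                                 int currentTargetSpecId) {
        // Case (d): the partition spec changed on the target branch after the fork.
        if (specIdAtFork != currentTargetSpecId) {
            return false;
        }
        // Cases (a)-(c): updates, deletes and rewrites show up as non-append operations.
        for (String operation : targetOperationsSinceFork) {
            if (!"append".equals(operation)) {
                return false;
            }
        }
        return true;
    }
}
```

For a transplant, the same check would additionally be run against the operations on the source branch.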
> Unclear, whether we have to cherry-pick all Iceberg snapshots since the common ancestor or whether it's sufficient to just cherry-pick the most recent Iceberg snapshot (and the recent schema). Technically, the most recent Iceberg snapshot (and current schema) should be sufficient. But without the intermediate snapshots the change history provided by the Iceberg snapshots would be lost or become incomplete.
`SnapshotManager.cherrypick()` handles this via the delta of added/deleted data files, but ignores existing data files, assuming they're unchanged. For the merge case, we'll need an aggregated view of added/deleted data files from all the snapshots since the point of fork. The two approaches are to cherry-pick the snapshots one by one, or to aggregate them and merge in a single snapshot. In Delta Lake too, each commit log file maintains only the delta.
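The aggregation approach can be sketched as a fold over the per-snapshot deltas, oldest first. This is an illustrative model using file paths as plain strings; `FileDeltaAggregator` is a hypothetical class, not part of Iceberg or Delta Lake. A file that was added after the fork and deleted again by a later snapshot cancels out, while a deleted file that was never added since the fork must have existed before it and is recorded as a net delete.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical aggregator for the net file delta of all snapshots since the fork.
final class FileDeltaAggregator {
    final Set<String> added = new LinkedHashSet<>();
    final Set<String> deleted = new LinkedHashSet<>();

    /** Folds one snapshot's delta into the running net delta; call in snapshot order. */
    void apply(List<String> addedFiles, List<String> deletedFiles) {
        added.addAll(addedFiles);
        for (String file : deletedFiles) {
            // Interim cancellation: the file was added after the fork, so the
            // target branch never needs to see it.
            if (!added.remove(file)) {
                // Otherwise the file predates the fork and must be deleted on the target.
                deleted.add(file);
            }
        }
    }
}
```

For example, two appends followed by a compaction that rewrites both appended files into one leaves only the compacted file in `added`. Note that both sets have to be held in memory until the last snapshot has been processed.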
Some thoughts in favour of cherry-picking each commit -
- Within Nessie merge, we create a copy for each commit to the target branch. The content linked to the copied merge commit should ideally represent a merged snapshot at each hash.
- Aggregation of added/deleted files across snapshots would require more memory as the deleted files have to be stored to match for interim cancellations.
- Some level of history will be retained.
- Aggregation has to be written separately for both Delta Lake and Iceberg.
Cons of cherry-picking each commit -
- Merge will become a multi-step operation end to end (starting from the Iceberg client).
- It'll create more commits.
#6631 adds the Nessie-side support for this.