delta icon indicating copy to clipboard operation
delta copied to clipboard

[Spark] Automatically add legacy features to table when setting legacy protocol versions

Open andreaschat-db opened this issue 10 months ago • 5 comments

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Currently, dropping the last feature from a table results in downgrading the table's protocol to the minimum required legacy versions. However, this can only happen if the legacy features of the table match exactly a legacy protocol version pair. This rule is placed to avoid accidentally upgrading a table by implicitly adding extra legacy features during downgrade. This might disallow version downgrade in tables created directly with table features.

For example, consider creating a table with Protocol(3, 7, RemovableReaderWriterFeature, ChangeDataFeed) and then dropping RemovableReaderWriterFeature. The resulting protocol will be Protocol(3, 7, ChangeDataFeed) instead of Protocol(1, 4). This is because (1, 4) includes legacy features not defined in the original protocol. If the user wishes to downgrade this table to legacy versions needs to set the legacy versions before downgrading the feature. For example:

CREATE TABLE x TBLPROPERTIES (
    delta.feature.RemovableReaderWriterFeature = 'supported',
    delta.feature.ChangeDataFeedTableFeature = 'supported')
ALTER TABLE x SET TBLPROPERTIES (
  'delta.minReaderVersion' = 1,
  'delta.minWriterVersion' = 4) 
ALTER TABLE x DROP FEATURE RemovableReaderWriterFeature

Now we also allow setting legacy protocol versions to a features table with only legacy features. That can help in scenarios where the user dropped the last table feature but did not set the legacy protocol versions in advance. For example:

CREATE TABLE x TBLPROPERTIES (
    delta.feature.RemovableReaderWriterFeature = 'supported',
    delta.feature.ChangeDataFeedTableFeature = 'supported')
// After this command the table will have Protocol(3, 7, ChangeDataFeedTableFeature).
ALTER TABLE x DROP FEATURE RemovableReaderWriterFeature
// After this command the table will have Protocol(1, 4).
ALTER TABLE x SET TBLPROPERTIES (
  'delta.minReaderVersion' = 1,
  'delta.minWriterVersion' = 4) 

Furthermore, we also relax the requirements so the user does not have to set the right versions. When setting any legacy protocol versions to a table features protocol, we automatically add any missing legacy features given the legacy features that already present in the protocol.

How was this patch tested?

Added new tests to DeltaProtocolVersionSuite.

Does this PR introduce any user-facing changes?

Yes. The user can now set legacy versions to a table with features to automatically add all relevant legacy features. Example provided in the description of the PR.

andreaschat-db avatar Apr 04 '24 12:04 andreaschat-db

One question. Consider the following command sequence:

CREATE TABLE x TBLPROPERTIES (
    delta.feature.RemovableReaderWriterFeature = 'supported',
    delta.feature.ChangeDataFeedTableFeature = 'supported')
ALTER TABLE x DROP FEATURE RemovableReaderWriterFeature
ALTER TABLE x SET TBLPROPERTIES (
  'delta.minReaderVersion' = 1,
  'delta.minWriterVersion' = 4) 

Should the table get (1,4) in the end? I think it should as we automatically add all legacy features in the 3rd command.

xupefei avatar Apr 04 '24 13:04 xupefei

One question. Consider the following command sequence:

CREATE TABLE x TBLPROPERTIES (
    delta.feature.RemovableReaderWriterFeature = 'supported',
    delta.feature.ChangeDataFeedTableFeature = 'supported')
ALTER TABLE x DROP FEATURE RemovableReaderWriterFeature
ALTER TABLE x SET TBLPROPERTIES (
  'delta.minReaderVersion' = 1,
  'delta.minWriterVersion' = 4) 

Should the table get (1,4) in the end? I think it should as we automatically add all legacy features in the 3rd command.

Very good point. Command 2 will result to Protocol(3, 7, ChangeDataFeedTableFeature). Then, command 3 will result to Protocol(3, 7, ChangeDataFeedTableFeature + rest features in (1, 4)). Dropping one of the existing legacy features cannot result in (1, 4) because the legacy features list won't match exactly. To get out of this and downgrade to (1, 4) the user will have to add a feature to the table and then drop it... I could address this in a separate PR in the future since there is a way out of it and it is a rare case, i.e., downgrade the protocol versions of a table with NO table features but with table feature versions.

andreaschat-db avatar Apr 04 '24 13:04 andreaschat-db

Asking the user to set delta.minReaderVersion and delta.minWriterVersion is not user-friendly. This means users need to understand how to map a feature to minReaderVersion/minWriterVersion.

zsxwing avatar Apr 05 '24 04:04 zsxwing

Asking the user to set delta.minReaderVersion and delta.minWriterVersion is not user-friendly. This means users need to understand how to map a feature to minReaderVersion/minWriterVersion.

True...but it's even worse if they have to do that and add each feature individually. At least this way they can go "ok, my connector supports (x, y), let me set this on the table and then downgrade.

larsk-db avatar Apr 05 '24 09:04 larsk-db

For me it seems the existing behavior you described is better:

“For example, consider creating a table with Protocol(3, 7, RemovableReaderWriterFeature, ChangeDataFeed) and then dropping RemovableReaderWriterFeature. The resulting protocol will be Protocol(3, 7, ChangeDataFeed) instead of Protocol(1, 4)”

Protocol 3,7 besides being newer has less requirements. Meaning if you “downgrade” to 1,4 you are making your table less compatible to clients that doesn’t support all the features from writer v4, but supports v7 + CDC

felipepessoto avatar Apr 07 '24 03:04 felipepessoto

There are spark master test failures:

https://github.com/delta-io/delta/actions/runs/9891350186/job/27321586977?pr=2848

[info] CreateCheckpointSuite:
[info] - commits containing adds and removes, and no previous checkpoint
[info] - commits containing adds, and no previous checkpoint
[info] - commits containing adds and removes, and a previous checkpoint created using Spark (actions/perfile): 1000000
[info] - commits containing adds and removes, and a previous checkpoint created using Spark (actions/perfile): 3
[info] - commits containing adds, and a previous checkpoint created using Spark (actions/perfile): 1000000
[info] - commits containing adds, and a previous checkpoint created using Spark (actions/perfile): 3
[info] - commits with metadata updates
[info] - commits with protocol updates *** FAILED ***
[info]   == Results ==
[info]   
[info]   == Expected Answer - 1 ==
[info]   ([2,2])
[info]   
[info]   == Result - 1 ==
[info]   ([1,2]) (TestUtils.scala:386)

scottsand-db avatar Jul 11 '24 16:07 scottsand-db

There are spark master test failures:

https://github.com/delta-io/delta/actions/runs/9891350186/job/27321586977?pr=2848

[info] CreateCheckpointSuite:
[info] - commits containing adds and removes, and no previous checkpoint
[info] - commits containing adds, and no previous checkpoint
[info] - commits containing adds and removes, and a previous checkpoint created using Spark (actions/perfile): 1000000
[info] - commits containing adds and removes, and a previous checkpoint created using Spark (actions/perfile): 3
[info] - commits containing adds, and a previous checkpoint created using Spark (actions/perfile): 1000000
[info] - commits containing adds, and a previous checkpoint created using Spark (actions/perfile): 3
[info] - commits with metadata updates
[info] - commits with protocol updates *** FAILED ***
[info]   == Results ==
[info]   
[info]   == Expected Answer - 1 ==
[info]   ([2,2])
[info]   
[info]   == Result - 1 ==
[info]   ([1,2]) (TestUtils.scala:386)

The PR is not ready to be merged. The particular failure is fixed by https://github.com/delta-io/delta/pull/3356.

andreaschat-db avatar Jul 11 '24 17:07 andreaschat-db