[BUG] Fix SQL API option/properties metadata configuration
Wait for https://github.com/delta-io/delta/issues/1182 first. This PR needs to use the same feature flag that is created for that issue.
Bug
Describe the problem
There are various inconsistencies in the SQL API when writing metadata using OPTIONS and TBLPROPERTIES.
Specifically
Case 1
when using TBLPROPERTIES, we incorrectly write properties that do not start with .delta. This happens using any of CREATE TABLE, REPLACE TABLE, CREATE OR REPLACE
Case 2
when using CREATE TABLE (not using AS SELECT) using path and using OPTIONS
- we incorrectly write options that do not start with
.delta - we incorrectly duplicate options with
option.$key
Steps to reproduce
Case 1
CREATE TABLE tbl (id INT) USING DELTA
TBLPROPERTIES('logRetentionDuration'='interval 60 days', 'delta.checkpointInterval'=20)
Case 2
CREATE TABLE tbl (id INT) USING DELTA
OPTIONS('dataSkippingNumIndexedCols'=33,'delta.deletedFileRetentionDuration'='interval 2 weeks')
LOCATION '/private/var/folders/mv/gj5n7hvn78n7td5pp1c_jgjh0000gp/T/spark-69b43bf6-b143-459f-a05b-751bdbf4308a'
Observed results
Case 1
The following is written out as table metadata
logRetentionDuration -> interval 60 days
delta.checkpointInterval -> 20
however what we really want to be written out is
delta.checkpointInterval -> 20
Case 2
The following is written out as table metadata
delta.deletedFileRetentionDuration -> interval 2 weeks
dataSkippingNumIndexedCols -> 33
option.delta.deletedFileRetentionDuration -> interval 2 weeks
option.dataSkippingNumIndexedCols -> 33
however what we really want to be written out is
delta.deletedFileRetentionDuration -> interval 2 weeks
Implementation Requirement
After you fix this issue, please update the test and test output table here: https://github.com/delta-io/delta/blob/master/core/src/test/scala/org/apache/spark/sql/delta/DeltaWriteConfigsSuite.scala#L316
Also, please add tests with the feature from #1182 flag enabled/disabled, as well as by reading an "older" table with these invalid delta properties (this should go into EvolvabilitySuite).
Hi!
I'm interested in implementing a fix for this bug. To be sure I understand the issue correctly, the result has to be a snapshot object that the metadata configuration only contains the parameters with delta prefix, the rest, nonprefixed and option.delta.{...} should be removed.
I've already started in a branch
@alfonsorr - you can read the table metadata by doing deltaLog.snapshot.metadata.configuration.
So, we expect the result (i.e. the table metadata) after this fix to look as is described in the issue description above.
e.g. for case 1, currently deltaLog.snapshot.metadata.configuration is returning
logRetentionDuration -> interval 60 days
delta.checkpointInterval -> 20
but instead we want it to only return
delta.checkpointInterval -> 20
Let me know if you have any more questions!
BTW, a similar fix for a similar problem has already been done. See https://github.com/delta-io/delta/issues/1182 and https://github.com/delta-io/delta/pull/1254
One thing, for parameters that are not delta related but are correct like in the test in the code:
ALTER TABLE $tableName SET TBLPROPERTIES ('comment'='does it work?')
What should be the behavior, I understand that the comment has to be added to the metadata, but what about deltaLog.snapshot.metadata.configuration
I just noticed that the description is not accurate. For example, today we will write
delta.deletedFileRetentionDuration -> interval 2 weeks
dataSkippingNumIndexedCols -> 33
option.delta.deletedFileRetentionDuration -> interval 2 weeks
option.dataSkippingNumIndexedCols -> 33
to deltaLog.snapshot.metadata.configuration.
And we want to write only the following paris
delta.deletedFileRetentionDuration -> interval 2 weeks
dataSkippingNumIndexedCols -> 33
That makes sense! so it's only related to those that start with option.
so it's only related to those that start with
option.
Correct!