[BUG] `columnMapping` readerFeatures is missing when icebergCompatV1(2) is enabled
Bug
Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)
Describe the problem
The Writer Requirements for Column Mapping section of the protocol mentions:
write a protocol action to add the feature columnMapping to both readerFeatures and writerFeatures
However, the protocol entry doesn't have columnMapping in the readerFeatures field.
Steps to reproduce
- Create a new table with the icebergCompatV1 feature:
CREATE TABLE default.test_iceberg
(a INT)
USING DELTA
LOCATION 's3://test-bucket/test_iceberg'
TBLPROPERTIES ('delta.enableIcebergCompatV1'=true);
- Confirm transaction log:
{"commitInfo":{"timestamp":1716541145876,"operation":"CREATE TABLE","operationParameters":{"isManaged":"false","description":null,"partitionBy":"[]","properties":"{\"delta.enableIcebergCompatV1\":\"true\",\"delta.columnMapping.mode\":\"name\",\"delta.columnMapping.maxColumnId\":\"1\"}"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.5.0 Delta-Lake/3.1.0","txnId":"b172aa56-9393-4750-ac30-2c1996497efb"}}
{"metaData":{"id":"8d56ae44-1648-4fd5-a920-81f5f0725ab5","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"a\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"delta.columnMapping.id\":1,\"delta.columnMapping.physicalName\":\"col-39727da2-bc31-4d96-a0a9-b3f16113fde2\"}}]}","partitionColumns":[],"configuration":{"delta.enableIcebergCompatV1":"true","delta.columnMapping.mode":"name","delta.columnMapping.maxColumnId":"1"},"createdTime":1716541145862}}
{"protocol":{"minReaderVersion":2,"minWriterVersion":7,"writerFeatures":["columnMapping","icebergCompatV1"]}}
Observed results
The protocol action lists columnMapping only under writerFeatures; there is no readerFeatures field.
Expected results
columnMapping exists in readerFeatures when icebergCompatV1 or icebergCompatV2 is enabled.
Further details
Environment information
- Delta Lake version: 3.1.0
- Spark version: 3.5.0
- Scala version: 2.12
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
- [ ] Yes. I can contribute a fix for this bug independently.
- [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
- [x] No. I cannot contribute a bug fix at this time.
cc. @lzlfred @harperjiang
@vkorukanti @lzlfred @harperjiang Could you let me know whether this is incorrect documentation or an implementation bug?
@ebyhr the document says:
If the table is on Writer Version 5 or 6: write a metaData action to add the delta.columnMapping.mode table property;
If the table is on Writer Version 7:
write a protocol action to add the feature columnMapping to both readerFeatures and writerFeatures
I believe the behavior you observed is just as described in the doc when we have writer version 5/6, and thus I don't consider it a bug.
The writer version is 7 and it mentions "add the feature columnMapping to both readerFeatures and writerFeatures".
{"protocol":{"minReaderVersion":2,"minWriterVersion":7,"writerFeatures":["columnMapping","icebergCompatV1"]}}
Then, I would recommend updating the protocol documentation. From the current sentence, I would expect readerFeatures to always contain columnMapping whenever it exists in writerFeatures.
I think we need to update the protocol to make that paragraph clearer.
When the reader version is 2, readerFeatures doesn't exist, and if it did, it would break the protocol:
For new tables, when a new table is created with a Reader Version up to 2 and Writer Version 7, its protocol action must only contain writerFeatures.
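To make those two regimes concrete, here is a minimal, hypothetical check (plain Python, not Delta code; the check_protocol helper and the dict shape are my own assumptions) applied to the protocol action from this issue:

```python
# Hypothetical sketch: evaluate a parsed protocol action against the two rules
# quoted above. This illustrates one reading of the spec; it is not Delta code.
def check_protocol(protocol: dict) -> list:
    problems = []
    reader = protocol["minReaderVersion"]
    writer = protocol["minWriterVersion"]
    if writer == 7 and reader <= 2 and "readerFeatures" in protocol:
        # Reader Version up to 2 + Writer Version 7: only writerFeatures may be listed.
        problems.append("readerFeatures must not be present when minReaderVersion <= 2")
    if writer == 7 and reader >= 3 and "columnMapping" in protocol.get("writerFeatures", []):
        # Reader Version 3 + Writer Version 7: columnMapping must appear in both lists.
        if "columnMapping" not in protocol.get("readerFeatures", []):
            problems.append("columnMapping missing from readerFeatures")
    return problems

# The protocol action reported in this issue:
issue_protocol = {
    "minReaderVersion": 2,
    "minWriterVersion": 7,
    "writerFeatures": ["columnMapping", "icebergCompatV1"],
}
print(check_protocol(issue_protocol))  # [] under this reading of the spec
```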
Actually, after running some experiments I realized Delta Spark doesn't allow you to enable column mapping on an existing table when the reader is version 2 and the writer is version 7.
But it does allow you to create a table that way, so it is inconsistent. I can't say what the correct behavior is; my guess is that reader 2 + writer 7 (with columnMapping in writerFeatures) should be considered compliant with the spec.
Upgrading the protocol versions on an existing table works:
ALTER TABLE Test SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '7'
);
This fails:
ALTER TABLE Test SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name'
)
Your current table protocol version does not support changing column mapping modes
using delta.columnMapping.mode.
Required Delta protocol version for column mapping:
Protocol(3,7,[columnMapping],[columnMapping])
Your table's current Delta protocol version:
Protocol(2,7,None,[appendOnly,invariants])
Please enable Column Mapping on your Delta table with mapping mode 'name'.
You can use one of the following commands.
If your table is already on the required protocol version:
ALTER TABLE table_name SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
If your table is not on the required protocol version and requires a protocol upgrade:
ALTER TABLE table_name SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '3',
'delta.minWriterVersion' = '7')
This works:
CREATE TABLE Test (Id INT) USING DELTA TBLPROPERTIES ('delta.minWriterVersion' = '7', 'delta.columnMapping.mode' = 'name');
I ran a few tests, and the behaviour is definitely buggy. When I enable column mapping with minReaderVersion=2 and then upgrade to minReaderVersion=3, there is no columnMapping in the reader protocol features.
Test 1
def test_002(self) -> None:
    # BROKEN
    # Create table with SQL API
    # Migrate reader 1->2->3 using SQL API
    table_path = f"{self.builder.context.tmp_path}/table_002"
    table_log_path = f"{table_path}/_delta_log"
    table_log_0_path = f"{table_log_path}/00000000000000000000.json"
    table_log_1_path = f"{table_log_path}/00000000000000000001.json"
    table_log_2_path = f"{table_log_path}/00000000000000000002.json"
    self.spark.sql(f"CREATE TABLE delta.`{table_path}` (a INT, b INT) USING delta")
    self.spark.sql(
        f"ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES ("
        "'delta.minReaderVersion'='2',"
        "'delta.minWriterVersion'='7',"
        "'delta.columnMapping.mode'='name')"
    )
    self.spark.sql(
        f"ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES ("
        "'delta.minReaderVersion'='3',"
        "'delta.minWriterVersion'='7',"
        "'delta.columnMapping.mode'='name')"
    )
    self.spark.read.format("json").load(table_log_0_path).select("protocol").show(truncate=False)
    self.spark.read.format("json").load(table_log_1_path).select("protocol").show(truncate=False)
    self.spark.read.format("json").load(table_log_2_path).select("protocol").show(truncate=False)
    pass
Test 1 result
+--------+
|protocol|
+--------+
|NULL |
|NULL |
|{1, 2} |
+--------+
+-----------------------------------------------+
|protocol |
+-----------------------------------------------+
|NULL |
|NULL |
|{2, 7, [columnMapping, appendOnly, invariants]}|
+-----------------------------------------------+
+---------------------------------------------------+
|protocol |
+---------------------------------------------------+
|NULL |
|NULL |
|{3, 7, [], [columnMapping, appendOnly, invariants]}|
+---------------------------------------------------+
On the other hand, when I enable column mapping and set minReaderVersion=3 at the same time, columnMapping does appear in the reader protocol features.
Test 2
def test_003(self) -> None:
    # WORKS
    # Create table with SQL API
    # Migrate reader 1->3 using SQL API
    table_path = f"{self.builder.context.tmp_path}/table_003"
    table_log_path = f"{table_path}/_delta_log"
    table_log_0_path = f"{table_log_path}/00000000000000000000.json"
    table_log_1_path = f"{table_log_path}/00000000000000000001.json"
    self.spark.sql(f"CREATE TABLE delta.`{table_path}` (a INT, b INT) USING delta")
    self.spark.sql(
        f"ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES ("
        "'delta.minReaderVersion'='3',"
        "'delta.minWriterVersion'='7',"
        "'delta.columnMapping.mode'='name')"
    )
    self.spark.read.format("json").load(table_log_0_path).select("protocol").show(truncate=False)
    self.spark.read.format("json").load(table_log_1_path).select("protocol").show(truncate=False)
    pass
Test 2 result
+--------+
|protocol|
+--------+
|NULL |
|NULL |
|{1, 2} |
+--------+
+----------------------------------------------------------------+
|protocol |
+----------------------------------------------------------------+
|NULL |
|NULL |
|{3, 7, [columnMapping], [columnMapping, appendOnly, invariants]}|
+----------------------------------------------------------------+
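For what it's worth, the assertion these tests imply could look roughly like this (a sketch reusing table_log_2_path from Test 1 and a SparkSession named spark, i.e. self.spark in the test; this is my own check, not an existing Delta test):

```python
# Read the protocol action written by the reader 2 -> 3 upgrade in Test 1 and
# check that columnMapping is a reader feature whenever it is a writer feature.
row = (
    spark.read.format("json")
    .load(table_log_2_path)  # third commit from Test 1
    .select("protocol")
    .where("protocol IS NOT NULL")
    .first()
)
protocol = row["protocol"].asDict()
reader_features = protocol.get("readerFeatures") or []
writer_features = protocol.get("writerFeatures") or []

if protocol["minReaderVersion"] >= 3 and "columnMapping" in writer_features:
    # Fails on Delta 3.1.0: the Test 1 result shows readerFeatures is [] after the upgrade.
    assert "columnMapping" in reader_features, "columnMapping missing from readerFeatures"
```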
@felipepessoto @vkorukanti @lzlfred @harperjiang Hello! Are there any plans to fix this issue? :-)
Thanks @aurokk, we'll have a look at this.