[BUG] `columnMapping` readerFeatures is missing when icebergCompatV1(2) is enabled
Bug
Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)
Describe the problem
The Writer Requirements for Column Mapping section of the protocol mentions:
write a protocol action to add the feature columnMapping to both readerFeatures and writerFeatures
However, the protocol entry doesn't have columnMapping in the readerFeatures field.
Steps to reproduce
- Create a new table with the icebergCompatV1 feature:
CREATE TABLE default.test_iceberg
(a INT)
USING DELTA
LOCATION 's3://test-bucket/test_iceberg'
TBLPROPERTIES ('delta.enableIcebergCompatV1'=true);
- Confirm transaction log:
{"commitInfo":{"timestamp":1716541145876,"operation":"CREATE TABLE","operationParameters":{"isManaged":"false","description":null,"partitionBy":"[]","properties":"{\"delta.enableIcebergCompatV1\":\"true\",\"delta.columnMapping.mode\":\"name\",\"delta.columnMapping.maxColumnId\":\"1\"}"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.5.0 Delta-Lake/3.1.0","txnId":"b172aa56-9393-4750-ac30-2c1996497efb"}}
{"metaData":{"id":"8d56ae44-1648-4fd5-a920-81f5f0725ab5","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"a\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"delta.columnMapping.id\":1,\"delta.columnMapping.physicalName\":\"col-39727da2-bc31-4d96-a0a9-b3f16113fde2\"}}]}","partitionColumns":[],"configuration":{"delta.enableIcebergCompatV1":"true","delta.columnMapping.mode":"name","delta.columnMapping.maxColumnId":"1"},"createdTime":1716541145862}}
{"protocol":{"minReaderVersion":2,"minWriterVersion":7,"writerFeatures":["columnMapping","icebergCompatV1"]}}
Observed results
The protocol action lists columnMapping only under writerFeatures; there is no readerFeatures field.
Expected results
columnMapping exists in readerFeatures when icebergCompatV1 or icebergCompatV2 is enabled.
Further details
Environment information
- Delta Lake version: 3.1.0
- Spark version: 3.5.0
- Scala version: 2.12
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
- [ ] Yes. I can contribute a fix for this bug independently.
- [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
- [x] No. I cannot contribute a bug fix at this time.
cc. @lzlfred @harperjiang
@vkorukanti @lzlfred @harperjiang Could you let me know whether this is incorrect documentation or an implementation bug?
@ebyhr the document says:
If the table is on Writer Version 5 or 6: write a metaData action to add the delta.columnMapping.mode table property;
If the table is on Writer Version 7:
write a protocol action to add the feature columnMapping to both readerFeatures and writerFeatures
I believe the behavior you observed is just as described in the doc when we have writer version 5/6, and thus I don't consider it a bug.
The writer version is 7 and it mentions "add the feature columnMapping to both readerFeatures and writerFeatures".
{"protocol":{"minReaderVersion":2,"minWriterVersion":7,"writerFeatures":["columnMapping","icebergCompatV1"]}}
Then, I would recommend updating the protocol documentation. From the current sentence, I would expect readerFeatures to always contain columnMapping whenever it exists in writerFeatures.
I think we need to update the protocol to make that paragraph clearer.
When the reader version is 2, readerFeatures doesn't exist, and if it did, it would break the protocol:
For new tables, when a new table is created with a Reader Version up to 2 and Writer Version 7, its protocol action must only contain writerFeatures.
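To make those two regimes concrete, here is a minimal, hypothetical check (plain Python, not Delta code; the check_protocol helper and the dict shape are my own assumptions) applied to the protocol action from this issue:

```python
# Hypothetical sketch: evaluate a parsed protocol action against the two rules
# quoted above. This illustrates one reading of the spec; it is not Delta code.
def check_protocol(protocol: dict) -> list:
    problems = []
    reader = protocol["minReaderVersion"]
    writer = protocol["minWriterVersion"]
    if writer == 7 and reader <= 2 and "readerFeatures" in protocol:
        # Reader Version up to 2 + Writer Version 7: only writerFeatures may be listed.
        problems.append("readerFeatures must not be present when minReaderVersion <= 2")
    if writer == 7 and reader >= 3 and "columnMapping" in protocol.get("writerFeatures", []):
        # Reader Version 3 + Writer Version 7: columnMapping must appear in both lists.
        if "columnMapping" not in protocol.get("readerFeatures", []):
            problems.append("columnMapping missing from readerFeatures")
    return problems

# The protocol action reported in this issue:
issue_protocol = {
    "minReaderVersion": 2,
    "minWriterVersion": 7,
    "writerFeatures": ["columnMapping", "icebergCompatV1"],
}
print(check_protocol(issue_protocol))  # [] under this reading of the spec
```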
Actually, after running some experiments I realized Delta Spark doesn't allow you to enable column mapping on an existing table when the reader is version 2 and the writer is version 7.
But it does allow you to create a table that way, so it is inconsistent. I can't say what the correct behavior is; my guess is that reader 2 + writer 7 (with columnMapping in writerFeatures) should be considered compliant with the spec.
Upgrading the protocol versions on an existing table works:
ALTER TABLE Test SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '7'
);
This fails:
ALTER TABLE Test SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name'
)
Your current table protocol version does not support changing column mapping modes
using delta.columnMapping.mode.
Required Delta protocol version for column mapping:
Protocol(3,7,[columnMapping],[columnMapping])
Your table's current Delta protocol version:
Protocol(2,7,None,[appendOnly,invariants])
Please enable Column Mapping on your Delta table with mapping mode 'name'.
You can use one of the following commands.
If your table is already on the required protocol version:
ALTER TABLE table_name SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
If your table is not on the required protocol version and requires a protocol upgrade:
ALTER TABLE table_name SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '3',
'delta.minWriterVersion' = '7')
This works:
CREATE TABLE Test (Id INT) USING DELTA TBLPROPERTIES ('delta.minWriterVersion' = '7', 'delta.columnMapping.mode' = 'name');
I ran a few tests, and the behaviour is definitely buggy. When I enable column mapping with minReaderVersion=2 and then upgrade to minReaderVersion=3, there is no columnMapping in the reader protocol features.
Test 1
def test_002(self) -> None:
    # BROKEN
    # Create table with SQL API
    # Migrate reader 1->2->3 using SQL API
    table_path = f"{self.builder.context.tmp_path}/table_002"
    table_log_path = f"{table_path}/_delta_log"
    table_log_0_path = f"{table_log_path}/00000000000000000000.json"
    table_log_1_path = f"{table_log_path}/00000000000000000001.json"
    table_log_2_path = f"{table_log_path}/00000000000000000002.json"
    self.spark.sql(f"CREATE TABLE delta.`{table_path}` (a INT, b INT) USING delta")
    self.spark.sql(
        f"ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES ("
        "'delta.minReaderVersion'='2',"
        "'delta.minWriterVersion'='7',"
        "'delta.columnMapping.mode'='name')"
    )
    self.spark.sql(
        f"ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES ("
        "'delta.minReaderVersion'='3',"
        "'delta.minWriterVersion'='7',"
        "'delta.columnMapping.mode'='name')"
    )
    self.spark.read.format("json").load(table_log_0_path).select("protocol").show(truncate=False)
    self.spark.read.format("json").load(table_log_1_path).select("protocol").show(truncate=False)
    self.spark.read.format("json").load(table_log_2_path).select("protocol").show(truncate=False)
    pass
Test 1 result
+--------+
|protocol|
+--------+
|NULL |
|NULL |
|{1, 2} |
+--------+
+-----------------------------------------------+
|protocol |
+-----------------------------------------------+
|NULL |
|NULL |
|{2, 7, [columnMapping, appendOnly, invariants]}|
+-----------------------------------------------+
+---------------------------------------------------+
|protocol |
+---------------------------------------------------+
|NULL |
|NULL |
|{3, 7, [], [columnMapping, appendOnly, invariants]}|
+---------------------------------------------------+
On the other hand, when I enable column mapping and set minReaderVersion=3 at the same time, columnMapping does appear in the reader protocol features.
Test 2
def test_003(self) -> None:
    # WORKS
    # Create table with SQL API
    # Migrate reader 1->3 using SQL API
    table_path = f"{self.builder.context.tmp_path}/table_003"
    table_log_path = f"{table_path}/_delta_log"
    table_log_0_path = f"{table_log_path}/00000000000000000000.json"
    table_log_1_path = f"{table_log_path}/00000000000000000001.json"
    self.spark.sql(f"CREATE TABLE delta.`{table_path}` (a INT, b INT) USING delta")
    self.spark.sql(
        f"ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES ("
        "'delta.minReaderVersion'='3',"
        "'delta.minWriterVersion'='7',"
        "'delta.columnMapping.mode'='name')"
    )
    self.spark.read.format("json").load(table_log_0_path).select("protocol").show(truncate=False)
    self.spark.read.format("json").load(table_log_1_path).select("protocol").show(truncate=False)
    pass
Test 2 result
+--------+
|protocol|
+--------+
|NULL |
|NULL |
|{1, 2} |
+--------+
+----------------------------------------------------------------+
|protocol |
+----------------------------------------------------------------+
|NULL |
|NULL |
|{3, 7, [columnMapping], [columnMapping, appendOnly, invariants]}|
+----------------------------------------------------------------+
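For what it's worth, the assertion these tests imply could look roughly like this (a sketch reusing table_log_2_path from Test 1 and a SparkSession named spark, i.e. self.spark in the test; this is my own check, not an existing Delta test):

```python
# Read the protocol action written by the reader 2 -> 3 upgrade in Test 1 and
# check that columnMapping is a reader feature whenever it is a writer feature.
row = (
    spark.read.format("json")
    .load(table_log_2_path)  # third commit from Test 1
    .select("protocol")
    .where("protocol IS NOT NULL")
    .first()
)
protocol = row["protocol"].asDict()
reader_features = protocol.get("readerFeatures") or []
writer_features = protocol.get("writerFeatures") or []

if protocol["minReaderVersion"] >= 3 and "columnMapping" in writer_features:
    # Fails on Delta 3.1.0: the Test 1 result shows readerFeatures is [] after the upgrade.
    assert "columnMapping" in reader_features, "columnMapping missing from readerFeatures"
```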
@felipepessoto @vkorukanti @lzlfred @harperjiang Hello! Are there any plans to fix this issue? :-)
Thanks @aurokk, we'll have a look at this.