Core: Dropping an old partition column causes NPE (and corrupt metadata on v2 tables)
Apache Iceberg version
0.14.0 (latest release)
Query engine
Spark
Please describe the bug 🐞
On a format version 2 table, dropping a column that used to be a partition column causes a NullPointerException in PartitionSpec.Builder, and every subsequent operation on the table throws the same exception.
java.lang.NullPointerException: Cannot find source column: 2
  at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:963)
  at org.apache.iceberg.PartitionSpec$Builder.add(PartitionSpec.java:517)
  at org.apache.iceberg.UnboundPartitionSpec.copyToBuilder(UnboundPartitionSpec.java:56)
  at org.apache.iceberg.UnboundPartitionSpec.bind(UnboundPartitionSpec.java:44)
  at org.apache.iceberg.PartitionSpecParser.fromJson(PartitionSpecParser.java:87)
On a v1 table, the table remains accessible afterwards, but the ALTER TABLE statement throws the same NPE.
The issue is easily reproducible using the following script in a Spark Shell:
CREATE TABLE data.test_table (ts timestamp not null, day_of_ts date) USING iceberg PARTITIONED BY (day_of_ts);
ALTER TABLE data.test_table SET TBLPROPERTIES ('format-version' = '2');
ALTER TABLE data.test_table REPLACE PARTITION FIELD day_of_ts WITH days(ts);
ALTER TABLE data.test_table DROP COLUMN day_of_ts;
REFRESH TABLE data.test_table;
SELECT * FROM data.test_table;
On closer inspection of the metadata, I see that on a v1 table the metadata is not updated when dropping the old partition field, which explains why a v1 table keeps working afterwards. I don't see what the issue is with the v2 metadata, though.
I am using Spark Shell on Spark 3.3.0 with Iceberg 0.14.0
I'm able to reproduce this. It looks like we're deleting a field that used to be part of a PartitionSpec. I'm not sure how we should handle this, but I don't think we should throw an NPE. The field isn't available in the current schema, but it should be in a previous schema.

I think you're supposed to use DROP PARTITION FIELD days(ts), or drop the column directly.
Here's some examples from the tests: https://github.com/apache/iceberg/blob/84f40cff9b98ee15b706289e551078355bb8a7a5/spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAlterTablePartitionFields.java#L397-L422
I'm wondering if that's a correct usage of the REPLACE PARTITION FIELD statement.
All of the examples of REPLACE PARTITION FIELD seem to use a transform of the same type:
Here's the actual SQL definition:
spark/v3.3/spark-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
72: | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
But here's the examples I could find:
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAlterTablePartitionFields.java
307: sql("ALTER TABLE %s REPLACE PARTITION FIELD days(ts) WITH hours(ts)", tableName);
333: sql("ALTER TABLE %s REPLACE PARTITION FIELD days(ts) WITH hours(ts) AS hour_col", tableName);
359: sql("ALTER TABLE %s REPLACE PARTITION FIELD day_col WITH hours(ts)", tableName);
385: sql("ALTER TABLE %s REPLACE PARTITION FIELD day_col WITH hours(ts) AS hour_col", tableName);
But this does seem like a bug. Does this still happen if the table is refreshed in between the ALTER TABLE statements?
I think it still needs to use DROP PARTITION FIELD day_of_ts in order to properly clean up the data, as that column is still a partition field in older snapshots. But again, this does seem like a bug.
Yes, it still happens with REFRESH TABLE in between.
The last two examples you posted from the TestAlterTablePartitionFields test case look like exactly what I am doing, except that I drop the old field from the table afterwards.
I expected REPLACE PARTITION FIELD to be a single-snapshot version of DROP and ADD, but that may be wrong. I haven't tested doing it in two operations. I will try that when I have a free moment and report back.
I couldn't wait, so I just tried it.
Same result using DROP and ADD:
create table data.test_table (ts timestamp not null, day_of_ts date) using iceberg partitioned by (day_of_ts);
alter table data.test_table SET TBLPROPERTIES ('format-version' = '2');
refresh table data.test_table;
alter table data.test_table DROP PARTITION FIELD day_of_ts;
refresh table data.test_table;
alter table data.test_table add partition field days(ts);
refresh table data.test_table;
alter table data.test_table drop column day_of_ts;
refresh table data.test_table;
select * from data.test_table;
As per my understanding, the issue here is that we are trying to bind all partition specs to the current schema, which will not always be possible.
For example, we find the current schema and assign it to the local variable schema:
https://github.com/apache/iceberg/blob/84f40cff9b98ee15b706289e551078355bb8a7a5/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L344-L351
then we bind all partition specs to that schema, which is the current schema:
https://github.com/apache/iceberg/blob/84f40cff9b98ee15b706289e551078355bb8a7a5/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L381-L385
which is where I think it's failing: we have dropped day_of_ts, so the current schema no longer contains it, while partition spec 0 (the initial partitioning) still references day_of_ts, hence the bind fails.
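The failure mode described above can be sketched in isolation. This is a hypothetical toy model (the class and map below are not the real Iceberg types): the current schema is a map from column id to name, and binding a spec field looks its source column up by id, failing with the same NPE message when the column has been dropped.

```java
import java.util.HashMap;
import java.util.Map;

public class BindSketch {
  // Current schema after "DROP COLUMN day_of_ts": only ts (id 1) remains;
  // column id 2 (day_of_ts) has been dropped.
  static Map<Integer, String> currentSchema = new HashMap<>();
  static {
    currentSchema.put(1, "ts");
  }

  // Mimics PartitionSpec.Builder#add: look up the source column by id and
  // fail if it is not present in the schema we are binding against.
  static String bind(int sourceId) {
    String column = currentSchema.get(sourceId);
    if (column == null) {
      throw new NullPointerException("Cannot find source column: " + sourceId);
    }
    return column;
  }

  public static void main(String[] args) {
    // The spec referencing days(ts) binds fine against the current schema.
    System.out.println(bind(1));
    try {
      // Historical spec 0 still references day_of_ts (id 2), so binding it
      // against the current schema throws, just like the stack trace above.
      bind(2);
    } catch (NullPointerException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The point is that the lookup itself is fine; the bug is that metadata parsing binds every historical spec against the single current schema instead of a schema that actually contains the spec's source columns.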
@kbendick The tests are different; they don't drop the source column.
I did some digging around in the code, and we bind the partition spec right away because we need the column type for the transform:
Builder add(int sourceId, int fieldId, String name, String transform) {
  Types.NestedField column = schema.findField(sourceId);
  Preconditions.checkNotNull(column, "Cannot find source column: %s", sourceId);
  return add(sourceId, fieldId, name, Transforms.fromString(column.type(), transform));
}
There is a PR https://github.com/apache/iceberg/pull/5601 that will decouple this, and then we can also solve this issue.
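To illustrate the decoupling idea (this is a hypothetical sketch of the general approach, not the actual design of that PR): keep the spec field unbound, as a raw (source id, transform name) pair, and resolve it against a schema only when needed, so that parsing table metadata no longer requires every historical source column to still exist.

```java
import java.util.Map;

public class UnboundSketch {
  // An "unbound" partition field keeps the raw data from the metadata JSON;
  // no schema lookup happens at parse time.
  record UnboundField(int sourceId, String transform) {}

  // Binding is deferred until a schema is supplied; an unknown source id can
  // be surfaced gracefully instead of failing the whole metadata parse.
  static String bind(UnboundField field, Map<Integer, String> schema) {
    String column = schema.get(field.sourceId());
    return column == null
        ? field.transform() + "(<dropped column " + field.sourceId() + ">)"
        : field.transform() + "(" + column + ")";
  }

  public static void main(String[] args) {
    // Current schema after the drop: only ts (id 1); day_of_ts (id 2) is gone.
    Map<Integer, String> currentSchema = Map.of(1, "ts");

    // Historical spec 0 referenced day_of_ts; the current spec uses days(ts).
    System.out.println(bind(new UnboundField(2, "identity"), currentSchema)); // identity(<dropped column 2>)
    System.out.println(bind(new UnboundField(1, "day"), currentSchema));      // day(ts)
  }
}
```

With this shape, loading the table and reading with the current spec keeps working even though an old spec references a dropped column.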
A simple test reproduces this behavior:
@Test
public void testDropColumnOfOldPartitionField() {
  sql("CREATE TABLE %s (id bigint NOT NULL, ts timestamp, day_of_ts date) USING iceberg PARTITIONED BY (day_of_ts)", tableName);
  sql("ALTER TABLE %s SET TBLPROPERTIES ('format-version' = '2')", tableName);
  sql("ALTER TABLE %s REPLACE PARTITION FIELD day_of_ts WITH days(ts)", tableName);
  sql("ALTER TABLE %s DROP COLUMN day_of_ts", tableName);
}
I think this should be quite straightforward to fix once https://github.com/apache/iceberg/pull/5601 has been merged.
Created a PR: https://github.com/apache/iceberg/pull/5707