incubator-gluten
Parquet files written for Iceberg tables do not have field_id (schema evolution is broken)
Backend
VL (Velox)
Bug description
The Iceberg spec requires field IDs to be set:
Column IDs are required to be stored as field IDs on the parquet schema.
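To illustrate why missing field IDs break schema evolution, here is a minimal toy sketch in Python. It is not the Parquet format and not Iceberg's actual reader: the tuple layout and the `read_column` helper are hypothetical, and the fallback to name matching is an assumption made only for this illustration.

```python
# Toy model of a data file: a list of (field_id or None, column_name, values).
# Hypothetical layout for illustration only; real Parquet stores field_id
# on each schema element.

def read_column(file_columns, field_id, current_name):
    """Resolve a table column against a data file.

    Prefer matching by field ID; fall back to the current column name
    only when the file carries no matching ID (assumed fallback).
    """
    by_id = {fid: vals for (fid, _name, vals) in file_columns if fid is not None}
    if field_id in by_id:
        return by_id[field_id]
    by_name = {name: vals for (_fid, name, vals) in file_columns}
    return by_name.get(current_name)


# Both files were written while field 2 was still called "price".
with_ids = [(1, "id", [10, 20]), (2, "price", [1.5, 2.5])]
without_ids = [(None, "id", [10, 20]), (None, "price", [1.5, 2.5])]

# The table later renames field 2 from "price" to "amount".
renamed_ok = read_column(with_ids, 2, "amount")       # found via field ID
renamed_lost = read_column(without_ids, 2, "amount")  # no ID, old name: not found
```

With IDs in the file, the renamed column still resolves. Without them, the old data is no longer reachable under the new name.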
As far as I can tell, the actual column IDs from the Iceberg schema are not passed here, so Velox cannot write them later:
https://github.com/apache/incubator-gluten/blob/2ec3ba751821d5e09a4da630c2b55e8a1a3ccb1b/cpp/velox/compute/VeloxRuntime.cc#L232
It looks like it would be possible to pass the IDs on the Velox side through IcebergColumnHandle after this PR.
Am I right that there's no info about actual Iceberg column_ids in Java_org_apache_gluten_execution_IcebergWriteJniWrapper_init right now?
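To make the question concrete, the information that would have to reach the native writer is the ID of every field, including nested struct fields and list or map elements. The sketch below walks a schema in Iceberg's JSON serialization and collects those IDs. It is not Gluten or Velox code: `collect_field_ids` and the sample schema are hypothetical, and how the result would actually cross the JNI boundary is left open.

```python
import json


def collect_field_ids(node, prefix=""):
    """Return (dotted_path, field_id) pairs for every field in a schema node."""
    out = []
    if isinstance(node, str):
        # Primitive types are plain strings such as "long" and carry no IDs.
        return out
    kind = node["type"]
    if kind == "struct":
        for field in node["fields"]:
            path = prefix + field["name"]
            out.append((path, field["id"]))
            out.extend(collect_field_ids(field["type"], path + "."))
    elif kind == "list":
        out.append((prefix + "element", node["element-id"]))
        out.extend(collect_field_ids(node["element"], prefix + "element."))
    elif kind == "map":
        out.append((prefix + "key", node["key-id"]))
        out.extend(collect_field_ids(node["key"], prefix + "key."))
        out.append((prefix + "value", node["value-id"]))
        out.extend(collect_field_ids(node["value"], prefix + "value."))
    return out


# Hypothetical sample schema in Iceberg's JSON schema format.
schema = json.loads("""
{"type": "struct", "schema-id": 0, "fields": [
  {"id": 1, "name": "id", "required": true, "type": "long"},
  {"id": 2, "name": "tags", "required": false,
   "type": {"type": "list", "element-id": 3,
            "element": "string", "element-required": true}},
  {"id": 4, "name": "point", "required": false,
   "type": {"type": "struct", "fields": [
     {"id": 5, "name": "x", "required": true, "type": "double"},
     {"id": 6, "name": "y", "required": true, "type": "double"}]}}
]}
""")

field_ids = collect_field_ids(schema)
for path, field_id in field_ids:
    print(f"{path}: {field_id}")
```

Passing only top-level IDs would not be enough for nested types, since list elements, map keys and values, and struct members each carry their own ID.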
Gluten version
main branch
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs