"Cannot promote timestamp to timestamptz" error when loading Dremio created table
Apache Iceberg version
0.10.0 (latest release)
Please describe the bug 🐞
My Python version is 3.12. When I use pyiceberg to read a table created by Dremio using the code below, I get a "Cannot promote timestamp to timestamptz" error, but if I use the same dataset to create the table in Spark, the table can be loaded correctly.
namespace = "test" table_name = "Files_Created_by_Dremio" table = nessie_catalog.load_table(f"{namespace}.{table_name}")
print(table.name()) print(table.schema()) table.scan().to_pandas()
```
('test', 'Files_Created_by_Dremio')
table {
1: ID: optional long
2: Name: optional string
3: FileImportID: optional int
4: ReceivedDateTime: optional timestamptz
5: IsMoved: optional boolean
6: MoveAttempt: optional int
7: IsProcessed: optional boolean
8: ProcessingAttempt: optional int
9: TotalDataRows: optional int
10: TotalProcessedDataRows: optional int
11: TotalExcludedDataRows: optional int
12: TotalAggregatedRows: optional int
13: IsArchived: optional boolean
14: ArchivalAttempt: optional int
15: IsMovedToTransaction: optional boolean
16: MovedToTransactionAttempt: optional int
17: IsEnriched: optional boolean
18: IsEnrichmentAttempt: optional int
19: IsAEnriched: optional boolean
20: AEnrichmentAttempt: optional int
21: IsBEnriched: optional boolean
22: BEnrichmentAttempt: optional int
23: IsCEnriched: optional boolean
24: CEnrichmentAttempt: optional int
25: HasPendingEnrichment: optional boolean
26: HasPendingAEnrichment: optional boolean
27: HasPendingBEnrichment: optional boolean
28: HasPendingCEnrichment: optional boolean
29: FileBusinessDate: optional timestamptz
30: IsUploaded: optional boolean
31: IsValid: optional boolean
32: Status: optional string
33: StatusMessage: optional string
34: LastActionDateTime: optional timestamptz
35: ActedBy: optional int
36: CreatedOn: optional timestamptz
37: IsAttempedAll: optional boolean
38: IsAttempedAll2: optional boolean
39: IsClosed: optional boolean
40: SparkUniqueID: optional string
}
```
```
ResolveError                              Traceback (most recent call last)
Cell In[3], line 12
      8 print(table.location())
      9 print(table.schema())
---> 12 table.scan().to_pandas()

   1465 def to_pandas(self, **kwargs: Any) -> pd.DataFrame:
   1466     """Read a Pandas DataFrame eagerly from this Iceberg table.
   1467
   1468     Returns:
   1469         pd.DataFrame: Materialized Pandas Dataframe from the Iceberg table
   1470     """
-> 1471     return self.to_arrow().to_pandas(**kwargs)

   1427     """Read an Arrow table eagerly from this DataScan.
   1428
   1429     All rows will be loaded into memory at once.
   (...)
   1432         pa.Table: Materialized Arrow Table from the Iceberg table's DataScan
   1433     """
   1434     from pyiceberg.io.pyarrow import ArrowScan
   1436     return ArrowScan(
   1437         self.table_metadata, self.io, self.projection(), self.row_filter, self.case_sensitive, self.limit
-> 1438     ).to_table(self.plan_files())

ResolveError: Cannot promote timestamp to timestamptz
```
The tables I created in Dremio and Spark are actually from the same dataset.
In Dremio:

```sql
CREATE TABLE nessie.test.Files_Created_by_Dremio AS select * from nessie.test.Files
```

In Spark:

```sql
CREATE TABLE nessie.test.Files_Created_by_Spark AS select * from nessie.test.Files
```
Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
This is related to how Dremio creates the Parquet files, and how pyiceberg interprets them.
See this thread in Dremio's community forum.
Basically, Dremio only supports timestamp types. However, they mark the actual Parquet files with a UTC-adjustment flag, which pyiceberg in turn decides to interpret as timestamptz, even though the concrete schema type is timestamp. This is a pyiceberg bug, at least that's the way I see it.
I think you should be able to work around it if you create a complementary table using pyiceberg with timestamptz columns, and copy the data to it. I think I found the actual location in the pyiceberg source code where this manipulation happens, but have yet to open an issue.
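A rough sketch of that workaround idea (hypothetical and untested; the table names and file path are illustrative, and it assumes you bypass the failing scan by reading the Parquet data files directly with pyarrow):

```python
# Hypothetical workaround sketch, not a verified fix: read a Parquet data
# file directly, reinterpret naive timestamp columns as UTC-aware ones,
# and append them to a pre-created table whose schema uses timestamptz.
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

catalog = load_catalog("nessie")  # assumes a configured catalog named "nessie"
fixed = catalog.load_table("test.Files_Fixed")  # pre-created with timestamptz columns

arrow_table = pq.read_table("/path/to/one-data-file.parquet")  # illustrative path
for i, field in enumerate(arrow_table.schema):
    if pa.types.is_timestamp(field.type) and field.type.tz is None:
        # The stored microsecond value is unchanged; the cast only attaches the zone.
        utc_type = pa.timestamp(field.type.unit, tz="UTC")
        arrow_table = arrow_table.set_column(
            i, field.name, arrow_table.column(i).cast(utc_type)
        )

fixed.append(arrow_table)
```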
This might also be a Dremio issue https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/5 which was mentioned here as fixed https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/7
@chrisqiqiu could you give it a try with the new Dremio release?
pyiceberg does the proper schema conversion https://github.com/apache/iceberg-python/blob/8878b2c260bf62ad16ef6f8e8d3a6ae93103eff7/pyiceberg/utils/schema_conversion.py#L382-L386
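For illustration, here is a minimal sketch of exercising that conversion, assuming the AvroSchemaConversion helper from the same module; the Avro record itself is made up, not taken from a real manifest:

```python
# Sketch: the adjust-to-utc annotation decides timestamp vs timestamptz.
from pyiceberg.utils.schema_conversion import AvroSchemaConversion

avro_schema = {
    "type": "record",
    "name": "example",
    "fields": [
        {
            "name": "ts",
            "field-id": 1,
            "type": {"type": "long", "logicalType": "timestamp-micros", "adjust-to-utc": True},
        },
    ],
}
# adjust-to-utc=True resolves to timestamptz; False (or absent) to timestamp.
print(AvroSchemaConversion().avro_to_iceberg(avro_schema))
```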
@kevinjqliu if you notice who opened this thread in the Dremio community forum, you'll see he and I share the same avatar 😉
There was a bug in Dremio where they wrote an incorrect type to the schema. They fixed it in 26.0.5, but nonetheless, the bug in the schema_conversion.py module of PyIceberg remains, as it causes the library to incorrectly interpret the schema and then fail on type promotion later down the line.
If the column is a timestamp column (which is what Dremio supports ATM), why should the adjust-to-utc flag on a Parquet file column cause it to be read differently? Dremio may or may not set this flag, but it still means to store a TZ-less timestamp.
If the schema defines a TimestampType column, and the data is a TimestampType column, I feel the library should not back itself into corners such as these, especially if the gain is adding a UTC indication for timestamps, as the user did not intend to include TZ awareness in the schema.
That's why I perceive this issue as a PyIceberg one, rather than a Dremio one.
Ah, I did not make that connection 😄
Ok, I think I see what's going on here.
From the Dremio thread:

> Thank you for bringing this up and for sharing the details, @sheinbergon. You’re right - in Dremio OSS 26.0.0, Iceberg tables created with timestamp columns may be recorded in the metadata as timestamptz, even though Dremio only supports timestamp. As you noted, this can cause compatibility issues with external readers.
The issue is that Dremio incorrectly produced the metadata ~(in avro)~ which specified certain columns as timestamptz while the data files (in parquet) specified those columns as timestamp (without tz).
If that is the case, I believe pyiceberg is doing the right thing by failing on "type promotion". It's using the metadata as the source of truth and expects the data files to have the same type, in this case timestamptz. We cannot process timestamptz as timestamp or vice versa since they are considered to be different data types.
> If the column is a timestamp column (which is what Dremio supports ATM), why should the adjust-to-utc flag on a Parquet file column cause it to be read differently?

The spec uses adjust-to-utc as a signal for timestamp vs timestamptz, and considers these to be two different data types. See https://iceberg.apache.org/spec/#avro
Furthermore, "Avro type annotation adjust-to-utc is an Iceberg convention; default value is false if not present."
Please LMK if I'm interpreting this correctly! I think the proper fix is to align the metadata and the data files on the same data type; otherwise the table may be deemed corrupt.
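For what it's worth, the failure can be reproduced in isolation (a minimal sketch, assuming pyiceberg 0.10.0, where promote in pyiceberg.schema is the resolution step that raises the error from the title):

```python
# Minimal sketch of the failure on pyiceberg 0.10.0: the type resolver
# refuses to promote between the two distinct timestamp types.
from pyiceberg.schema import promote
from pyiceberg.types import TimestampType, TimestamptzType

promote(TimestampType(), TimestamptzType())
# ResolveError: Cannot promote timestamp to timestamptz
```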
> if I use the same dataset to create the table in Spark, the table can be loaded correctly.

My gut feeling says this is unintentional and might even be a bug. Let me double check.
You are interpreting this correctly. BTW, the spec here applies at the Parquet level, not the Avro files.
> If that is the case, I believe pyiceberg is doing the right thing by failing on "type promotion". It's using the metadata as the source of truth and expects the data files to have the same type, in this case timestamptz.

So the failure is correct, of course, but the decision to treat data coming from the data file as timestamptz while the schema states it expects timestamp is the cause of this issue, and is wrong IMO (it's also a PyIceberg decision).
@kevinjqliu Moreover, to me Iceberg is all about interoperability. Tables are interacted with from multiple sources, and some vendors will have different approaches to writing data files; as long as schema types are preserved, it should be valid. I think PyIceberg's decision to treat data differently regardless of the schema compromises this principle.
> This might also be a Dremio issue https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/5 which was mentioned here as fixed https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/7
>
> @chrisqiqiu could you give it a try with the new Dremio release?
>
> pyiceberg does the proper schema conversion (pyiceberg/utils/schema_conversion.py, lines 382 to 386 at 8878b2c):
>
> ```python
> elif logical_type == "timestamp-micros":
>     if avro_logical_type.get("adjust-to-utc", False) is True:
>         return TimestamptzType()
>     else:
>         return TimestampType()
> ```
@kevinjqliu After upgrading Dremio to 26.0.5, it works fine now. Thanks, @sheinbergon!
Thanks for confirming, @chrisqiqiu. Glad this is resolved. I see the fix from Dremio here. However, I believe there may be an issue with the fix: the TIMESTAMPMILLI data type should never set isAdjustedToUTC to true. I’ll raise a separate issue with Dremio regarding this.
As for the underlying problem, I uncovered a few interesting behaviors and learnings:
- Schema mismatch between the table and data file is not compliant with the Iceberg spec and can lead to undefined behavior. Specifically (see the sketch after this list):
  - Iceberg timestamp fields expect the Parquet schema to be timestamp with adjustToUtc=false.
  - Iceberg timestamptz fields expect the Parquet schema to be timestamp with adjustToUtc=true.
- Spark behavior: Spark allows reading mismatched timestamp/timestamptz from Parquet. This is mentioned above, and I also verified it locally.
- PyIceberg behavior: PyIceberg allows reading timestamp as Iceberg timestamptz (#2333), but not timestamptz as Iceberg timestamp. This is also noted above.
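The Parquet-level flag can be observed directly with pyarrow (a sketch; in Arrow terms, adjustToUtc surfaces as the presence or absence of a timezone on the column type):

```python
# Sketch: a timezone-aware Arrow timestamp is written to Parquet with
# isAdjustedToUTC=true, a naive one with isAdjustedToUTC=false.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ts": pa.array([0], pa.timestamp("us")),              # Iceberg timestamp
    "tstz": pa.array([0], pa.timestamp("us", tz="UTC")),  # Iceberg timestamptz
})
pq.write_table(table, "/tmp/ts_demo.parquet")
print(pq.read_schema("/tmp/ts_demo.parquet"))
# ts: timestamp[us]
# tstz: timestamp[us, tz=UTC]
```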
> Some vendors will have different approaches to writing data files. As long as schema types are preserved, it should be valid.
Agreed, but I think this particular issue falls into a gray area. The writer is not writing in accordance with the Iceberg spec, which causes undefined behavior when reading. Everyone should write data in accordance with the Iceberg spec. On the read side, we may want to clarify this undefined behavior in the spec.
> I think PyIceberg's decision to treat data differently regardless of the schema compromises this principle.
Agreed. PyIceberg currently behaves differently from Spark. Since this is not aligned with the spec, the behavior is undefined. However, I believe we can work toward aligning PyIceberg with Spark here.
When reading parquet timestamp as Iceberg timestamptz (in #2333):
- We ignore the isAdjustedToUTC=False flag and assume it to be true.
- Semantically, the writer wrote wall time and the reader reads it as if it were in UTC.

When reading parquet timestamptz as Iceberg timestamp (not yet implemented):
- We can just ignore the isAdjustedToUTC=True flag since Iceberg timestamp just reads the value.
- Semantically, the writer wrote the timestamp adjusted to UTC and the reader reads it as wall time (sketched in Arrow terms below).
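A sketch of the two directions in Arrow terms (the cast only attaches or drops the zone; the stored microsecond value never changes, only its interpretation does):

```python
# Sketch: both mismatch directions are value-preserving reinterpretations.
from datetime import datetime

import pyarrow as pa

wall = pa.array([datetime(2024, 1, 1, 12, 0)], pa.timestamp("us"))

# parquet timestamp read as Iceberg timestamptz: wall time becomes a UTC instant.
as_utc = wall.cast(pa.timestamp("us", tz="UTC"))

# parquet timestamptz read as Iceberg timestamp: a UTC instant becomes wall time.
back = as_utc.cast(pa.timestamp("us"))

assert wall.equals(back)  # values round-trip, but the semantics shifted each way
```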
Both paths feel potentially dangerous tbh.
TODOs:
- [x] reach out to Dremio about the fix (https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/9?u=kevinjqliu)
- [ ] clarify the Iceberg spec when reading timestamptz as timestamp or timestamp as timestamptz
- [ ] allow pyiceberg to read timestamptz as Iceberg timestamp (#2708)
Dremio verified that the change is indeed correct. Dremio TIMESTAMP types are UTC-normalized.
https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/10?u=kevinjqliu
@kevinjqliu great. However, they still set isAdjustedToUTC=true in the Parquet data files, so pyiceberg still treats timestamp fields as timestamptz. My original fix request to Dremio was to ensure the schema type is clearly written as timestamp. What are we planning to do on pyiceberg's end regarding the timestamptz interpretation?