iceberg-python

"Cannot promote timestamp to timestamptz" error when loading Dremio created table

Open chrisqiqiu opened this issue 2 months ago • 11 comments

Apache Iceberg version

0.10.0 (latest release)

Please describe the bug 🐞

My Python version is 3.12. When I use pyiceberg to read a table created by Dremio using the code below, I get a "Cannot promote timestamp to timestamptz" error, but if I use the same dataset to create the table in Spark, the table loads correctly.

```python
namespace = "test"
table_name = "Files_Created_by_Dremio"
table = nessie_catalog.load_table(f"{namespace}.{table_name}")

print(table.name())
print(table.schema())
table.scan().to_pandas()
```

```
('test', 'Files_Created_by_Dremio')

table {
  1: ID: optional long
  2: Name: optional string
  3: FileImportID: optional int
  4: ReceivedDateTime: optional timestamptz
  5: IsMoved: optional boolean
  6: MoveAttempt: optional int
  7: IsProcessed: optional boolean
  8: ProcessingAttempt: optional int
  9: TotalDataRows: optional int
  10: TotalProcessedDataRows: optional int
  11: TotalExcludedDataRows: optional int
  12: TotalAggregatedRows: optional int
  13: IsArchived: optional boolean
  14: ArchivalAttempt: optional int
  15: IsMovedToTransaction: optional boolean
  16: MovedToTransactionAttempt: optional int
  17: IsEnriched: optional boolean
  18: IsEnrichmentAttempt: optional int
  19: IsAEnriched: optional boolean
  20: AEnrichmentAttempt: optional int
  21: IsBEnriched: optional boolean
  22: BEnrichmentAttempt: optional int
  23: IsCEnriched: optional boolean
  24: CEnrichmentAttempt: optional int
  25: HasPendingEnrichment: optional boolean
  26: HasPendingAEnrichment: optional boolean
  27: HasPendingBEnrichment: optional boolean
  28: HasPendingCEnrichment: optional boolean
  29: FileBusinessDate: optional timestamptz
  30: IsUploaded: optional boolean
  31: IsValid: optional boolean
  32: Status: optional string
  33: StatusMessage: optional string
  34: LastActionDateTime: optional timestamptz
  35: ActedBy: optional int
  36: CreatedOn: optional timestamptz
  37: IsAttempedAll: optional boolean
  38: IsAttempedAll2: optional boolean
  39: IsClosed: optional boolean
  40: SparkUniqueID: optional string
}
```


```
ResolveError                              Traceback (most recent call last)
Cell In[3], line 12
      8 print(table.location())
      9 print(table.schema())
---> 12 table.scan().to_pandas()

   1465 def to_pandas(self, **kwargs: Any) -> pd.DataFrame:
   1466     """Read a Pandas DataFrame eagerly from this Iceberg table.
   1467
   1468     Returns:
   1469         pd.DataFrame: Materialized Pandas Dataframe from the Iceberg table
   1470     """
-> 1471     return self.to_arrow().to_pandas(**kwargs)

   1427 """Read an Arrow table eagerly from this DataScan.
   1428
   1429 All rows will be loaded into memory at once.
   (...)
   1432     pa.Table: Materialized Arrow Table from the Iceberg table's DataScan
   1433 """
   1434 from pyiceberg.io.pyarrow import ArrowScan
   1436 return ArrowScan(
   1437     self.table_metadata, self.io, self.projection(), self.row_filter, self.case_sensitive, self.limit
-> 1438 ).to_table(self.plan_files())

ResolveError: Cannot promote timestamp to timestamptz
```

The tables I created in Dremio and Spark are actually from the same dataset.

In Dremio:

```sql
CREATE TABLE nessie.test.Files_Created_by_Dremio AS SELECT * FROM nessie.test.Files
```

In Spark:

```sql
CREATE TABLE nessie.test.Files_Created_by_Spark AS SELECT * FROM nessie.test.Files
```

Willingness to contribute

  • [ ] I can contribute a fix for this bug independently
  • [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • [ ] I cannot contribute a fix for this bug at this time

chrisqiqiu avatar Oct 27 '25 14:10 chrisqiqiu

This is related to how Dremio creates the Parquet files, and how pyiceberg interprets them.

See this thread in Dremio's community forum.

Basically, Dremio only supports timestamp types. However, they mark the actual Parquet files with a UTC-adjustment flag, which pyiceberg in turn decides to interpret as TimestamptzType, even though the concrete schema type is TimestampType. This is a pyiceberg bug, at least the way I see it.

I think you should be able to work around it if you create a complementary table using pyiceberg with timestamptz columns, and copy the data over to it. I think I found the actual location in the pyiceberg source code where this manipulation happens, but have yet to open an issue.

sheinbergon avatar Oct 27 '25 23:10 sheinbergon

This might also be a Dremio issue https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/5 which was mentioned here as fixed https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/7

@chrisqiqiu could you give it a try with the new Dremio release?

pyiceberg does the proper schema conversion https://github.com/apache/iceberg-python/blob/8878b2c260bf62ad16ef6f8e8d3a6ae93103eff7/pyiceberg/utils/schema_conversion.py#L382-L386

kevinjqliu avatar Nov 03 '25 17:11 kevinjqliu

@kevinjqliu if you notice who opened that thread in the Dremio community forum, you'll see that he and I share the same avatar 😉

There was a bug in Dremio where they wrote an incorrect type to the schema; they fixed it in 26.0.5. Nonetheless, the bug in the schema_conversion.py module of PyIceberg remains, as it causes the library to incorrectly interpret the schema and then fail on type promotion later down the line.

If the column is a timestamp column (which is what Dremio supports ATM), why should the adjust-to-utc flag on a Parquet file column cause it to be read differently? Dremio may or may not set this flag, but it does mean to store a TZ-less timestamp.

If the schema defines a TimestampType column, and the data is a TimestampType column, I feel the library should not back itself into corners such as these, especially if the only gain is adding a UTC indication for timestamps, as the user did not intend to include TZ awareness in the schema.

That's why I perceive this issue as a PyIceberg one, rather than a Dremio one.

sheinbergon avatar Nov 04 '25 21:11 sheinbergon

Ah, I did not make that connection 😄

OK, I think I see what's going on here.

From the Dremio thread

Thank you for bringing this up and for sharing the details, @sheinbergon. You’re right - in Dremio OSS 26.0.0, Iceberg tables created with timestamp columns may be recorded in the metadata as timestamptz, even though Dremio only supports timestamp. As you noted, this can cause compatibility issues with external readers.

The issue is that Dremio incorrectly produced the metadata ~(in avro)~ which specified certain columns as timestamptz while the data files (in parquet) specified those columns as timestamp (without tz).

If that is the case, I believe pyiceberg is doing the right thing by failing on "type promotion". It's using the metadata as the source of truth and expects the data files to have the same type, in this case timestamptz. We cannot process timestamptz as timestamp or vice versa, since they are considered to be different data types.

If the column is a timestamp column (which is what Dremio supports ATM), why should the adjust-to-utc flag on a Parquet file column cause it to be read differently?

The spec uses adjust-to-utc as the signal for timestamp vs timestamptz, and considers these to be two different data types. See https://iceberg.apache.org/spec/#avro Furthermore, "Avro type annotation adjust-to-utc is an Iceberg convention; default value is false if not present."

kevinjqliu avatar Nov 05 '25 00:11 kevinjqliu

Please LMK if I'm interpreting this correctly! I think the proper fix is to align the metadata and data file with the same data type; otherwise the table may be deemed corrupt.

if i use the same dataset to create table in spark, the table can be loaded correctly.

My gut feeling says this is unintentional and might even be a bug. Let me double-check.

kevinjqliu avatar Nov 05 '25 00:11 kevinjqliu

You are interpreting this correctly. BTW, the spec applies at the Parquet level, not the Avro files.

If that is the case, I believe pyiceberg is doing the right thing by failing on "type promotion". Its using the metadata as the source of truth and expects the data files to have the same type, in this case timestamptz.

So the failure is correct, of course, but the decision to treat data coming from the data file as timestamptz, while the schema states it's expecting timestamp, is the cause of this issue, and is wrong IMO (it's also a PyIceberg decision).

sheinbergon avatar Nov 05 '25 10:11 sheinbergon

@kevinjqliu moreover, to me Iceberg is all about interoperability: tables are interacted with from multiple sources, and some vendors will have different approaches to writing data files. As long as schema types are preserved, it should be valid. I think PyIceberg's decision to treat data differently regardless of the schema compromises this principle.

sheinbergon avatar Nov 05 '25 10:11 sheinbergon

This might also be a Dremio issue https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/5 which was mentioned here as fixed https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/7

@chrisqiqiu could you give it a try with the new Dremio release?

pyiceberg does the proper schema conversion

iceberg-python/pyiceberg/utils/schema_conversion.py, lines 382 to 386 in 8878b2c:

```python
elif logical_type == "timestamp-micros":
    if avro_logical_type.get("adjust-to-utc", False) is True:
        return TimestamptzType()
    else:
        return TimestampType()
```

@kevinjqliu After upgrading Dremio to 26.0.5, it works fine now. Thanks, @sheinbergon!

chrisqiqiu avatar Nov 07 '25 15:11 chrisqiqiu

Thanks for confirming, @chrisqiqiu. Glad this is resolved. I see the fix from Dremio here. However, I believe there may be an issue with the fix: the TIMESTAMPMILLI data type should never set isAdjustedToUTC to true. I’ll raise a separate issue with Dremio regarding this.

As for the underlying problem, I uncovered a few interesting behaviors and learnings:

  1. Schema mismatch between the table and the data file is not compliant with the Iceberg spec and can lead to undefined behavior. Specifically:
  • Iceberg timestamp fields expect the Parquet schema to be timestamp with isAdjustedToUTC=false.
  • Iceberg timestamptz fields expect the Parquet schema to be timestamp with isAdjustedToUTC=true.
  2. Spark behavior: Spark allows reading mismatched timestamp/timestamptz from Parquet. This is mentioned above, and I also verified it locally.

  3. PyIceberg behavior: PyIceberg allows reading timestamp as Iceberg timestamptz (#2333), but not timestamptz as Iceberg timestamp. This is also noted above.

Some vendors will have different approaches to writing data files. As long as schema types are preserved, it should be valid.

Agreed, but I think this particular issue falls into a gray area. The writer is not writing in accordance with the Iceberg spec, which causes undefined behavior when reading. Everyone should write data in accordance with the Iceberg spec. On the read side, we may want to clarify this undefined behavior in the spec.

I think PyIceberg's decision to treat data differently regardless of the schema compromises this principle.

Agreed. PyIceberg currently behaves differently from Spark. Since this is not aligned with the spec, the behavior is undefined. However, I believe we can work toward aligning PyIceberg with Spark here.

kevinjqliu avatar Nov 09 '25 16:11 kevinjqliu

When reading parquet timestamp as Iceberg timestamptz (in #2333)

  • We ignore the isAdjustedToUTC=false flag and assume it to be true.
  • Semantically, the writer wrote wall time and the reader reads it as if it were UTC.

When reading parquet timestamptz as Iceberg timestamp (not yet implemented)

  • We can just ignore the isAdjustedToUTC=true flag, since Iceberg timestamp just reads the value.
  • Semantically, the writer wrote the timestamp adjusted to UTC and the reader reads it as wall time.
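To see why ignoring the flag changes meaning, consider a writer in UTC+2 recording the wall-clock time 09:00 as a naive value. A stdlib-only sketch (the times and offset are made up for illustration) of what happens if a reader then treats that value as a UTC instant:

```python
from datetime import datetime, timedelta, timezone

# The writer in UTC+2 meant the wall-clock time 09:00; naive values carry no zone.
wall = datetime(2024, 1, 1, 9, 0)

# The micros-since-epoch a reader decodes if it assumes the value is UTC.
micros = int(wall.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)
as_utc_instant = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)

# Viewed back in the writer's zone, the intended "09:00" has silently become 11:00.
local = as_utc_instant.astimezone(timezone(timedelta(hours=2)))
print(local.strftime("%H:%M"))  # 11:00
```

The stored integer never changes; only its interpretation does, which is exactly why silently flipping between timestamp and timestamptz can shift the meaning of every value by the writer's UTC offset.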

Both paths feel potentially dangerous, tbh.

kevinjqliu avatar Nov 09 '25 16:11 kevinjqliu

TODOs:

  • [x] reach out to Dremio about the fix (https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/9?u=kevinjqliu)
  • [ ] clarify the Iceberg spec when reading timestamptz as timestamp or timestamp as timestamptz
  • [ ] allow pyiceberg to read timestamptz as Iceberg timestamp (#2708)

kevinjqliu avatar Nov 09 '25 16:11 kevinjqliu

Dremio verified that the change is indeed correct. Dremio's TIMESTAMP type is UTC-normalized.

https://community.dremio.com/t/incorrect-timestamp-field-type-in-for-created-iceberg-table/13095/10?u=kevinjqliu

kevinjqliu avatar Nov 17 '25 16:11 kevinjqliu

@kevinjqliu great. However, they still state isAdjustedToUTC=true in the Parquet data files, so pyiceberg still treats timestamp fields as timestamptz. My original fix request to Dremio was to ensure the schema type is clearly written as timestamp. What are we planning to do on pyiceberg's end regarding the timestamptz interpretation?

sheinbergon avatar Nov 17 '25 19:11 sheinbergon