DayTransform failure for downcasted timestamp column
### Apache Iceberg version
0.8.1 (latest release)
### Please describe the bug 🐞
Given a simple schema with a DayTransform over a Timestamp column, writing data raises an `ArrowNotImplementedError` caused by mixing downcasted (microsecond) and non-downcasted (nanosecond) timestamps.
This would also be resolved with support for nanosecond timestamp types.
```python
import pyarrow as pa

from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import NestedField, TimestampType

schema = Schema(
    NestedField(field_id=1, name='timestamp', field_type=TimestampType(), required=False),
)
partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1001, transform=DayTransform(), name='date'),
)

# Arrow data with nanosecond precision (renamed to avoid shadowing the Iceberg schema above)
arrow_schema = pa.schema([pa.field('timestamp', pa.timestamp('ns'))])
data = pa.Table.from_pydict({'timestamp': [1]}, schema=arrow_schema)
table.append(data)
```
```
.local/lib/python3.10/site-packages/pyiceberg/table/__init__.py:984: in append
    tx.append(df=df, snapshot_properties=snapshot_properties)
.local/lib/python3.10/site-packages/pyiceberg/table/__init__.py:417: in append
    for data_file in data_files:
.local/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py:2636: in _dataframe_to_data_files
    partitions = _determine_partitions(spec=table_metadata.spec(), schema=table_metadata.schema(), arrow_table=df)
.local/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py:2715: in _determine_partitions
    partition_values_table = pa.table({
.local/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py:2716: in <dictcomp>
    str(partition.field_id): partition.transform.pyarrow_transform(field.field_type)(arrow_table[field.name])
.local/lib/python3.10/site-packages/pyiceberg/transforms.py:560: in <lambda>
    return lambda v: pc.days_between(pa.scalar(epoch), v) if v is not None else None
.local/lib/python3.10/site-packages/pyarrow/compute.py:247: in wrapper
    return func.call(args, None, memory_pool)
pyarrow/_compute.pyx:393: in pyarrow._compute.Function.call
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
E   pyarrow.lib.ArrowNotImplementedError: Function 'days_between' has no kernel matching input types (timestamp[us], timestamp[ns])
```
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
Currently Iceberg does not have nanosecond support: https://py.iceberg.apache.org/configuration/#nanoseconds-support
To compensate, you can automatically downcast to microseconds on write by setting the config described in the doc.
Thanks Kevin, I should have been more clear. I am using the downcasting functionality, which works great for unpartitioned tables. Even an IdentityTransform() partition over a downcasted timestamp column works fine, but any of the time truncations like DayTransform or YearTransform result in the arrow kernel error above.
A table with this partition spec handles downcasting timestamps correctly on write:
```python
partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name='ts'),
)
```
But this partition spec throws the error:
```python
partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1001, transform=DayTransform(), name='date'),
)
```
The write code:
```python
import os

import pyarrow as pa

os.environ['PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE'] = '1'

schema = pa.schema([pa.field('timestamp', pa.timestamp('ns'))])
data = pa.Table.from_pydict({'timestamp': [1]}, schema=schema)
table.append(data)
```
I had the same issue with all types of time partitions and data with `ns` unit. Looking at the code, it seems the downcast property is used when checking the schema and when writing the table data, but not when determining the partitions, so I get errors like:
```
Function 'years_between' has no kernel matching input types (timestamp[us], timestamp[ns])
```
I have the exact same error: trying to append a table with `timestamp[ns]` into an Iceberg table with `timestamp` (so, relying on the automatic downcast), but it fails because the PartitionSpec also includes a DayTransform(), as the OP described.
```yaml
# .pyiceberg.yaml
# ...
downcast-ns-timestamp-to-us-on-write: true
```
```
Iceberg does not yet support 'ns' timestamp precision. Downcasting to 'us'.
Function 'days_between' has no kernel matching input types (timestamp[us], timestamp[ns])
```