iceberg-python
Support for timestamp downcasting when loading data to iceberg tables
Apache Iceberg version
0.7.0 (latest release)
Please describe the bug 🐞
As of release 0.7.0, pyiceberg tables support the new data-loading method `pyiceberg.table.Table.add_files()`. However, this method currently does not respect the well-documented setting `downcast-ns-timestamp-to-us-on-write`.
The setting `downcast-ns-timestamp-to-us-on-write` always defaults to False when `Table.add_files()` is used, even if the config file explicitly specifies `downcast-ns-timestamp-to-us-on-write: "True"`.
The environment variable `PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE` is not respected either.
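For reference, a minimal reproduction sketch of the failure (catalog name, table name, and file path are illustrative, not taken from my actual setup):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# Write a Parquet file whose timestamp column has nanosecond precision.
arrow_table = pa.table({"ts": pa.array([1, 2, 3], type=pa.timestamp("ns"))})
pq.write_table(arrow_table, "/tmp/ns_timestamps.parquet")

catalog = load_catalog("default")         # assumes a configured catalog
table = catalog.load_table("db.events")   # assumes an existing table with a microsecond timestamp column

# Raises "Iceberg does not yet support 'ns' timestamp precision ..." even when
# downcast-ns-timestamp-to-us-on-write is set to "True" in the config file or
# via PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE.
table.add_files(["/tmp/ns_timestamps.parquet"])
```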
I also have a patch ready for this; however, it seems I don't have permission to push a new branch and create a PR.
```diff
diff --git a/pyiceberg/io/pyarrow.py b/pyiceberg/io/pyarrow.py
index b99c3b1..153b8a5 100644
--- a/pyiceberg/io/pyarrow.py
+++ b/pyiceberg/io/pyarrow.py
@@ -2303,6 +2303,8 @@ def _check_pyarrow_schema_compatible(
 def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
+    from pyiceberg.table import DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE
+
     for file_path in file_paths:
         input_file = io.new_input(file_path)
         with input_file.open() as input_stream:
@@ -2313,7 +2315,12 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        downcast_ns_timestamp_to_us = Config().get_bool(DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE) or False
+        _check_pyarrow_schema_compatible(
+            schema,
+            parquet_metadata.schema.to_arrow_schema(),
+            downcast_ns_timestamp_to_us
+        )
         statistics = data_file_statistics_from_parquet_metadata(
             parquet_metadata=parquet_metadata,
```
Hi @fusion2222, thank you for raising this issue. We'd love to see your contribution to the project!
Could you try forking the project and then creating a new branch there to create the PR?
Here's the setup I currently use:
- Git clone the `apache/iceberg-python` repo (this repo)
- Fork the repo (`kevinjqliu/iceberg-python`)
- Add the forked repo as a remote named `kevinjqliu` (`git remote add kevinjqliu git@github.com:kevinjqliu/iceberg-python.git`)
- Create a new branch (`git checkout -b kevinjqliu/issue-1045`)
- Add changes and commit
- Push the branch to the forked repo (`git push kevinjqliu`)
- Come back to this repo's Github and open a PR
It would be good to add this to the Contributing docs :)
@kevinjqliu @sungwy can I help here? I faced the same issue while trying to write a CSV into Iceberg using `pyiceberg/catalog/sql.py`.
```python
import os
os.environ['PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE'] = 'true'
```
I think we can add this to the documentation
The documentation is at https://py.iceberg.apache.org/configuration/#nanoseconds-support
Do you think there's a better place to signal this to the users?
I've faced the same issue when loading data using the `Table.add_files` method. It fails with this error:
TypeError: Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write.
As @fusion2222 already mentioned, setting `PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE` to 'true' does not help. It's completely ignored, since its value is simply not passed to the `_check_pyarrow_schema_compatible` function when using `add_files`.
IMO this is a pretty serious bug, but the fix is super easy (as shown in @fusion2222's patch).
Can someone please merge the fix for this? If not, I guess I'll try to create a PR for the issue.
@rotem-ad I don't see an active PR for this issue. Would you like to open one? Happy to review.
@kevinjqliu I followed your instructions on creating a PR
https://github.com/apache/iceberg-python/pull/1569
All credit to @fusion2222 for this patch.
I would love to see this issue fixed ASAP. Thanks
The link mentioned by @lloyd-EA doesn't seem to work. I created another PR based on the patch from my previous comment, as you suggested @sungwy.
https://github.com/apache/iceberg-python/pull/1572
The only thing holding me back was writing unit tests. For a newcomer they seem complicated.
Hey folks, it looks like there are currently 2 open PRs for this issue:
- #1569
- #1572
Let's standardize on one of them and add a test case (see my comment here)
There's currently a test to show that add_files fails with ns timestamp, so we need to modify this test too
https://github.com/apache/iceberg-python/blob/3b53edc2ec523782735e8ddf54119ab8ef3c953d/tests/integration/test_add_files.py#L588
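For reference, here is a rough sketch of how that test could be adapted once the fix is in; the fixture names and the table-creation helper below are placeholders, not the actual test code:

```python
import pyarrow as pa
import pyarrow.parquet as pq


def test_add_files_with_ns_timestamp_downcast(tmp_path, monkeypatch):
    # Opt in to downcasting via the documented environment variable.
    monkeypatch.setenv("PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE", "true")

    file_path = str(tmp_path / "ns.parquet")
    pq.write_table(
        pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns"))}),
        file_path,
    )

    # `create_table_with_us_timestamp_column` is a placeholder for a fixture that
    # creates an Iceberg table whose `ts` field is a microsecond timestamp.
    table = create_table_with_us_timestamp_column()

    # With the fix, add_files should succeed instead of raising TypeError.
    table.add_files([file_path])
    assert table.scan().to_arrow().num_rows == 2
```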
At this point I've spent about 4 hours looking at the code to figure out how to adjust the unit test mentioned by @kevinjqliu, but the code is complicated and hard to navigate for a newcomer.
Even if I correct the exception-raising assertion, the old test still asserts on a warning message printed to stdout / stderr. The message looks like this:
Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write.
This is triggered by `pyiceberg.io.pyarrow._ConvertToIceberg.primitive`, which is called deep inside `Table.add_files`. Sadly, the code does not have comments explaining why this warning is shown. Is this some limitation in a third-party library? Or something else? I don't really know whether it is safe simply to remove it, or what the best course of action is regarding the assertion for this warning.
Hi @fusion2222 and @lloyd-EA - thank you both for jumping on the issue and contributing the PRs! I agree with @kevinjqliu here that it would be good to consolidate on a single PR before duplicating our efforts further.
I apologize that the purpose of the error message in the `_ConvertToIceberg` visitor doesn't make much sense without additional context. Here's my attempt at explaining it. The original PR to introduce ns downcasting only supports downcasting for write operations, because we are able to set the Parquet type to the timestamp logical type [1] with the desired downcasted us precision when we rewrite the data files, and hence it seemed more straightforward to implement.
`add_files`, on the other hand, does not rewrite the data file. So if we support adding data files with ns precision, we aren't actually rewriting the data files with us precision; we are keeping the data files with ns precision. Hence we will need to focus on verifying that we are able to read Parquet files added with ns precision when the Iceberg schema is of us precision.
The error "Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write." is thrown within the `_ConvertToIceberg` visitor, which visits a pyarrow schema and converts it to an Iceberg schema. It should be silenced when the `_downcast_ns_timestamp_to_us` attribute is set to True on the visitor.
[1] (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
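To make that concrete, here is a simplified illustration of what the visitor's timestamp handling boils down to; this is not the actual pyiceberg implementation (the function name `_visit_timestamp` is made up, and timezone-aware timestamps are omitted), just the concept:

```python
import pyarrow as pa
from pyiceberg.types import TimestampType


def _visit_timestamp(arrow_type: pa.TimestampType, downcast_ns_timestamp_to_us: bool) -> TimestampType:
    """Map a pyarrow timestamp type to the Iceberg timestamp type."""
    if arrow_type.unit == "us":
        return TimestampType()
    if arrow_type.unit == "ns" and downcast_ns_timestamp_to_us:
        # The Iceberg schema records the field as microseconds; with add_files the
        # underlying Parquet data keeps its nanosecond precision on disk.
        return TimestampType()
    raise TypeError(
        "Iceberg does not yet support 'ns' timestamp precision. "
        "Use 'downcast-ns-timestamp-to-us-on-write' configuration property "
        "to automatically downcast 'ns' to 'us' on write."
    )
```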
Thanks @sungwy! I'm working on your suggestion here. Just having a little trouble getting PySpark working on my local env, I'll see if I can fix it.
@fusion2222 I'm happy to merge your branch to mine and continue with my PR or you can merge mine to yours, continue with your PR and decline mine. I really don't mind :)
I will close my PR. The codebase seems not to be newcomer friendly and it seems like @lloyd-EA already has some experience with pyiceberg library.
Sorry you had that experience, @fusion2222! There's of course a lot of Iceberg-specific context here on this repository, and I'm hoping we can continue working to build a library that is easy for new contributors to join.
I'm cross-posting our finding from @lloyd-EA's PR, where we discovered that Spark-Iceberg does in fact have a problem reading ns timestamp data files when the Iceberg table field type is stored as a us timestamp in the schema. We will need to investigate the issue on the Spark-Iceberg side to understand whether this is a bug or expected behavior of Spark-Iceberg.
This may also mean that we may only be able to add ns timestamp files to a V3 Spec table
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Bump. @lloyd-EA is it possible to continue on this?
As @sungwy already pointed out, `add_files` does not rewrite the data. This means we would need to keep the data as written and then project it to microseconds when reading, but that is a rather complicated path.
With V3 coming up, I think it makes much more sense to map nanosecond timestamps to the V3 `TimestampNanoType`:
https://github.com/apache/iceberg-python/blob/ccaa15cf8ac1eb1202256d17efc9cf9d20723d20/pyiceberg/types.py#L767-L783
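For illustration, a field using that type could be declared like this once V3 tables are supported (a sketch only; the field name is arbitrary and V3 table-creation details may differ):

```python
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestampNanoType

# A nanosecond-precision timestamp column kept as-is, instead of downcasting to 'us'.
schema = Schema(
    NestedField(field_id=1, name="event_time", field_type=TimestampNanoType(), required=False),
)
```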
A similar issue has been raised today: https://github.com/apache/iceberg-python/issues/2270
I believe this is resolved by https://github.com/apache/iceberg-python/pull/2294