iceberg-python
Support for timestamp downcasting when loading data to iceberg tables
Apache Iceberg version
0.7.0 (latest release)
Please describe the bug 🐞
As of release 0.7.0, pyiceberg tables support the new data-loading method `pyiceberg.table.Table.add_files()`. However, this method currently does not respect the well-documented setting `downcast-ns-timestamp-to-us-on-write`.
The setting `downcast-ns-timestamp-to-us-on-write` always defaults to False when `Table.add_files()` is used, even if the config file explicitly specifies `downcast-ns-timestamp-to-us-on-write: "True"`.
The environment variable `PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE` is not respected either.
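For reference, a minimal reproduction sketch of the failure (catalog name, table name, and file path are illustrative, not taken from my actual setup):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# Write a Parquet file whose timestamp column has nanosecond precision.
arrow_table = pa.table({"ts": pa.array([1, 2, 3], type=pa.timestamp("ns"))})
pq.write_table(arrow_table, "/tmp/ns_timestamps.parquet")

catalog = load_catalog("default")         # assumes a configured catalog
table = catalog.load_table("db.events")   # assumes an existing table with a microsecond timestamp column

# Raises "Iceberg does not yet support 'ns' timestamp precision ..." even when
# downcast-ns-timestamp-to-us-on-write is set to "True" in the config file or
# via PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE.
table.add_files(["/tmp/ns_timestamps.parquet"])
```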
I also have a patch ready for this; however, it seems I don't have permission to push a new branch and create a PR.
```diff
diff --git a/pyiceberg/io/pyarrow.py b/pyiceberg/io/pyarrow.py
index b99c3b1..153b8a5 100644
--- a/pyiceberg/io/pyarrow.py
+++ b/pyiceberg/io/pyarrow.py
@@ -2303,6 +2303,8 @@ def _check_pyarrow_schema_compatible(
 def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
+    from pyiceberg.table import DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE
+
     for file_path in file_paths:
         input_file = io.new_input(file_path)
         with input_file.open() as input_stream:
@@ -2313,7 +2315,12 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        downcast_ns_timestamp_to_us = Config().get_bool(DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE) or False
+        _check_pyarrow_schema_compatible(
+            schema,
+            parquet_metadata.schema.to_arrow_schema(),
+            downcast_ns_timestamp_to_us
+        )
         statistics = data_file_statistics_from_parquet_metadata(
             parquet_metadata=parquet_metadata,
```
Hi @fusion2222, thank you for raising this issue. We'd love to see your contribution to the project!
Could you try forking the project and then creating a new branch there to create the PR?
Here's the setup I currently use:
- Git clone the `apache/iceberg-python` repo (this repo)
- Fork the repo (`kevinjqliu/iceberg-python`)
- Add the forked repo as a remote named `kevinjqliu` (`git remote add kevinjqliu git@github.com:kevinjqliu/iceberg-python.git`)
- Create a new branch (`git checkout -b kevinjqliu/issue-1045`)
- Add changes and commit
- Push the branch to the forked repo (`git push kevinjqliu`)
- Come back to this repo's Github and open a PR
It would be good to add this to the Contributing docs :)
@kevinjqliu @sungwy can I help here? I faced the same issue while trying to write a CSV into Iceberg using `pyiceberg/catalog/sql.py`.
```python
import os
os.environ['PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE'] = 'true'
```
I think we can add this to the documentation
The documentation is at https://py.iceberg.apache.org/configuration/#nanoseconds-support
Do you think there's a better place to signal this to the users?
I've faced the same issue when loading data using the `Table.add_files` method. It fails with this error:
TypeError: Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write.
As @fusion2222 already mentioned, setting `PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE` to 'true' does not help. It's completely ignored, since its value is simply not passed to the `_check_pyarrow_schema_compatible` function when using `add_files`.
IMO this is a pretty serious bug, but the fix is super easy (as shown in @fusion2222's patch).
Can someone please merge the fix for this? If not, I guess I'll try to create a PR for the issue.
@rotem-ad I don't see an active PR for this issue. Would you like to open one? Happy to review.
@kevinjqliu I followed your instructions on creating a PR
https://github.com/apache/iceberg-python/pull/1569
All credit to @fusion2222 for this patch.
I would love to see this issue fixed ASAP. Thanks
The link mentioned by @lloyd-EA doesn't seem to work. I created another PR based on the patch from my previous comment, as you suggested @sungwy.
https://github.com/apache/iceberg-python/pull/1572
The only thing holding me back was writing unit tests. For a newcomer they seem complicated.
Hey folks, it looks like there are currently 2 open PRs for this issue:
- #1569
- #1572
Let's standardize on one of them and add a test case (see my comment here)
There's currently a test to show that add_files fails with ns timestamp, so we need to modify this test too
https://github.com/apache/iceberg-python/blob/3b53edc2ec523782735e8ddf54119ab8ef3c953d/tests/integration/test_add_files.py#L588
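For reference, here is a rough sketch of how that test could be adapted once the fix is in; the fixture names and the table-creation helper below are placeholders, not the actual test code:

```python
import pyarrow as pa
import pyarrow.parquet as pq


def test_add_files_with_ns_timestamp_downcast(tmp_path, monkeypatch):
    # Opt in to downcasting via the documented environment variable.
    monkeypatch.setenv("PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE", "true")

    file_path = str(tmp_path / "ns.parquet")
    pq.write_table(
        pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns"))}),
        file_path,
    )

    # `create_table_with_us_timestamp_column` is a placeholder for a fixture that
    # creates an Iceberg table whose `ts` field is a microsecond timestamp.
    table = create_table_with_us_timestamp_column()

    # With the fix, add_files should succeed instead of raising TypeError.
    table.add_files([file_path])
    assert table.scan().to_arrow().num_rows == 2
```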
At this point I've spent about 4 hours looking at the code to figure out how to adjust the unit test mentioned by @kevinjqliu, but the code is complicated and hard to navigate for a newcomer.
Even if I correct the exception-raising assertion, the old test still asserts on a warning message printed to stdout / stderr. The message looks like this:
Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write.
This is triggered by `pyiceberg.io.pyarrow._ConvertToIceberg.primitive`, which is called deep inside `Table.add_files`. Sadly, the code does not have comments explaining why this warning is shown. Is this some limitation in a third-party library? Or something else? I don't really know whether it is safe simply to remove it, or what the best course of action is regarding the assertion for this warning.
Hi @fusion2222 and @lloyd-EA - thank you both for jumping on the issue and contributing the PRs! I agree with @kevinjqliu here that it would be good to consolidate on a single PR before duplicating our efforts further.
I apologize that the purpose of the error message in the `_ConvertToIceberg` visitor doesn't make much sense without additional context. Here's my attempt at explaining it. The original PR to introduce ns downcasting only supports downcasting for write operations, because we are able to set the Parquet type to the timestamp logical type [1] with the desired downcasted us precision when we rewrite the data files, and hence it seemed more straightforward to implement.
`add_files`, on the other hand, does not rewrite the data file. So if we support adding data files with ns precision, we aren't actually rewriting the data files with us precision; we are keeping the data files with ns precision. Hence we will need to focus on verifying that we are able to read Parquet files added with ns precision when the Iceberg schema is of us precision.
The error "Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write." is thrown within the `_ConvertToIceberg` visitor, which visits a pyarrow schema and converts it to an Iceberg schema. It should be silenced when the `_downcast_ns_timestamp_to_us` attribute is set to True on the visitor.
[1] (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
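To make that concrete, here is a simplified illustration of what the visitor's timestamp handling boils down to; this is not the actual pyiceberg implementation (the function name `_visit_timestamp` is made up, and timezone-aware timestamps are omitted), just the concept:

```python
import pyarrow as pa
from pyiceberg.types import TimestampType


def _visit_timestamp(arrow_type: pa.TimestampType, downcast_ns_timestamp_to_us: bool) -> TimestampType:
    """Map a pyarrow timestamp type to the Iceberg timestamp type."""
    if arrow_type.unit == "us":
        return TimestampType()
    if arrow_type.unit == "ns" and downcast_ns_timestamp_to_us:
        # The Iceberg schema records the field as microseconds; with add_files the
        # underlying Parquet data keeps its nanosecond precision on disk.
        return TimestampType()
    raise TypeError(
        "Iceberg does not yet support 'ns' timestamp precision. "
        "Use 'downcast-ns-timestamp-to-us-on-write' configuration property "
        "to automatically downcast 'ns' to 'us' on write."
    )
```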
Thanks @sungwy! I'm working on your suggestion here. Just having a little trouble getting PySpark working on my local env, I'll see if I can fix it.
@fusion2222 I'm happy to merge your branch to mine and continue with my PR or you can merge mine to yours, continue with your PR and decline mine. I really don't mind :)
I will close my PR. The codebase seems not to be newcomer friendly and it seems like @lloyd-EA already has some experience with pyiceberg library.
Sorry you had that experience, @fusion2222! There's of course a lot of Iceberg-specific context here on this repository, and I'm hoping we can continue working to build a library that is easy for new contributors to join.
I'm cross-posting our finding from @lloyd-EA's PR, where we discovered that Spark-Iceberg does in fact have a problem reading ns timestamp data files when the Iceberg table field type is stored as a us timestamp in the schema. We will need to investigate the issue on the Spark-Iceberg side to understand whether this is a bug or expected behavior of Spark-Iceberg.
This may also mean that we may only be able to add ns timestamp files to a V3 Spec table
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Bump. @lloyd-EA is it possible to continue on this?
As @sungwy already pointed out, `add_files` does not rewrite the data. This means we would need to keep the data as written and then project it to microseconds when reading, but that is a rather complicated path.
With V3 coming up, I think it makes much more sense to map nanosecond timestamps to the V3 `TimestampNanoType`:
https://github.com/apache/iceberg-python/blob/ccaa15cf8ac1eb1202256d17efc9cf9d20723d20/pyiceberg/types.py#L767-L783
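For illustration, a field using that type could be declared like this once V3 tables are supported (a sketch only; the field name is arbitrary and V3 table-creation details may differ):

```python
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestampNanoType

# A nanosecond-precision timestamp column kept as-is, instead of downcasting to 'us'.
schema = Schema(
    NestedField(field_id=1, name="event_time", field_type=TimestampNanoType(), required=False),
)
```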
A similar issue has been raised today: https://github.com/apache/iceberg-python/issues/2270
I believe this is resolved by https://github.com/apache/iceberg-python/pull/2294