aws-sdk-pandas
Iceberg partitioning based on transformed DataFrame columns not supported?
Describe the bug
I was hoping that something like partition_cols=["user_id", "month(ts)"] would work nicely in athena.to_iceberg, but I end up with a KeyError noting that "month(ts)" doesn't exist in the DataFrame. However, the table is created properly, based on the output of athena.show_create_table.
How to Reproduce
Use a partitioning function in one of the partition_cols passed into to_iceberg.
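A minimal sketch of the reproduction, assuming a small example DataFrame with user_id and ts columns. The helper names (build_frame, missing_columns, reproduce) and the database, table, and temp_path arguments are hypothetical placeholders, not part of the original report. The to_iceberg call sits inside a function that is not executed here, since it needs AWS credentials and an Athena setup.

```python
import pandas as pd

# Partition spec from the report: one plain column and one transform expression.
PARTITION_COLS = ["user_id", "month(ts)"]


def build_frame() -> pd.DataFrame:
    """Build a tiny example frame (hypothetical data)."""
    return pd.DataFrame(
        {
            "user_id": [1, 2],
            "ts": pd.to_datetime(["2024-01-15 10:00:00", "2024-02-03 08:30:00"]),
            "value": [1.0, 2.0],
        }
    )


def missing_columns(df: pd.DataFrame, partition_cols: list[str]) -> list[str]:
    """Return the partition entries that are not literal DataFrame columns."""
    return [col for col in partition_cols if col not in df.columns]


def reproduce(database: str, table: str, temp_path: str) -> None:
    """Run the failing call; requires AWS access, so it is not invoked here."""
    import awswrangler as wr  # imported lazily so the rest runs without it

    wr.athena.to_iceberg(
        df=build_frame(),
        database=database,      # placeholder
        table=table,            # placeholder
        temp_path=temp_path,    # placeholder, e.g. an S3 prefix
        partition_cols=PARTITION_COLS,
    )


# "month(ts)" is a transform expression, not a column label in the frame.
print(missing_columns(build_frame(), PARTITION_COLS))
```

The last line shows that "month(ts)" is the only entry with no matching column, which matches the label named in the KeyError.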
Expected behavior
I expect to not get a KeyError.
Your project
No response
Screenshots
No response
OS
Mac
Python version
3.11
AWS SDK for pandas version
3.6.0
Additional context
No response
Hi @petebachant, unfortunately only plain partition columns are supported at the moment (no functions). You can transform the data in your DataFrame prior to the insert.
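As a rough sketch of that workaround, the month can be materialized as an ordinary column before the insert and then used as a plain partition column. The helper add_month_column and the column name ts_month are hypothetical names chosen for illustration. Note that this is an assumption-laden substitute: it stores an explicit string column, which is not the same as Iceberg's hidden month(ts) partition transform.

```python
import pandas as pd


def add_month_column(
    df: pd.DataFrame, ts_col: str = "ts", out_col: str = "ts_month"
) -> pd.DataFrame:
    """Return a copy of df with a 'YYYY-MM' string column derived from ts_col."""
    out = df.copy()
    # Parse to datetime first so string timestamps are handled as well.
    out[out_col] = pd.to_datetime(out[ts_col]).dt.strftime("%Y-%m")
    return out


df = pd.DataFrame(
    {
        "user_id": [1, 2],
        "ts": ["2024-01-15 10:00:00", "2024-02-03 08:30:00"],
        "value": [1.0, 2.0],
    }
)
df = add_month_column(df)

# Every entry is now a real column, so a column lookup cannot fail.
partition_cols = ["user_id", "ts_month"]

# The frame could then be passed on, for example (placeholders, needs AWS):
# wr.athena.to_iceberg(df=df, database="my_db", table="my_table",
#                      temp_path="s3://my-bucket/tmp/",
#                      partition_cols=partition_cols)
```

The trade-off is that queries must filter on ts_month explicitly to benefit from partition pruning, whereas a hidden transform would let Athena prune on ts directly.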
Hi @kukushking, that's not what the documentation says:
https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.athena.to_iceberg.html#awswrangler.athena.to_iceberg
I was using append mode successfully a couple of days ago on version 3.4, and now it fails on version 3.6. I tried to check the wrangler code, but I do not fully understand why I did not get the error in 3.4, since the code involved in this error does not seem to have been updated since version 3.4.
OK, it was 3.3.0, not 3.4 as I said before. I have checked, and in version 3.3.0 I can append multiple times with a function in the partition cols.
I believe the problem appeared in version 3.4.1 with the addition of the _determine_differences function in the _write_iceberg.py script.
Hey, thanks for bringing this up. You are correct, this feature was broken in the PR you tagged. I'm looking into whether we can fix this feature. If not, I will update the documentation.
Thanks for the quick response and work!
You are awesome!