aws-sdk-pandas Iceberg partitioning based on transformed DataFrame columns not supported?

Open petebachant opened this issue 1 year ago • 1 comments

Describe the bug

I was hoping that something like partition_cols=["user_id", "month(ts)"] would work nicely in athena.to_iceberg, but I end up with a KeyError noting that "month(ts)" doesn't exist in the DataFrame. However, the table is created properly based on the output of athena.show_create_table.

How to Reproduce

Use a partitioning function in one of the partition_cols passed into to_iceberg.
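A minimal reproduction might look like the sketch below. The DataFrame contents, database name, table name, and S3 paths are placeholders I made up for illustration, and the call is wrapped in a function so nothing touches AWS until it is invoked.

```python
import pandas as pd

# Small illustrative DataFrame; the column names follow the issue text,
# the values are made up.
df = pd.DataFrame(
    {
        "user_id": [1, 2],
        "ts": pd.to_datetime(["2024-01-15", "2024-02-03"]),
        "value": [10.0, 20.0],
    }
)

# "month(ts)" is a partition transform expression, not a DataFrame column.
partition_cols = ["user_id", "month(ts)"]


def reproduce() -> None:
    """Attempt the write; requires AWS credentials and existing resources."""
    import awswrangler as wr  # imported lazily so the sketch loads without it

    wr.athena.to_iceberg(
        df=df,
        database="my_database",                     # hypothetical
        table="my_table",                           # hypothetical
        table_location="s3://my-bucket/my_table/",  # hypothetical
        temp_path="s3://my-bucket/tmp/",            # hypothetical
        partition_cols=partition_cols,
    )
```

Calling reproduce() on version 3.6.0 is what gives me the KeyError for "month(ts)".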

Expected behavior

I expect to not get a KeyError.

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.11

AWS SDK for pandas version

3.6.0

Additional context

No response

Feb 22 '24 21:02 petebachant

Hi @petebachant, unfortunately only plain partition columns are supported at the moment (no functions). You can transform the data in your DataFrame prior to the insert.
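As a rough sketch of that workaround, assuming a datetime column named ts: derive a real column from it and partition on that column instead. The name ts_month and the sample data are hypothetical. Note that this partitions on the stored values of the derived column, which is not the same thing as an Iceberg month(ts) partition transform.

```python
import pandas as pd

# Illustrative data only; the column names follow the issue text.
df = pd.DataFrame(
    {
        "user_id": [1, 2],
        "ts": pd.to_datetime(["2024-01-15", "2024-02-03"]),
    }
)

# Materialize the transformed value as an ordinary column.
# "ts_month" is a hypothetical name; year-month keeps different years apart.
df["ts_month"] = df["ts"].dt.strftime("%Y-%m")

# Every entry is now a real DataFrame column, so a lookup by name succeeds.
partition_cols = ["user_id", "ts_month"]
missing = [col for col in partition_cols if col not in df.columns]
```

The resulting partition_cols list can then be passed to athena.to_iceberg in place of the version containing "month(ts)".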

Feb 26 '24 18:02 kukushking

Hi @kukushking, that's not what the documentation says:

https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.athena.to_iceberg.html#awswrangler.athena.to_iceberg

I was using append mode successfully a couple of days ago on version 3.4, but it now fails on version 3.6. I tried to check the wrangler code, but I don't fully understand why I didn't get an error in 3.4, since the code involved in this error doesn't seem to have been updated since that version.

Feb 29 '24 15:02 alvaro-ponce

OK, it was 3.3.0, not 3.4 as I said before. I have checked, and in version 3.3.0 I can append multiple times with a function in partition_cols.

I believe the problem appeared in version 3.4.1, with the addition of the _determine_differences function in _write_iceberg.py.

Feb 29 '24 16:02 alvaro-ponce

Hey, thanks for bringing this up. You are correct, this feature was broken in the PR you tagged. I'm looking into whether we can fix this feature. If not, I will update the documentation.

Feb 29 '24 17:02 LeonLuttenberger

Thanks for the quick response and work!

You are awesome!

Feb 29 '24 18:02 alvaro-ponce
