aws-sdk-pandas
`athena.to_parquet` fails when `mode=overwrite_partitions` and `partition_cols` contains something like `hour(timestamp_col)`.
Describe the bug
When using `s3.to_parquet` to update a Parquet table that is partitioned by a time interval or a timestamp "attribute" (such as year, month, or hour), the call fails. For this mode, the implementation assumes that every entry in `partition_cols` is the name of a Parquet/table column, so it cannot find an entry like `hour(column)` among the dataframe columns.
I think the problem is this line, which calls `delete_from_iceberg_table`, a function that expects plain column names.
How to Reproduce
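The report does not include a reproduction, so here is a hypothetical minimal sketch of the failure. It assumes the call is `wr.athena.to_iceberg` (per the maintainer's reply below); the database, table, and bucket names are placeholders, and the AWS calls are wrapped in a function since they require credentials:

```python
import pandas as pd

# Toy dataframe with a timestamp column to partition on.
df = pd.DataFrame(
    {
        "ts": pd.to_datetime(["2024-01-01 10:15", "2024-01-01 11:30"]),
        "value": [1, 2],
    }
)

def reproduce():
    import awswrangler as wr  # requires AWS credentials to run

    # Reportedly works with mode="append" or mode="overwrite":
    wr.athena.to_iceberg(
        df=df,
        database="my_db",           # placeholder
        table="my_table",           # placeholder
        table_location="s3://my-bucket/my_table/",
        temp_path="s3://my-bucket/tmp/",
        partition_cols=["hour(ts)"],
        mode="append",
    )

    # Reportedly fails: before overwriting, the implementation looks for
    # a literal "hour(ts)" column in the dataframe when deleting the
    # affected partitions, and no such column exists.
    wr.athena.to_iceberg(
        df=df,
        database="my_db",
        table="my_table",
        table_location="s3://my-bucket/my_table/",
        temp_path="s3://my-bucket/tmp/",
        partition_cols=["hour(ts)"],
        mode="overwrite_partitions",
    )
```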
Expected behavior
I expect the `partition_cols` option to accept anything that can be used to partition a Parquet table. In particular, it should accept anything that is accepted when `mode` is `append` or `overwrite` instead of `overwrite_partitions`.
Your project
No response
Screenshots
No response
OS
Ubuntu 22.04
Python version
3.10
AWS SDK for pandas version
3.7.3
Additional context
No response
Hey,
Unfortunately, because this implementation of `to_iceberg` relies on a mix of pandas operations and Athena queries, we can't currently support using a partition transform function with `mode="overwrite_partitions"`. However, we are exploring other APIs for refactoring `to_iceberg`, such as PyIceberg or other AWS Glue APIs, which would allow us to support this in the future.
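Until then, one possible workaround (a sketch, not an official API; all database, table, and bucket names are placeholders) is to materialize the transform as a regular column, so that `overwrite_partitions` sees a plain column name it can match:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ts": pd.to_datetime(["2024-01-01 10:15", "2024-01-01 11:30"]),
        "value": [1, 2],
    }
)

# Derive the partition value explicitly instead of relying on hour(ts).
df["ts_hour"] = df["ts"].dt.hour

def write():
    import awswrangler as wr  # requires AWS credentials to run

    wr.athena.to_iceberg(
        df=df,
        database="my_db",              # placeholder
        table="my_table",              # placeholder
        table_location="s3://my-bucket/my_table/",
        temp_path="s3://my-bucket/tmp/",
        partition_cols=["ts_hour"],    # plain column name, so the
        mode="overwrite_partitions",   # partition-delete step can match it
    )
```

The trade-off is an extra physical column in the table, and the caller must keep it consistent with `ts` on every write.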
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.
Hi @LeonLuttenberger,
I understand that it would be difficult to solve the issue, but should it be closed due to inactivity while it is still unsolved?