aws-sdk-pandas
`athena.to_parquet` fails when `mode=overwrite_partitions` and `partition_cols` contains something like `hour(timestamp_col)`.
Describe the bug
When using `s3.to_parquet` to update a Parquet table that is partitioned by a time interval or a timestamp "attribute" (such as year, month, or hour), the call fails. For this mode, the implementation assumes that every entry in `partition_cols` is the name of a Parquet/table column, so it cannot find an entry like `hour(column)` among the dataframe columns.
I think the problem is this line, which calls `delete_from_iceberg_table`, a function that expects plain column names.
How to Reproduce
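The report does not include a reproduction, so here is a hypothetical minimal sketch of the failure. It assumes the call is `wr.athena.to_iceberg` (per the maintainer's reply below); the database, table, and bucket names are placeholders, and the AWS calls are wrapped in a function since they require credentials:

```python
import pandas as pd

# Toy dataframe with a timestamp column to partition on.
df = pd.DataFrame(
    {
        "ts": pd.to_datetime(["2024-01-01 10:15", "2024-01-01 11:30"]),
        "value": [1, 2],
    }
)

def reproduce():
    import awswrangler as wr  # requires AWS credentials to run

    # Reportedly works with mode="append" or mode="overwrite":
    wr.athena.to_iceberg(
        df=df,
        database="my_db",           # placeholder
        table="my_table",           # placeholder
        table_location="s3://my-bucket/my_table/",
        temp_path="s3://my-bucket/tmp/",
        partition_cols=["hour(ts)"],
        mode="append",
    )

    # Reportedly fails: before overwriting, the implementation looks for
    # a literal "hour(ts)" column in the dataframe when deleting the
    # affected partitions, and no such column exists.
    wr.athena.to_iceberg(
        df=df,
        database="my_db",
        table="my_table",
        table_location="s3://my-bucket/my_table/",
        temp_path="s3://my-bucket/tmp/",
        partition_cols=["hour(ts)"],
        mode="overwrite_partitions",
    )
```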
Expected behavior
I expect the `partition_cols` option to accept anything that can be used to partition a Parquet table. In particular, it should accept anything that is accepted when `mode` is `append` or `overwrite` instead of `overwrite_partitions`.
Your project
No response
Screenshots
No response
OS
Ubuntu 22.04
Python version
3.10
AWS SDK for pandas version
3.7.3
Additional context
No response
Hey,
Unfortunately, because this implementation of `to_iceberg` relies on a mix of pandas operations and Athena queries, we can't currently support using a partition transform function with `mode="overwrite_partitions"`. However, we are exploring other APIs for refactoring `to_iceberg`, such as PyIceberg or other AWS Glue APIs, which would allow us to support this in the future.
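Until then, one possible workaround (a sketch, not an official API; all database, table, and bucket names are placeholders) is to materialize the transform as a regular column, so that `overwrite_partitions` sees a plain column name it can match:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ts": pd.to_datetime(["2024-01-01 10:15", "2024-01-01 11:30"]),
        "value": [1, 2],
    }
)

# Derive the partition value explicitly instead of relying on hour(ts).
df["ts_hour"] = df["ts"].dt.hour

def write():
    import awswrangler as wr  # requires AWS credentials to run

    wr.athena.to_iceberg(
        df=df,
        database="my_db",              # placeholder
        table="my_table",              # placeholder
        table_location="s3://my-bucket/my_table/",
        temp_path="s3://my-bucket/tmp/",
        partition_cols=["ts_hour"],    # plain column name, so the
        mode="overwrite_partitions",   # partition-delete step can match it
    )
```

The trade-off is an extra physical column in the table, and the caller must keep it consistent with `ts` on every write.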
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.
Hi @LeonLuttenberger,
I understand that it would be difficult to solve the issue, but should it be closed due to inactivity while it is still unsolved?