aws-sdk-pandas

Wrangler to support Hudi/Iceberg datasets read/write

Open anandshah123 opened this issue 2 years ago • 8 comments

Is your idea related to a problem? Please describe. No

Describe the solution you'd like It would be good to have support for CDC data lake formats like Apache Hudi, Apache Iceberg or Delta Lake.


anandshah123 avatar Jul 22 '22 09:07 anandshah123

Thanks for raising this, two options are currently available in the library to handle CDC operations:

  • AWS Glue Governed Tables
  • Apache Iceberg is natively supported via Athena, meaning you can use existing wr.athena.* methods to create, update and delete Iceberg tables

Delta Lake and Hudi are not on our roadmap at the moment because they lack native support in AWS Glue. That being said, PRs are always welcome if you have a specific implementation in mind :)

jaidisido avatar Jul 22 '22 10:07 jaidisido
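
As a rough illustration of the Athena route described in the comment above (not code from the thread), the sketch below builds the SQL for creating, updating and deleting rows in an Iceberg table and submits it with `wr.athena.start_query_execution`. The database, table, column names and S3 location are placeholders, and it assumes an Athena workgroup whose engine version supports Iceberg `UPDATE` and `DELETE`.

```python
from typing import List


def iceberg_statements(database: str, table: str, location: str) -> List[str]:
    """Build Athena SQL that creates an Iceberg table, then updates and deletes rows."""
    full_name = f"{database}.{table}"
    # Placeholder schema: two columns chosen only for illustration.
    create = (
        f"CREATE TABLE {full_name} (id bigint, status string) "
        f"LOCATION '{location}' "
        "TBLPROPERTIES ('table_type'='ICEBERG')"
    )
    update = f"UPDATE {full_name} SET status = 'closed' WHERE id = 1"
    delete = f"DELETE FROM {full_name} WHERE id = 2"
    return [create, update, delete]


def run_statements(statements: List[str], database: str) -> None:
    """Submit each statement to Athena in order (requires AWS credentials)."""
    # Imported lazily so the SQL builder works without awswrangler installed.
    import awswrangler as wr

    for sql in statements:
        # wait=True blocks until Athena has finished the statement.
        wr.athena.start_query_execution(sql=sql, database=database, wait=True)
```

Calling `run_statements(iceberg_statements("my_db", "orders", "s3://my-bucket/orders/"), "my_db")` would run the three statements in sequence; the names here are hypothetical.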

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] avatar Sep 20 '22 12:09 github-actions[bot]

With the release of Glue 4.0, it appears there is "support for Apache Hudi, Apache Iceberg, and Delta Lake formats" with AWS Glue. Will this make it possible to implement this feature now?

AdrianoNicolucci avatar Nov 29 '22 01:11 AdrianoNicolucci

  • Apache Iceberg is natively supported via Athena, meaning you can use existing wr.athena.* methods to create, update and delete Iceberg tables

@jaidisido is there any way to use wr.athena.* to update an Iceberg table with a pandas DataFrame then? In the docs I can only see examples for reading DataFrames...

cdelamocepsa avatar Jan 05 '23 14:01 cdelamocepsa

When https://github.com/apache/iceberg/issues/6564 is implemented, it might be possible to write in Iceberg format natively using Python, without any help from external processing systems like Spark/Athena/Trino.

nicor88 avatar Feb 13 '23 14:02 nicor88

Without Athena, could we have a more seamless integration for Wrangler on all transactional formats e.g.

  • wr.create_table (format='hudi' ...)
  • wr.create_table (format='iceberg' ...)
  • wr.create_table (format='deltalake' ...)

apopata-aws avatar May 17 '23 15:05 apopata-aws

Hi @cdelamocepsa, it is now possible to write into Iceberg using Athena since release 3.1; see the Athena Iceberg tutorial.

kukushking avatar Jun 08 '23 13:06 kukushking
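
A minimal sketch of what such a write might look like, assuming awswrangler 3.1 or later and its `wr.athena.to_iceberg` function. The helper names, the bucket layout and the database and table names are hypothetical and not taken from the thread or the tutorial.

```python
from typing import Any, Dict


def iceberg_write_args(database: str, table: str, bucket: str) -> Dict[str, Any]:
    """Assemble keyword arguments for wr.athena.to_iceberg (hypothetical S3 layout)."""
    return {
        "database": database,
        "table": table,
        # Where the Iceberg table's data and metadata live.
        "table_location": f"s3://{bucket}/{table}/",
        # Scratch prefix the library uses while loading the DataFrame.
        "temp_path": f"s3://{bucket}/temp/{table}/",
    }


def write_to_iceberg(df: Any, args: Dict[str, Any]) -> None:
    """Write a pandas DataFrame into an Iceberg table through Athena."""
    # Imported lazily so the argument builder works without awswrangler installed.
    import awswrangler as wr

    wr.athena.to_iceberg(df=df, **args)
```

With AWS credentials configured, `write_to_iceberg(df, iceberg_write_args("my_db", "orders", "my-bucket"))` would load `df` into the table.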

Hi @cdelamocepsa, it is now possible to write into Iceberg since release 3.1; see the Athena Iceberg tutorial.

@kukushking I see that you need to specify a temp_path. I suppose this method writes the data to a temporary Glue table and then inserts into the Iceberg table from that temp table.

I'm concerned about the efficiency of this. Do you have any input on how it will behave in terms of latency/cost?

cdelamocepsa avatar Jun 08 '23 13:06 cdelamocepsa
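
To make the staging pattern supposed in the question above concrete, here is a sketch of the kind of statement it implies. This is an assumption about the general approach, not a confirmed description of the library's internals, and the function and table names are hypothetical.

```python
def staged_insert_sql(database: str, target: str, staging: str) -> str:
    """Build an Athena INSERT that copies rows from a staging table into the target."""
    return (
        f'INSERT INTO "{database}"."{target}" '
        f'SELECT * FROM "{database}"."{staging}"'
    )
```

Under this assumption the data is written to S3 once as staging files and then scanned once by Athena during the `INSERT`, so latency and cost would grow with the size of each batch.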