iceberg-python

Support iceberg hadoop catalog in python library

Open Fokko opened this issue 2 years ago • 5 comments

Feature Request / Improvement

Migrated ticket https://github.com/apache/iceberg/issues/3220

Check the original ticket for details.

Fokko avatar Oct 02 '23 09:10 Fokko

We use the Hadoop catalog on a filesystem with atomic move support. Would you accept a contributed Hadoop catalog for PyIceberg?

brianfromoregon avatar Feb 09 '24 18:02 brianfromoregon

@Fokko forgot the ping. Thanks

brianfromoregon avatar Feb 14 '24 00:02 brianfromoregon

This would really help us out. We use the Hadoop catalog for unit testing PySpark code, and we increasingly run into cases where we want to test code that uses both pyiceberg and pyspark and expects them to share the same catalog.

corleyma avatar Feb 24 '24 00:02 corleyma

Hey @brianfromoregon, we try to avoid implementing the Hadoop catalog in PyIceberg. It works differently from the other catalogs, since its conflict detection relies on the atomic renames of HDFS.
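To illustrate why the atomic-rename dependency matters, here is a minimal sketch of rename-style conflict detection on a local filesystem. It is not PyIceberg or Hadoop code: `commit_version`, `CommitConflict`, and the directory layout are hypothetical names for illustration, and the `v<N>.metadata.json` naming is only loosely modelled on the Hadoop catalog. It assumes a filesystem where creating a hard link is atomic and fails if the target exists, which is exactly the kind of guarantee object stores typically do not give.

```python
import json
import os
import uuid


class CommitConflict(Exception):
    """Raised when another writer has already committed the target version."""


def commit_version(table_dir: str, version: int, metadata: dict) -> str:
    """Publish metadata as v<version>.metadata.json, failing if it already exists.

    Hypothetical helper for illustration only; not part of any library.
    """
    os.makedirs(table_dir, exist_ok=True)
    final_path = os.path.join(table_dir, f"v{version}.metadata.json")
    temp_path = os.path.join(table_dir, f".{uuid.uuid4().hex}.tmp")

    # Write the complete file under a unique temporary name first,
    # so readers never observe a half-written metadata file.
    with open(temp_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f)

    try:
        # os.link raises FileExistsError if final_path is already present,
        # so only one writer can claim a given version number.
        # (os.rename would silently overwrite the target on POSIX.)
        os.link(temp_path, final_path)
    except FileExistsError as exc:
        raise CommitConflict(f"version {version} already committed") from exc
    finally:
        # The temporary name is no longer needed in either case.
        os.unlink(temp_path)
    return final_path
```

Two writers racing to commit the same version would both call `commit_version(table_dir, 1, ...)`; the second call raises `CommitConflict` and the first writer's file is left untouched. The whole scheme stands or falls with that one filesystem guarantee, which is what sets this style of catalog apart from ones that delegate the commit to a database or service.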

@corleyma Have you tried running a simple REST catalog similar to what we do in the PyIceberg test setup?

Fokko avatar Feb 28 '24 08:02 Fokko

@Fokko We do a setup similar to this for integration tests, but the ability to write faster unit tests that depend only on a temp directory fixture in pytest has been great for our PySpark code.

We had separately been using an InMemoryCatalog for unit tests of certain pyiceberg code, but now that we have more functions commingling pyspark and pyiceberg (DDL/metadata manipulation in pyiceberg), we are running into the limits of pyiceberg not supporting the Hadoop catalog.

I would love it if we could add a file system catalog to PyIceberg that is compatible with PySpark and HadoopCatalog. It could be named and documented in whatever way is needed to make sure folks know it's not a production catalog, but I think it's legitimately useful for testing purposes.

corleyma avatar Feb 29 '24 07:02 corleyma

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Aug 28 '24 00:08 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Sep 11 '24 00:09 github-actions[bot]