any plan for Iceberg Table on S3?

Open Lindayangyy opened this issue 5 years ago • 16 comments

New to Apache Iceberg. We are looking for an Iceberg table or warehouse (catalog) implementation on S3. Is this possible without any reference to Hive and HDFS (Hadoop)? The current implementation seems tightly coupled with Hive and Hadoop.

Lindayangyy avatar Sep 16 '20 19:09 Lindayangyy

You can use it with S3 using only the Hadoop client libraries; you don't actually need a Hadoop cluster or HDFS.

RussellSpitzer avatar Sep 16 '20 19:09 RussellSpitzer

Supporting S3 requires Hive because of S3's eventual consistency. I see the open-source version of Delta Lake solved it in a different way, but it's pretty limited. (It assumes concurrent writes to S3 only happen within a single Spark driver: https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/storage/S3SingleDriverLogStore.scala)

HeartSaVioR avatar Sep 16 '20 22:09 HeartSaVioR

Iceberg works reliably with S3 even if the same table is accessed via multiple clusters and query engines. Using Iceberg requires a catalog that can swap a pointer to the metadata file atomically. This can be done using a compare-and-swap or lock/unlock API. Iceberg contains a built-in implementation that uses the Hive metastore to work with S3 reliably (lock/unlock). Anyone could easily build an integration for any catalog. For example, one may have a Cassandra-based catalog and use compare-and-swap to commit new table versions. That will be enough to work with S3 reliably.
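
To make the atomic pointer swap concrete, here is a minimal sketch in Python. It is not Iceberg's actual catalog API: `InMemoryCatalog`, `CommitFailed`, and `commit` are hypothetical names, and the in-process lock only stands in for whatever atomic primitive the real backing store provides.

```python
import threading
from typing import Callable, Dict, Optional


class CommitFailed(Exception):
    """Raised when the pointer moved after the writer last read it."""


class InMemoryCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file."""

    def __init__(self) -> None:
        # The lock stands in for the backing store's own atomicity guarantee.
        self._lock = threading.Lock()
        self._pointers: Dict[str, str] = {}

    def current(self, table: str) -> Optional[str]:
        with self._lock:
            return self._pointers.get(table)

    def compare_and_swap(self, table: str, expected: Optional[str], new: str) -> None:
        # Swap the pointer only if nobody else committed in the meantime.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitFailed(f"{table}: pointer is no longer {expected!r}")
            self._pointers[table] = new


def commit(
    catalog: InMemoryCatalog,
    table: str,
    write_metadata: Callable[[Optional[str]], str],
    max_attempts: int = 3,
) -> str:
    """Optimistic commit: read the base pointer, write new metadata, then swap."""
    for _ in range(max_attempts):
        base = catalog.current(table)
        # In a real system this would write a new metadata file to S3.
        new_location = write_metadata(base)
        try:
            catalog.compare_and_swap(table, base, new_location)
            return new_location
        except CommitFailed:
            continue  # another writer won the race; re-read and retry
    raise CommitFailed(f"{table}: gave up after {max_attempts} attempts")
```

Because the metadata file is written under a new name before the swap, a writer that loses the race leaves the table pointing at a complete, valid version; it only has to retry on top of the new base.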

aokolnychyi avatar Sep 16 '20 22:09 aokolnychyi

We've been working on a non-Hive way to provide this functionality and plan on contributing it to the project within the next two weeks.

jacques-n avatar Sep 16 '20 22:09 jacques-n

That will be awesome, can't wait to see it. Thank you, @jacques-n!

Lindayangyy avatar Sep 16 '20 22:09 Lindayangyy

Thanks for all the responses as alternatives. All answers are great!

Lindayangyy avatar Sep 16 '20 22:09 Lindayangyy

That sounds great! Assuming it still needs to do CAS against external storage (I'd be really curious if it doesn't rely on external storage), which one is it? Is it one of the AWS services? If so, even better, as there would be no external dependency outside of AWS, given that we're already locked in by assuming S3.

HeartSaVioR avatar Sep 16 '20 22:09 HeartSaVioR

We're doing something pluggable but the default implementation is on top of DynamoDB.

jacques-n avatar Sep 17 '20 00:09 jacques-n

Is it possible to write a JDBC-based catalog? That could unlock many catalog options.

ismailsimsek avatar Sep 20 '20 20:09 ismailsimsek

> We're doing something pluggable but the default implementation is on top of DynamoDB.

That's a good idea. I know that AWS Glue is backed by DynamoDB, so if you can make a catalog using Dynamo, then possibly the AWS team can implement the atomic swap in Glue. If I'm not mistaken, you'd need to use either read / write consistency or possibly a DynamoDB versioned object.
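
A rough sketch of the versioned-object idea, assuming a hypothetical DynamoDB table keyed by `table_id` with `metadata_location` and `version` attributes (none of these names come from Nessie or Iceberg). The helper only builds the arguments for a conditional `update_item` call, so it runs without AWS credentials:

```python
from typing import Any, Dict


def build_conditional_swap(
    table_name: str,
    table_id: str,
    expected_version: int,
    new_location: str,
) -> Dict[str, Any]:
    """Build update_item arguments that succeed only if the version is unchanged.

    All table and attribute names here are illustrative assumptions.
    """
    return {
        "TableName": table_name,
        "Key": {"table_id": {"S": table_id}},
        # Move the pointer and bump the version in one atomic update.
        "UpdateExpression": "SET #loc = :new_loc, #ver = :next_ver",
        # Reject the write if another committer already bumped the version.
        "ConditionExpression": "#ver = :expected_ver",
        "ExpressionAttributeNames": {
            "#loc": "metadata_location",
            "#ver": "version",
        },
        "ExpressionAttributeValues": {
            ":new_loc": {"S": new_location},
            ":next_ver": {"N": str(expected_version + 1)},
            ":expected_ver": {"N": str(expected_version)},
        },
    }
```

With a boto3 DynamoDB client this would be sent as `client.update_item(**request)`; if the condition fails, the call raises `ConditionalCheckFailedException`, which the writer would treat as a lost commit race and retry.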

Looking forward to seeing the DynamoDB catalog, as I assume many companies looking to write to S3 are also likely using DynamoDB. I know that my company uses DynamoDB a ton, so this would be a great workaround until there is Glue Catalog support (which I've been giving some thought to myself).

kbendick avatar Sep 30 '20 04:09 kbendick

Hi @jacques-n, this is Jack from AWS. We are planning to introduce a new iceberg-aws module, and we do plan to offer a Glue + DynamoDB implementation for Catalog and TableOperations. Since you say you already have something working, let's sync after you have a PR and see what the best way is to ship this all together 😃

jackye1995 avatar Oct 01 '20 02:10 jackye1995

Hey guys, we just posted more information on the new stuff we've been building for Iceberg + DynamoDB. You can check it out here: https://projectnessie.org/

We'll have a PR up against Iceberg shortly to contribute the Iceberg integrations: https://github.com/projectnessie/nessie/tree/main/clients/iceberg

jacques-n avatar Oct 01 '20 21:10 jacques-n

Very cool!

RussellSpitzer avatar Oct 01 '20 21:10 RussellSpitzer

I just sent out a PR for AWS Glue support. With this update you can use HiveCatalog without the need to set up any Hive infrastructure and build your data lake on top of S3. #1608

jackye1995 avatar Oct 13 '20 18:10 jackye1995

For anyone new to this issue, I think we have summarized all information in https://iceberg.apache.org/aws/, and we can close this issue. @Lindayangyy
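
For a quick idea of what the setup looks like, here is a small sketch that assembles Spark properties for a Glue-backed catalog on S3. The property keys and class names follow my reading of the AWS integration docs linked above and may differ between Iceberg versions; the catalog name and bucket path are placeholders.

```python
from typing import Dict, List


def glue_catalog_conf(catalog_name: str, warehouse: str) -> Dict[str, str]:
    """Spark properties for an Iceberg catalog backed by AWS Glue and S3."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,  # placeholder, e.g. s3://my-bucket/prefix
    }


def as_spark_flags(conf: Dict[str, str]) -> List[str]:
    """Render the properties as --conf arguments for spark-sql or spark-submit."""
    flags: List[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    conf = glue_catalog_conf("my_catalog", "s3://my-bucket/my/key/prefix")
    print(" ".join(as_spark_flags(conf)))
```

The Iceberg Spark runtime and AWS SDK jars still need to be on the classpath; check the docs page for the versions that match your Spark build.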

jackye1995 avatar Jun 21 '21 16:06 jackye1995

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Feb 25 '24 00:02 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Mar 11 '24 00:03 github-actions[bot]