Support for AWS Glue as an alternative Hive metastore implementation

Open ryanrupp opened this issue 6 years ago • 5 comments

Similar to the functionality in Presto, I was wondering whether Glue could be substituted in as an alternative implementation of a Hive metastore. Looking at the current HiveTableOperations, it relies on:

get table
create table
alter table
an exclusive lock

The locking mechanism would be the problematic part, as I don't believe an equivalent API is available in Glue. Possibly there's another approach, or another service (e.g. DynamoDB) could be used for the locking functionality.
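
To make the DynamoDB idea a bit more concrete, here is a minimal sketch of an exclusive lock built on a conditional write. Everything in it is hypothetical: `ConditionalStore`, `InMemoryStore` and `ConditionalWriteLock` are illustrative names, and the in-memory map only stands in for a store with atomic put-if-absent semantics. No AWS SDK calls are used, and lease expiry for crashed lock holders is left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical abstraction over a store that supports atomic conditional writes.
interface ConditionalStore {
    // Writes key -> owner only if the key is absent; returns true on success.
    boolean putIfAbsent(String key, String owner);

    // Deletes the key only if it is currently held by owner; returns true on success.
    boolean deleteIfOwner(String key, String owner);
}

// In-memory stand-in for illustration only (assumption: a real service
// would provide the same atomicity guarantees remotely).
class InMemoryStore implements ConditionalStore {
    private final ConcurrentMap<String, String> items = new ConcurrentHashMap<>();

    @Override
    public boolean putIfAbsent(String key, String owner) {
        return items.putIfAbsent(key, owner) == null;
    }

    @Override
    public boolean deleteIfOwner(String key, String owner) {
        return items.remove(key, owner);
    }
}

// Exclusive, non-blocking lock keyed by table name.
class ConditionalWriteLock {
    private final ConditionalStore store;
    private final String owner;

    ConditionalWriteLock(ConditionalStore store, String owner) {
        this.store = store;
        this.owner = owner;
    }

    // Returns true if this owner now holds the lock for the table.
    boolean tryLock(String table) {
        return store.putIfAbsent(table, owner);
    }

    // Returns true only if this owner held the lock and released it.
    boolean unlock(String table) {
        return store.deleteIfOwner(table, owner);
    }
}
```

With two writers sharing one store, only the first `tryLock` on a given table succeeds until the holder calls `unlock`; a real implementation would also need a timeout so that a crashed writer does not hold the lock forever.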

ryanrupp avatar Nov 28 '18 22:11 ryanrupp

I thought Glue exposed the same Thrift API that Hive uses. If that's the case, then we should be able to use the same lock API and code.

rdblue avatar Dec 07 '18 17:12 rdblue

I believe the API is partially implemented and doesn't include locking mechanisms, unfortunately. Looking into it a bit when running on Spark EMR for instance, the HiveMetaStoreClientFactory can be overridden to specify AWSGlueDataCatalogHiveClientFactory (see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html). The implementation used there implements the basic Hive metastore operations, e.g. create/alter/get table (calling back to the Glue public API), but an UnsupportedOperationException is thrown for the lock method.

So, I was thinking the lock piece could be abstracted out: the generic Hive implementation would use the lock method via the Hive metastore, while a Glue override could use some other mechanism. At this point it's mainly a limitation of the Glue implementation, but I wanted to toss this out there as a nice-to-have for people not running their own Hive metastore.
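
Roughly, the abstraction could look like the sketch below. All names here (`LockManager`, `UnsupportedLockManager`, `ExternalLockManager`, `TableCommitter`) are hypothetical and are not existing Iceberg, Hive or Glue classes. The first implementation only mimics a client whose lock method throws, as described above for Glue; the second is an in-process placeholder for whatever external mechanism would be chosen.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical lock abstraction that table operations would depend on.
interface LockManager {
    void lock(String table);

    void unlock(String table);
}

// Mimics a metastore client that does not support locking.
class UnsupportedLockManager implements LockManager {
    @Override
    public void lock(String table) {
        throw new UnsupportedOperationException("lock is not supported");
    }

    @Override
    public void unlock(String table) {
        throw new UnsupportedOperationException("unlock is not supported");
    }
}

// Placeholder for a lock kept outside the metastore (assumption: a real
// version would talk to an external service instead of a local set).
class ExternalLockManager implements LockManager {
    private final Set<String> held = ConcurrentHashMap.newKeySet();

    @Override
    public void lock(String table) {
        if (!held.add(table)) {
            throw new IllegalStateException("table is already locked: " + table);
        }
    }

    @Override
    public void unlock(String table) {
        held.remove(table);
    }
}

// Runs the metadata swap (get table, check, alter table) under the lock,
// without knowing which lock implementation is in use.
class TableCommitter {
    private final LockManager locks;

    TableCommitter(LockManager locks) {
        this.locks = locks;
    }

    void commit(String table, Runnable swapMetadata) {
        locks.lock(table);
        try {
            swapMetadata.run();
        } finally {
            locks.unlock(table);
        }
    }
}
```

The point of the split is that the commit path stays the same for both catalogs and only the `LockManager` passed in changes.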

ryanrupp avatar Dec 07 '18 21:12 ryanrupp

The Glue client source has now been made available for reference, see announcement. AWSCatalogMetastoreClient implements Hive's IMetaStoreClient and delegates to the GlueMetastoreClientDelegate, although this only implements a subset of the functionality, so lock, for instance, just throws an unsupported operation exception here.

ryanrupp avatar Feb 12 '19 18:02 ryanrupp

I think that Glue should implement locking as required by the interface it exposes. I'd be fine adding a solution specific to Glue in Iceberg as well, but I'm not sure what that would look like. Good to know that Glue won't work though.

rdblue avatar Feb 13 '19 01:02 rdblue

Looking into it a bit when running on Spark EMR for instance

I believe there is ongoing work to have the HiveMetaStoreClientFactory abstraction contributed to vanilla Apache Hive:

https://issues.apache.org/jira/browse/HIVE-12679

teabot avatar Feb 13 '19 09:02 teabot