iceberg
Support for AWS Glue as an alternative Hive metastore implementation
Similar to the functionality in Presto, I was wondering if Glue can be substituted in as an alternative implementation of a Hive metastore. Looking at the current HiveTableOperations, it relies on:
get table
create table
alter table
an exclusive lock
The locking mechanism would be the problematic part as I don't believe an equivalent API is available in Glue. Possibly there's another approach or another service could be used for the locking functionality e.g. DynamoDB.
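To make the DynamoDB idea a bit more concrete, here is a minimal sketch of a lock built on an atomic "put if absent". It assumes the external store offers conditional writes (DynamoDB does), but no AWS API is called: an in-memory map stands in for the store, and `ConditionalPutLock`, `tryAcquire`, and `release` are hypothetical names, not anything in Iceberg, Hive, or Glue.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one lock entry per table name, acquired with an atomic
// "put if absent". The map is an in-memory stand-in for an external store
// with conditional writes; a real version would also need lock expiry.
class ConditionalPutLock {
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    // Returns an owner token on success, or null if the lock is already held.
    String tryAcquire(String tableName) {
        String token = UUID.randomUUID().toString();
        return owners.putIfAbsent(tableName, token) == null ? token : null;
    }

    // Removes the entry only if the caller's token still owns it.
    boolean release(String tableName, String token) {
        return owners.remove(tableName, token);
    }
}

class ConditionalPutLockDemo {
    public static void main(String[] args) {
        ConditionalPutLock lock = new ConditionalPutLock();
        String token = lock.tryAcquire("db.tbl");
        System.out.println("first acquire succeeded: " + (token != null));
        System.out.println("second acquire succeeded: " + (lock.tryAcquire("db.tbl") != null));
        System.out.println("released: " + lock.release("db.tbl", token));
    }
}
```

The owner token keeps one writer from releasing a lock held by another. A production version would also have to deal with writers that die while holding the lock, which this sketch ignores.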
I thought Glue exposed the same Thrift API that Hive uses. If that's the case, then we should be able to use the same lock API and code.
I believe the API is partially implemented and doesn't include locking mechanisms unfortunately. Looking into it a bit when running on Spark EMR for instance, the HiveMetaStoreClientFactory can be overridden to specify AWSGlueDataCatalogHiveClientFactory, see here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html. The implementation used there implements the basic Hive metastore operations e.g. create/alter/get table (calling back to the Glue public API) but UnsupportedOperationException is thrown for the lock method.
So, I was thinking the lock piece could be abstracted out where the generic Hive implementation uses the lock method via the Hive metastore but then a Glue override could use some other mechanism. So I guess mainly at this point it's a limitation of the Glue implementation but wanted to toss this out there as a nice to have for people not running their own Hive metastore.
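A rough sketch of what abstracting out the lock could look like. Everything here is hypothetical: `CommitLock`, `InProcessCommitLock`, and `SketchTableOperations` are illustrative names, the map stands in for the catalog, and the check-then-update inside `commit` is my assumption about what the exclusive lock protects, not a copy of the real HiveTableOperations code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical lock abstraction; a Hive version would call the metastore lock
// API, while a Glue version could plug in some other mechanism.
interface CommitLock {
    void lock(String tableName);
    void unlock(String tableName);
}

// In-process stand-in so the sketch runs without any metastore.
class InProcessCommitLock implements CommitLock {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    @Override
    public void lock(String tableName) {
        locks.computeIfAbsent(tableName, k -> new ReentrantLock()).lock();
    }

    @Override
    public void unlock(String tableName) {
        locks.get(tableName).unlock();
    }
}

// Commit path that depends only on the CommitLock interface.
class SketchTableOperations {
    private final Map<String, String> catalog; // table name -> metadata location
    private final CommitLock commitLock;

    SketchTableOperations(Map<String, String> catalog, CommitLock commitLock) {
        this.catalog = catalog;
        this.commitLock = commitLock;
    }

    // Returns false if another writer changed the table since it was read.
    boolean commit(String tableName, String expectedLocation, String newLocation) {
        commitLock.lock(tableName);
        try {
            String current = catalog.get(tableName); // "get table"
            if (!Objects.equals(current, expectedLocation)) {
                return false;
            }
            catalog.put(tableName, newLocation); // "alter table"
            return true;
        } finally {
            commitLock.unlock(tableName);
        }
    }
}

class SketchTableOperationsDemo {
    public static void main(String[] args) {
        Map<String, String> catalog = new HashMap<>();
        catalog.put("db.tbl", "v1");
        SketchTableOperations ops = new SketchTableOperations(catalog, new InProcessCommitLock());
        System.out.println("commit from v1: " + ops.commit("db.tbl", "v1", "v2"));
        System.out.println("stale commit from v1: " + ops.commit("db.tbl", "v1", "v3"));
    }
}
```

With this shape, swapping the lock mechanism would only mean passing a different `CommitLock` to the constructor; the get/alter calls stay the same.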
The client source was made available for Glue now for reference, see announcement. AWSCatalogMetastoreClient implements Hive's IMetaStoreClient and delegates to the GlueMetastoreClientDelegate, although this only implements a subset of functionality, so lock for instance just throws an unsupported operation exception.
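For anyone skimming, the pattern described above looks roughly like this. It is a trimmed-down illustration with hypothetical names (`MiniMetastoreClient`, `MiniCatalogDelegate`, `PartialMetastoreClient`), not the actual Glue client source: table operations are forwarded to a delegate, and the lock call fails immediately.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, trimmed-down stand-in for a metastore client interface.
interface MiniMetastoreClient {
    String getTable(String name);
    void lock(String name);
}

// Hypothetical delegate that would wrap the remote catalog API; here it is
// just an in-memory map from table name to location.
class MiniCatalogDelegate {
    private final Map<String, String> tables = new HashMap<>();

    void putTable(String name, String location) {
        tables.put(name, location);
    }

    String getTable(String name) {
        return tables.get(name);
    }
}

// Forwards supported operations to the delegate and rejects the rest.
class PartialMetastoreClient implements MiniMetastoreClient {
    private final MiniCatalogDelegate delegate;

    PartialMetastoreClient(MiniCatalogDelegate delegate) {
        this.delegate = delegate;
    }

    @Override
    public String getTable(String name) {
        return delegate.getTable(name);
    }

    @Override
    public void lock(String name) {
        throw new UnsupportedOperationException("lock is not supported");
    }
}

class PartialMetastoreClientDemo {
    public static void main(String[] args) {
        MiniCatalogDelegate delegate = new MiniCatalogDelegate();
        delegate.putTable("db.tbl", "s3://bucket/path");
        MiniMetastoreClient client = new PartialMetastoreClient(delegate);
        System.out.println("table location: " + client.getTable("db.tbl"));
        try {
            client.lock("db.tbl");
        } catch (UnsupportedOperationException e) {
            System.out.println("lock failed: " + e.getMessage());
        }
    }
}
```

Any caller that takes the lock before altering the table would hit the exception before it gets to the alter call.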
I think that Glue should implement locking as required by the interface it exposes. I'd be fine adding a solution specific to Glue in Iceberg as well, but I'm not sure what that would look like. Good to know that Glue won't work though.
Looking into it a bit when running on Spark EMR for instance
I believe there is ongoing work to have the HiveMetaStoreClientFactory abstraction contributed to vanilla Apache Hive:
https://issues.apache.org/jira/browse/HIVE-12679