[WIP] HIVE-27473: Rewrite MetaStoreClients to be composable
Design document: Google docs
What changes were proposed in this pull request?
This patch refactors HiveMetaStoreClient and its subclasses in composable classes.
Why are the changes needed?
By making SessionHiveMetaStoreClient composable, this refactoring enables users to create and use custom Catalogs with Hive's session-level features, such as temporary tables.
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
This patch is currently in draft state, so no tests have been added yet.
The content in Google Doc is perfectly aligned with my memory. I'm still reading the proposed changes. I'd appreciate it if the latest dependencies between MetaStoreClients somewhere were written somewhere
@okumin, I have added an Appendix section to the design document, which summarizes the dependencies between MetaStoreClients as well as which classes in the PR codebase provide specific features compared to the master branch. I'm not entirely sure if this aligns with your request, so please let me know if it doesn't address what you were looking for.
@deniskuzZ, Thank you for looking over the PR. I think the suggested class names make sense, and I'll update the PR accordingly.
Regarding HiveMetaStoreClientWithTmpTable, my initial thought was that we should separate features as much as possible to allow custom catalog implementations to selectively include only what they need. However, I have reconsidered my thought and now believe that the features originating from SessionHiveMetaStoreClient and HiveMetaStoreClientWithLocalCache should always wrap the custom catalog. As described in the figure 2 of the design document, the next commit will include four layers (Hook, Session, LocalCache, Thrift) instead of five (Hook, TmpTable, Session, LocalCache, Thrift).
@okumin , @deniskuzZ , @zratkai, thank you for your review. I've updated the branch to address your feedback and improve code quality. The main changes are summarized as follows:
- Renamed the new classes
- Integrated the temporary table layer and session feature layer
- Updated {Session/LocalCaching}MetaStoreClientProxy to handle ValidTxnWriteIdList properly
- Modified SessionMetaStoreClientProxy to set processorCapability in advance, ensuring the cached key matches the actual request
- Changed the access level of TempTable from package-private to public, in order to place all proxy classes under the same directory
There are still several remaining issues, including SG:FIXME markers and some unchecked areas. In the latest commit, I focused on SessionMetaStoreClientProxy and plan to move on to the other proxy classes. I’d appreciate it if you could review the changes and share any considerations I may have missed.
I found Hive 4.0.0 has around 20 deprecated methods. Are we allowed to discontinue them before HIVE-27473? https://github.com/apache/hive/blob/rel/release-4.0.0/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java
Can we raise the proposal to the mailing list and tie-break the best approaches at this point? People in #4444 should be very interested. Then, we can invest our resources in reviewing this pull request.
The Option 2 of my original document might make integration clean. We will add a small change to Apache Hive, and third-party vendors implement the minimal number of methods. However, this approach doesn't allow existing integrations to run as they are.
If we can make Option 1 clean and the community people agree with it, I think many people would be happy. Additionally, the HiveMetaStoreClient family with complicated dependencies will be decomposed!
I found Hive 4.0.0 has around 20 deprecated methods. Are we allowed to discontinue them before HIVE-27473?
I think removing the deprecated methods would be helpful for simplifying the work in HIVE-27473. However, I'm not entirely sure about the standard process for removing deprecated methods in Hive. If their removal is acceptable at this stage, I would be supportive of proceeding in that direction and able to work for it.
Can we raise the proposal to the mailing list and tie-break the best approaches at this point? People in #4444 should be very interested. Then, we can invest our resources in reviewing this pull request.
The Option 2 of my original document might make integration clean. We will add a small change to Apache Hive, and third-party vendors implement the minimal number of methods. However, this approach doesn't allow existing integrations to run as they are.
If we can make Option 1 clean and the community people agree with it, I think many people would be happy. Additionally, the HiveMetaStoreClient family with complicated dependencies will be decomposed!
Sure, it would be nice to make an intermediate design decision at this point. Before raising the proposal on the mailing list, I have a small question about your comment: when you mentioned “existing integrations,” were you referring to integration tests, or to external integrations such as third-party catalogs? I’d like to make sure I fully understand the impact you're referring to, especially in terms of potential behavioral changes.
@ngsg
when you mentioned “existing integrations,” were you referring to integration tests, or to external integrations such as third-party catalogs?
I know Glue Data Catalog depends on the unmerged HIVE-12679, and I can infer that Databricks also uses or has used it from their release notes. Keeping the compatibility might help those users(it is also possible that they prefer a cleaner interface). I will involve them once we start the discussion.
I happened to find Apache Ranger also relies on IMetaStoreClient. I believe HIVE-27473 does not impact the feature, but I am leaving a comment nonetheless.
https://github.com/apache/ranger/blob/master/hive-agent/src/main/java/org/apache/ranger/authorization/hive/authorizer/RangerHiveAuthorizer.java
@okumin, I understand now. Thank you for the explanation. I’ve also started a discussion on the mailing list. It would be great if you could take a look and help enrich it with your insights. Thanks again for your support.
i think we can drop the deprecated methods in 4.1 and right now review and mark methods that we want to discontinue
Hey @ngsg, @okumin, since you are leading the custom catalogs support initiative, are there any blockers, what needs attention/review to push this forward?
@deniskuzZ Hi, thanks for asking. As everyone supports the current option, we can proceed with this. No one has raised specific concerns at this point.
@ngsg Could you rebase this PR? I probably have sufficient time this week and next week.
@okumin, I have rebased the PR.
@deniskuzZ, As far as I know, there are no blockers we need to consider at this point.
My review progress is visible here. https://gist.github.com/okumin/add5b97c3b6674820cf1aad433856e9f
@ngsg @deniskuzZ @dengzhhu653 While reviewing the pull request, I found MetaStoreFilterHook is used in 40+ places, and I guess we no longer need it. Do you think it is a reasonable option to discontinue it immediately?
Reason: MetaStoreFilterHook is declared as LimitedPrivate for Apache Sentry, which is already retired. At least, I don't see any use cases in Apache Hive(and Apache Ranger as well). It would simplify the entire dependency and reduce the risks where inconsistencies can unexpectedly cause a problem.
% git grep metastore.filter.hook
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java: METASTORE_FILTER_HOOK("hive.metastore.filter.hook", "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl",
data/files/rfc5424-hs2.log:<14>1 2019-03-21T07:06:19.799Z hiveserver2-99ddccbd8-49sbv hiveserver2 1 cb6c1a29-4ba7-11e9-83d0-02c17ec35ac0 [mdc@18060 class="metastore.HiveMetaStoreClient" level="INFO" thread="main"] Mestastore configuration metastore.filter.hook changed from org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook to org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl
itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java: conf.set("hive.metastore.filter.hook", "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl");
ql/src/test/queries/clientpositive/explainanalyze_3.q:set hive.metastore.filter.hook=org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl;
ql/src/test/queries/clientpositive/explainuser_3.q:set hive.metastore.filter.hook=org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl;
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java: FILTER_HOOK("metastore.filter.hook", "hive.metastore.filter.hook",
%
I suspended the review because there are some cases where SessionMetaStoreClientProxy is unlikely to be invoked. It would be great if we could discontinue MetaStoreFilterHook, or we probably have to fix them a bit.
Sorry, MetaStoreFilterHook is used as HiveMetaStoreAuthorizer or AuthorizationMetaStoreFilterHook. I was searching only for the standalone-metastore project. My bad
Sorry, MetaStoreFilterHook is used as HiveMetaStoreAuthorizer or AuthorizationMetaStoreFilterHook. I was searching only for the standalone-metastore project. My bad
Current we need this in the client side, IMO it's better to put the filter on the ThriftHiveMetaStoreClient to enforce the check
I prefer the way to enhance the client through each lawyer.
I would like to make the chain thread-safe by default(for client in each lawyer) for non-embedded Metastore, especially for the ThriftHiveMetaStoreClient, like:
https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveConnection.java#L2258-L2287
@ngsg @dengzhhu653 @deniskuzZ
I reviewed half of the methods and commented on where I found issues. My impression is that the number of methods is huge, and it is not trivial to find mistakes.
I found we typically have two types of problems.
To avoid them, I am wondering if we can have an abstract class that enforces two rules. This is a sample implementation.
- Rule A: Any subclass of IMetaStoreClient can implement the most generic variant of family methods. For example, if IMetaStoreClient has
getTableMeta(db, tbl, types)andgetTableMeta(catalog, db, tbl, types), vendors must implement onlygetTableMeta(catalog, db, tbl, types) - Rule B: All family methods behave consistently. For example, if IMetaStoreClient has
getTableMeta(db, tbl, types)andgetTableMeta(catalog, db, tbl, types), BOTH handle tmp tables or NEITHER handles tmp tables
From the point of view of vendors or Hive maintainers, they don't want to implement and maintain 5 types of getTable differently. However, we can't immediately discontinue obsolete variants from IMetaStoreClient, as Apache Spark or some projects depend on IMetaStoreClient or vendors really want to implement all variants one by one(maybe, no). That's why I am considering adding the layer as a helper abstract class.
In the new foundation, only the most generic method works as a bridge, and the other variants work just as aliases, which are correct in any layer. We can asynchronously discontinue legacy variants.
P.S. Anyway, I'm still reviewing the current PR, as it might not be easy to reach a consensus to add one more refactoring ticket.
I checked all methods and commented on all places where I had questions
@okumin, @dengzhhu653,
Thank you for your exhaustive reviews. Based on your comments, I summarize the remaining tasks as follows:
- Interface normalization issue
- FilterHook-related issues
- Thread-safety issue (particularly for non-embedded Metastore)
- General issues, including your review comments, failed test cases, and remaining FIXMEs
Unfortunately, I won’t be able to make significant progress on HIVE-27473 in the near future. I’ll do my best to return to it as soon as possible. Thanks again for your valuable reviews.
To avoid them, I am wondering if we can have an abstract class that enforces two rules. This is a sample implementation.
- Rule A: Any subclass of IMetaStoreClient can implement the most generic variant of family methods. For example, if IMetaStoreClient has
getTableMeta(db, tbl, types)andgetTableMeta(catalog, db, tbl, types), vendors must implement onlygetTableMeta(catalog, db, tbl, types)- Rule B: All family methods behave consistently. For example, if IMetaStoreClient has
getTableMeta(db, tbl, types)andgetTableMeta(catalog, db, tbl, types), BOTH handle tmp tables or NEITHER handles tmp tables
I also thought of a similar approach when I was writing the PoC, and I agree that the two rules you stated must be upheld. Let me check the code again and try to implement your idea in the next commit.
@dengzhhu653,
Sorry, MetaStoreFilterHook is used as HiveMetaStoreAuthorizer or AuthorizationMetaStoreFilterHook. I was searching only for the standalone-metastore project. My bad
Current we need this in the client side, IMO it's better to put the filter on the ThriftHiveMetaStoreClient to enforce the check
I think it is OK to place MetaStoreFilterHook on HookMetaStoreClientProxy. We can keep the current authorization step by consistently wrapping the inner layer by HookMetaStoreClientProxy, and this approach can also support future extensions for third-party catalogs. However, I'm open to revisiting this if you still have concerns, as I'm not fully confident in these security related features.
I prefer the way to enhance the client through each lawyer. I would like to make the chain thread-safe by default(for client in each lawyer) for non-embedded Metastore, especially for the
ThriftHiveMetaStoreClient, like: https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveConnection.java#L2258-L2287
In the current implementation, method invocations on any MetaStoreClient connected to a remote MetaStore are guarded by a synchronized block, as in your example. Therefore, I think MetaStoreClient with non-embeded MetaStore is thread-safe.
cf. https://github.com/apache/hive/blob/459b1e5b3870393adae1b7e527ce856fc1d3f162/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L6118-L6120
@okumin,
I introduced a new abstract class NormalizedMetaStoreClient, which eliminates nearly all overloaded method, following the approach you suggested. All IMetaStoreClient implementations, including ThriftHiveMetaStoreClient, now extend NormalizedMetaStoreClient, and much of the redundant logic for setting default catalog has been removed.
Regarding your review comments that I haven't responded to directly, I believe this change addresses the issues raised in them.
(The original review comment is here.)
Thanks. As we are cutting 4.1 and we are moving to 5.0 next, I would say we can remove some deprecated methods.
Thanks. As we are cutting 4.1 and we are moving to 5.0 next, I would say we can remove some deprecated methods.
I got it. Let me make an additional commit that drops some(or possibly all) deprecated methods in IMetaStoreClient and check whether any issues arise.
I have removed 13 deprecated methods from IMetaStoreClient and updated the affected test cases accordingly. There are still 20 deprecated methods remaining, which are widely used in the current codebase. (There is even a dedicated test for deprecated calls, cf. TestStats.) I believe removing those is beyond the scope of this JIRA and would be better handled in a separate ticket.
Awesome job, @ngsg ! left a few comments