hive icon indicating copy to clipboard operation
hive copied to clipboard

[WIP] HIVE-27473: Rewrite MetaStoreClients to be composable

Open ngsg opened this issue 8 months ago • 25 comments

Design document: Google docs

What changes were proposed in this pull request?

This patch refactors HiveMetaStoreClient and its subclasses in composable classes.

Why are the changes needed?

By making SessionHiveMetaStoreClient composable, this refactoring enables users to create and use custom Catalogs with Hive's session-level features, such as temporary tables.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

This patch is currently in draft state, so no tests have been added yet.

ngsg avatar Apr 14 '25 08:04 ngsg

The content in Google Doc is perfectly aligned with my memory. I'm still reading the proposed changes. I'd appreciate it if the latest dependencies between MetaStoreClients somewhere were written somewhere

okumin avatar Apr 14 '25 13:04 okumin

@okumin, I have added an Appendix section to the design document, which summarizes the dependencies between MetaStoreClients as well as which classes in the PR codebase provide specific features compared to the master branch. I'm not entirely sure if this aligns with your request, so please let me know if it doesn't address what you were looking for.

ngsg avatar Apr 15 '25 07:04 ngsg

@deniskuzZ, Thank you for looking over the PR. I think the suggested class names make sense, and I'll update the PR accordingly.

Regarding HiveMetaStoreClientWithTmpTable, my initial thought was that we should separate features as much as possible to allow custom catalog implementations to selectively include only what they need. However, I have reconsidered my thought and now believe that the features originating from SessionHiveMetaStoreClient and HiveMetaStoreClientWithLocalCache should always wrap the custom catalog. As described in the figure 2 of the design document, the next commit will include four layers (Hook, Session, LocalCache, Thrift) instead of five (Hook, TmpTable, Session, LocalCache, Thrift).

ngsg avatar Apr 15 '25 07:04 ngsg

@okumin , @deniskuzZ , @zratkai, thank you for your review. I've updated the branch to address your feedback and improve code quality. The main changes are summarized as follows:

  • Renamed the new classes
  • Integrated the temporary table layer and session feature layer
  • Updated {Session/LocalCaching}MetaStoreClientProxy to handle ValidTxnWriteIdList properly
  • Modified SessionMetaStoreClientProxy to set processorCapability in advance, ensuring the cached key matches the actual request
  • Changed the access level of TempTable from package-private to public, in order to place all proxy classes under the same directory

There are still several remaining issues, including SG:FIXME markers and some unchecked areas. In the latest commit, I focused on SessionMetaStoreClientProxy and plan to move on to the other proxy classes. I’d appreciate it if you could review the changes and share any considerations I may have missed.

ngsg avatar Apr 22 '25 12:04 ngsg

I found Hive 4.0.0 has around 20 deprecated methods. Are we allowed to discontinue them before HIVE-27473? https://github.com/apache/hive/blob/rel/release-4.0.0/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java

okumin avatar Apr 23 '25 11:04 okumin

Can we raise the proposal to the mailing list and tie-break the best approaches at this point? People in #4444 should be very interested. Then, we can invest our resources in reviewing this pull request.

The Option 2 of my original document might make integration clean. We will add a small change to Apache Hive, and third-party vendors implement the minimal number of methods. However, this approach doesn't allow existing integrations to run as they are.

If we can make Option 1 clean and the community people agree with it, I think many people would be happy. Additionally, the HiveMetaStoreClient family with complicated dependencies will be decomposed!

okumin avatar Apr 23 '25 11:04 okumin

I found Hive 4.0.0 has around 20 deprecated methods. Are we allowed to discontinue them before HIVE-27473?

I think removing the deprecated methods would be helpful for simplifying the work in HIVE-27473. However, I'm not entirely sure about the standard process for removing deprecated methods in Hive. If their removal is acceptable at this stage, I would be supportive of proceeding in that direction and able to work for it.

ngsg avatar Apr 24 '25 05:04 ngsg

Can we raise the proposal to the mailing list and tie-break the best approaches at this point? People in #4444 should be very interested. Then, we can invest our resources in reviewing this pull request.

The Option 2 of my original document might make integration clean. We will add a small change to Apache Hive, and third-party vendors implement the minimal number of methods. However, this approach doesn't allow existing integrations to run as they are.

If we can make Option 1 clean and the community people agree with it, I think many people would be happy. Additionally, the HiveMetaStoreClient family with complicated dependencies will be decomposed!

Sure, it would be nice to make an intermediate design decision at this point. Before raising the proposal on the mailing list, I have a small question about your comment: when you mentioned “existing integrations,” were you referring to integration tests, or to external integrations such as third-party catalogs? I’d like to make sure I fully understand the impact you're referring to, especially in terms of potential behavioral changes.

ngsg avatar Apr 24 '25 05:04 ngsg

@ngsg

when you mentioned “existing integrations,” were you referring to integration tests, or to external integrations such as third-party catalogs?

I know Glue Data Catalog depends on the unmerged HIVE-12679, and I can infer that Databricks also uses or has used it from their release notes. Keeping the compatibility might help those users(it is also possible that they prefer a cleaner interface). I will involve them once we start the discussion.

okumin avatar Apr 28 '25 08:04 okumin

I happened to find Apache Ranger also relies on IMetaStoreClient. I believe HIVE-27473 does not impact the feature, but I am leaving a comment nonetheless.

https://github.com/apache/ranger/blob/master/hive-agent/src/main/java/org/apache/ranger/authorization/hive/authorizer/RangerHiveAuthorizer.java

okumin avatar Apr 28 '25 08:04 okumin

@okumin, I understand now. Thank you for the explanation. I’ve also started a discussion on the mailing list. It would be great if you could take a look and help enrich it with your insights. Thanks again for your support.

ngsg avatar Apr 28 '25 10:04 ngsg

i think we can drop the deprecated methods in 4.1 and right now review and mark methods that we want to discontinue

deniskuzZ avatar May 09 '25 11:05 deniskuzZ

Hey @ngsg, @okumin, since you are leading the custom catalogs support initiative, are there any blockers, what needs attention/review to push this forward?

deniskuzZ avatar May 09 '25 11:05 deniskuzZ

@deniskuzZ Hi, thanks for asking. As everyone supports the current option, we can proceed with this. No one has raised specific concerns at this point.

@ngsg Could you rebase this PR? I probably have sufficient time this week and next week.

okumin avatar May 12 '25 09:05 okumin

@okumin, I have rebased the PR.

@deniskuzZ, As far as I know, there are no blockers we need to consider at this point.

ngsg avatar May 12 '25 12:05 ngsg

My review progress is visible here. https://gist.github.com/okumin/add5b97c3b6674820cf1aad433856e9f

@ngsg @deniskuzZ @dengzhhu653 While reviewing the pull request, I found MetaStoreFilterHook is used in 40+ places, and I guess we no longer need it. Do you think it is a reasonable option to discontinue it immediately?

Reason: MetaStoreFilterHook is declared as LimitedPrivate for Apache Sentry, which is already retired. At least, I don't see any use cases in Apache Hive(and Apache Ranger as well). It would simplify the entire dependency and reduce the risks where inconsistencies can unexpectedly cause a problem.

% git grep metastore.filter.hook       
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:    METASTORE_FILTER_HOOK("hive.metastore.filter.hook", "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl",
data/files/rfc5424-hs2.log:<14>1 2019-03-21T07:06:19.799Z hiveserver2-99ddccbd8-49sbv hiveserver2 1 cb6c1a29-4ba7-11e9-83d0-02c17ec35ac0 [mdc@18060 class="metastore.HiveMetaStoreClient" level="INFO" thread="main"] Mestastore configuration metastore.filter.hook changed from org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook to org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl
itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java:    conf.set("hive.metastore.filter.hook", "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl");
ql/src/test/queries/clientpositive/explainanalyze_3.q:set hive.metastore.filter.hook=org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl;
ql/src/test/queries/clientpositive/explainuser_3.q:set hive.metastore.filter.hook=org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl;
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java:    FILTER_HOOK("metastore.filter.hook", "hive.metastore.filter.hook",
%

okumin avatar May 14 '25 05:05 okumin

I suspended the review because there are some cases where SessionMetaStoreClientProxy is unlikely to be invoked. It would be great if we could discontinue MetaStoreFilterHook, or we probably have to fix them a bit.

okumin avatar May 14 '25 06:05 okumin

Sorry, MetaStoreFilterHook is used as HiveMetaStoreAuthorizer or AuthorizationMetaStoreFilterHook. I was searching only for the standalone-metastore project. My bad

okumin avatar May 14 '25 09:05 okumin

Sorry, MetaStoreFilterHook is used as HiveMetaStoreAuthorizer or AuthorizationMetaStoreFilterHook. I was searching only for the standalone-metastore project. My bad

Current we need this in the client side, IMO it's better to put the filter on the ThriftHiveMetaStoreClient to enforce the check

dengzhhu653 avatar May 15 '25 03:05 dengzhhu653

I prefer the way to enhance the client through each lawyer. I would like to make the chain thread-safe by default(for client in each lawyer) for non-embedded Metastore, especially for the ThriftHiveMetaStoreClient, like: https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveConnection.java#L2258-L2287

dengzhhu653 avatar May 15 '25 03:05 dengzhhu653

@ngsg @dengzhhu653 @deniskuzZ

I reviewed half of the methods and commented on where I found issues. My impression is that the number of methods is huge, and it is not trivial to find mistakes.

I found we typically have two types of problems.

image

To avoid them, I am wondering if we can have an abstract class that enforces two rules. This is a sample implementation.

  • Rule A: Any subclass of IMetaStoreClient can implement the most generic variant of family methods. For example, if IMetaStoreClient has getTableMeta(db, tbl, types) and getTableMeta(catalog, db, tbl, types), vendors must implement only getTableMeta(catalog, db, tbl, types)
  • Rule B: All family methods behave consistently. For example, if IMetaStoreClient has getTableMeta(db, tbl, types) and getTableMeta(catalog, db, tbl, types), BOTH handle tmp tables or NEITHER handles tmp tables

From the point of view of vendors or Hive maintainers, they don't want to implement and maintain 5 types of getTable differently. However, we can't immediately discontinue obsolete variants from IMetaStoreClient, as Apache Spark or some projects depend on IMetaStoreClient or vendors really want to implement all variants one by one(maybe, no). That's why I am considering adding the layer as a helper abstract class.

In the new foundation, only the most generic method works as a bridge, and the other variants work just as aliases, which are correct in any layer. We can asynchronously discontinue legacy variants.

image

P.S. Anyway, I'm still reviewing the current PR, as it might not be easy to reach a consensus to add one more refactoring ticket.

okumin avatar May 15 '25 06:05 okumin

I checked all methods and commented on all places where I had questions

okumin avatar May 15 '25 10:05 okumin

@okumin, @dengzhhu653,

Thank you for your exhaustive reviews. Based on your comments, I summarize the remaining tasks as follows:

  1. Interface normalization issue
  2. FilterHook-related issues
  3. Thread-safety issue (particularly for non-embedded Metastore)
  4. General issues, including your review comments, failed test cases, and remaining FIXMEs

Unfortunately, I won’t be able to make significant progress on HIVE-27473 in the near future. I’ll do my best to return to it as soon as possible. Thanks again for your valuable reviews.

ngsg avatar May 21 '25 15:05 ngsg

To avoid them, I am wondering if we can have an abstract class that enforces two rules. This is a sample implementation.

  • Rule A: Any subclass of IMetaStoreClient can implement the most generic variant of family methods. For example, if IMetaStoreClient has getTableMeta(db, tbl, types) and getTableMeta(catalog, db, tbl, types), vendors must implement only getTableMeta(catalog, db, tbl, types)
  • Rule B: All family methods behave consistently. For example, if IMetaStoreClient has getTableMeta(db, tbl, types) and getTableMeta(catalog, db, tbl, types), BOTH handle tmp tables or NEITHER handles tmp tables

I also thought of a similar approach when I was writing the PoC, and I agree that the two rules you stated must be upheld. Let me check the code again and try to implement your idea in the next commit.

ngsg avatar May 21 '25 15:05 ngsg

@dengzhhu653,

Sorry, MetaStoreFilterHook is used as HiveMetaStoreAuthorizer or AuthorizationMetaStoreFilterHook. I was searching only for the standalone-metastore project. My bad

Current we need this in the client side, IMO it's better to put the filter on the ThriftHiveMetaStoreClient to enforce the check

I think it is OK to place MetaStoreFilterHook on HookMetaStoreClientProxy. We can keep the current authorization step by consistently wrapping the inner layer by HookMetaStoreClientProxy, and this approach can also support future extensions for third-party catalogs. However, I'm open to revisiting this if you still have concerns, as I'm not fully confident in these security related features.


I prefer the way to enhance the client through each lawyer. I would like to make the chain thread-safe by default(for client in each lawyer) for non-embedded Metastore, especially for the ThriftHiveMetaStoreClient, like: https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveConnection.java#L2258-L2287

In the current implementation, method invocations on any MetaStoreClient connected to a remote MetaStore are guarded by a synchronized block, as in your example. Therefore, I think MetaStoreClient with non-embeded MetaStore is thread-safe.

cf. https://github.com/apache/hive/blob/459b1e5b3870393adae1b7e527ce856fc1d3f162/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L6118-L6120

ngsg avatar Jun 23 '25 07:06 ngsg

@okumin,

I introduced a new abstract class NormalizedMetaStoreClient, which eliminates nearly all overloaded method, following the approach you suggested. All IMetaStoreClient implementations, including ThriftHiveMetaStoreClient, now extend NormalizedMetaStoreClient, and much of the redundant logic for setting default catalog has been removed. Regarding your review comments that I haven't responded to directly, I believe this change addresses the issues raised in them.

(The original review comment is here.)

ngsg avatar Jun 23 '25 08:06 ngsg

Thanks. As we are cutting 4.1 and we are moving to 5.0 next, I would say we can remove some deprecated methods.

okumin avatar Jun 23 '25 13:06 okumin

Thanks. As we are cutting 4.1 and we are moving to 5.0 next, I would say we can remove some deprecated methods.

I got it. Let me make an additional commit that drops some(or possibly all) deprecated methods in IMetaStoreClient and check whether any issues arise.

ngsg avatar Jun 24 '25 01:06 ngsg

I have removed 13 deprecated methods from IMetaStoreClient and updated the affected test cases accordingly. There are still 20 deprecated methods remaining, which are widely used in the current codebase. (There is even a dedicated test for deprecated calls, cf. TestStats.) I believe removing those is beyond the scope of this JIRA and would be better handled in a separate ticket.

ngsg avatar Jun 25 '25 05:06 ngsg

Awesome job, @ngsg ! left a few comments

deniskuzZ avatar Jun 27 '25 14:06 deniskuzZ