hudi [HUDI-3654] Add new module `hudi-metaserver`

What is the purpose of the pull request

Add a new module hudi-metastore

Brief change log

HoodieMetastoreBasedTimeline
HoodieMetastoreFileSystemView
MetaStore has three parts:

client: connects to the server via Thrift
service: divided into tableService, partitionService, timelineService and snapshotService
store: a relational-database-based implementation built on MyBatis
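The layering above can be illustrated with a minimal Python sketch. It is not the PR's code: the class names `InMemoryStore`, `TimelineService` and `MetastoreClient`, the table name and the instant string are all hypothetical, the Thrift hop is replaced by a direct method call, and the relational store is replaced by a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InMemoryStore:
    """Stand-in for the store layer (the PR uses a relational DB via MyBatis)."""
    instants: Dict[str, List[str]] = field(default_factory=dict)

    def append_instant(self, table: str, instant: str) -> None:
        self.instants.setdefault(table, []).append(instant)

    def list_instants(self, table: str) -> List[str]:
        return list(self.instants.get(table, []))


class TimelineService:
    """Stand-in for one service; holds the validation logic, delegates storage."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store

    def create_instant(self, table: str, instant: str) -> None:
        if instant in self._store.list_instants(table):
            raise ValueError(f"instant {instant} already exists for {table}")
        self._store.append_instant(table, instant)

    def list_instants(self, table: str) -> List[str]:
        return self._store.list_instants(table)


class MetastoreClient:
    """Stand-in for the client; a real client would reach the server over Thrift."""

    def __init__(self, service: TimelineService) -> None:
        self._service = service  # assumption: direct call instead of an RPC hop

    def commit(self, table: str, instant: str) -> List[str]:
        self._service.create_instant(table, instant)
        return self._service.list_instants(table)


client = MetastoreClient(TimelineService(InMemoryStore()))
timeline = client.commit("db.tbl", "20220318010300.commit")
```

The point of the sketch is only the direction of the calls: the client never touches the store, and the service is the single place where rules such as duplicate-instant checks would live.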

Writing a commit/deltacommit is supported, but reading is not ready yet.

Verify this pull request

For tests, an embedded metastore is started.

This change added tests and can be verified as follows:

Add metastore client unit tests
Add metastore store unit tests
Add a COW write case based on the metastore

Committer checklist

[ ] Has a corresponding JIRA in PR title & commit
[ ] Commit message is descriptive of the change
[ ] CI is green
[ ] Necessary doc changes done or have another open PR
[ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Mar 18 '22 01:03 minihippo

Add catalogName parameter to MetadataStore interface method

Mar 18 '22 05:03 melin

great feature !!!

Mar 18 '22 08:03 xiarixiaoyao

@xiarixiaoyao: can you review this when you get a chance? I have assigned it to myself as well, so I will try to review it in a week's time.

Apr 15 '22 20:04 nsivabalan

@minihippo could you pls rebase the code and run azure again, thanks

Apr 16 '22 02:04 xiarixiaoyao

The CI environment lacks thrift under /usr/local/bin/thrift, so hudi-metastore can't be compiled. I pushed the compiled Thrift classes as a temporary workaround to pass CI.

May 18 '22 14:05 minihippo

@xiarixiaoyao It seems the code review isn't finished yet, right?

May 30 '22 07:05 minihippo

@minihippo Just commented on the deps and code changes to existing files. Could you please share how this is being tested, and maybe add a small README for building this PR locally and working through an example end-to-end?

Trying to understand what's in scope for this PR: the entire metaserver module, or is this just the start of a series of PRs?

Sorry for missing the question. I will add a README, and this is just the start of a series of PRs. The next one will support snapshot creation.

Jul 18 '22 15:07 minihippo

@hudi-bot run azure

Jul 20 '22 15:07 minihippo

Why do we implement a thrift RPC between the client and server?

Hi @prasannarajaperumal, thanks for reviewing.

gRPC and HTTP were both on my list. Between HTTP and RPC, RPC is more efficient: it has a compact encoding and avoids the redundant protocol overhead that HTTP carries. Client and server both have to align with the interfaces and entities that the RPC definition declares, so it is more controllable, and developers don't need to fully understand the underlying data transfer details. For metadata transfer, RPC is much better. Between gRPC and Thrift, Thrift is stable, common and widely used in the major open source frameworks. It supports multiple languages and has better performance than gRPC.
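To make the "compact encoding" argument concrete, here is a small Python sketch that compares a self-describing JSON payload with a schema-driven binary layout. It is an illustration only: the record and its field names are hypothetical, and the hand-rolled `struct` layout is not the Thrift wire format, it merely shares the idea that both sides agree on the schema up front so field names need not travel with each message.

```python
import json
import struct

# Hypothetical file-metadata record; the field names are illustrative only.
record = {"partition": "2022/09/06", "file_name": "f1.parquet", "size": 1048576}

# Text encoding: field names and punctuation travel with every message.
as_json = json.dumps(record).encode("utf-8")

STRING_FIELDS = ("partition", "file_name")  # schema agreed on by both sides


def encode_binary(rec: dict) -> bytes:
    """Length-prefixed UTF-8 strings followed by a fixed 8-byte signed size."""
    out = b""
    for key in STRING_FIELDS:
        raw = rec[key].encode("utf-8")
        out += struct.pack(">H", len(raw)) + raw
    return out + struct.pack(">q", rec["size"])


def decode_binary(buf: bytes) -> dict:
    """Inverse of encode_binary; relies on the same agreed field order."""
    rec, pos = {}, 0
    for key in STRING_FIELDS:
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        rec[key] = buf[pos:pos + length].decode("utf-8")
        pos += length
    (rec["size"],) = struct.unpack_from(">q", buf, pos)
    return rec


as_binary = encode_binary(record)
```

The binary form round-trips to the same record while being shorter than the JSON form; the trade-off is that neither side can read it without the shared schema, which is the "client and server both have to align" constraint mentioned above.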

Sep 06 '22 17:09 minihippo

We have state stored in file system / relational tables and we can have 2 clear implementations of interfaces that enable both.

Actually, we store all information in the metaserver, i.e. the timeline and file info (file name, size, the partition it belongs to, etc.). The server's storage is pluggable: it can be relational tables, a file system, or both, depending on the characteristics of the metadata and how we use it.
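A rough Python sketch of what "pluggable storage" could look like: one interface with a relational and a file-system implementation behind it. All names here (`TimelineStore`, `RelationalStore`, `FileSystemStore`, the `instant` table) are hypothetical and not taken from the PR; SQLite and a temporary directory stand in for the real backends.

```python
import sqlite3
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class TimelineStore(ABC):
    """Hypothetical storage interface the server-side services would depend on."""

    @abstractmethod
    def add_instant(self, table: str, instant: str) -> None: ...

    @abstractmethod
    def instants(self, table: str) -> List[str]: ...


class RelationalStore(TimelineStore):
    """Keeps instants as rows in a relational table (in-memory SQLite here)."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE instant (tbl TEXT, ts TEXT)")

    def add_instant(self, table: str, instant: str) -> None:
        self._conn.execute("INSERT INTO instant VALUES (?, ?)", (table, instant))

    def instants(self, table: str) -> List[str]:
        rows = self._conn.execute(
            "SELECT ts FROM instant WHERE tbl = ? ORDER BY ts", (table,)
        )
        return [row[0] for row in rows]


class FileSystemStore(TimelineStore):
    """Keeps instants as empty marker files under <root>/<table>/."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def add_instant(self, table: str, instant: str) -> None:
        table_dir = self._root / table
        table_dir.mkdir(parents=True, exist_ok=True)
        (table_dir / instant).touch()

    def instants(self, table: str) -> List[str]:
        table_dir = self._root / table
        if not table_dir.is_dir():
            return []
        return sorted(p.name for p in table_dir.iterdir())


def record_commits(store: TimelineStore) -> List[str]:
    """Caller code is identical regardless of which backend is plugged in."""
    store.add_instant("tbl", "002.commit")
    store.add_instant("tbl", "001.commit")
    return store.instants("tbl")


relational_result = record_commits(RelationalStore())
with tempfile.TemporaryDirectory() as tmp:
    fs_result = record_commits(FileSystemStore(Path(tmp)))
```

Both backends return the same ordered timeline, which is the property that lets the choice of storage follow the characteristics of each kind of metadata.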

Sep 06 '22 17:09 minihippo

Why not just use plain JDBC - client/server is built into the JDBC protocol?

Do you mean why it was necessary to bring in MyBatis? It eliminates almost all of the JDBC code and the manual setting of parameters and retrieval of results.

Developers only need to focus on the SQL logic
It's convenient for code maintenance

SQL is stored outside of the code, which makes it more reusable and easier to maintain. MyBatis has good support for dynamic SQL and for preventing SQL injection.
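The two ideas in this answer, statements kept apart from the calling code and parameters bound rather than concatenated, can be sketched in Python with the standard `sqlite3` module. This is not MyBatis: the `STATEMENTS` dictionary only stands in for a mapper file, and the table `tbl` and its columns are hypothetical.

```python
import sqlite3

# Statements live in one mapping, separate from the calling logic.
# (A stand-in for externalized mapper definitions; names are illustrative.)
STATEMENTS = {
    "create_table": "CREATE TABLE tbl (name TEXT PRIMARY KEY, location TEXT)",
    "insert_table": "INSERT INTO tbl (name, location) VALUES (:name, :location)",
    "select_table": "SELECT name, location FROM tbl WHERE name = :name",
}

conn = sqlite3.connect(":memory:")
conn.execute(STATEMENTS["create_table"])
conn.execute(STATEMENTS["insert_table"], {"name": "t1", "location": "/warehouse/t1"})


def find_table(name: str):
    """Bind the value as a named parameter; it is never spliced into the SQL text."""
    return conn.execute(STATEMENTS["select_table"], {"name": name}).fetchall()


rows = find_table("t1")
# A classic injection attempt is treated as a literal table name and matches nothing.
injected = find_table("t1' OR '1'='1")
```

Because the value is bound instead of concatenated, the injection string cannot change the shape of the query, and the SQL text itself can be reviewed or reused without reading the caller.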

Sep 06 '22 17:09 minihippo

@prasannarajaperumal Other than the RPC protocol consideration that @minihippo mentioned, with the generated models we'll gain flexibility in adapting to different metastores / catalogs like AWS Glue, DataHub, etc. for sync purposes. I discussed with @minihippo separately having something like hudi-metastore-proxy-bundle.jar to sync to those catalogs, which can consolidate the existing sync tools via common standardized models. This hasn't been added to the current RFC doc. @minihippo is working on publishing an updated RFC including all the planned future capabilities.

Sep 07 '22 04:09 xushiyan

CI report:

53aa21bf23d2f8b0404743e6d016cfb2fac444f7 UNKNOWN
049b3baf0decd49a29dd96b73acbba6acb4d7997 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

Sep 25 '22 11:09 hudi-bot

Is there a conflict between using the metaserver and Hudi metadata, for example when using them at the same time? Hudi's metadata now supports not only FILES but also COLUMN_STATS and BLOOM_FILTERS. Currently, the metaserver only supports listing files and listing partitions. Will it cover the others as well?

Sep 30 '22 06:09 Zouxxyy

Hi @prasannarajaperumal, following the comments on interface abstraction, I added the Hudi catalog design to the RFC-36 design doc: https://github.com/apache/hudi/pull/4718. To speed up landing the initial PR, and considering that a complete refactor would touch many basic entities and require detailed code review and discussion, this PR will only do a partial refactor, as shown in the attached image (hudicatalog). What do you think?

Oct 19 '22 02:10 minihippo

Hi @minihippo Picking this up again. Would appreciate a quick overview of the current status and what functionality is working as of this PR.

Nov 28 '22 14:11 vinothchandar

@hudi-bot run azure

Jan 09 '23 03:01 minihippo

CI passed

Jan 16 '23 05:01 xushiyan
