[HUDI-3654] Add new module `hudi-metaserver`
What is the purpose of the pull request
Add a new module hudi-metastore
Brief change log
- HoodieMetastoreBasedTimeline
- HoodieMetastoreFileSystemView
- MetaStore has three parts:
  - client, connects to the server over Thrift
  - service, is divided into tableService, partitionService, timelineService and snapshotService
  - store, has a relational-database-based implementation built on MyBatis
Writing a commit/deltacommit is supported, but reads are not ready yet.
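The three parts listed above can be pictured with a minimal sketch. Every name below (`TimelineStore`, `TimelineService`, `MetaserverClientSketch`) is hypothetical and not taken from the PR; an in-memory map stands in for the MyBatis-backed store, and a direct method call stands in for the Thrift connection.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical store layer. The PR backs this with a relational DB through
// MyBatis; an in-memory map stands in for it here.
interface TimelineStore {
    void saveInstant(String table, String instant);
    List<String> listInstants(String table);
}

class InMemoryTimelineStore implements TimelineStore {
    private final Map<String, List<String>> instantsByTable = new HashMap<>();

    @Override
    public void saveInstant(String table, String instant) {
        instantsByTable.computeIfAbsent(table, t -> new ArrayList<>()).add(instant);
    }

    @Override
    public List<String> listInstants(String table) {
        // Return a copy so callers cannot mutate the stored timeline.
        return new ArrayList<>(instantsByTable.getOrDefault(table, Collections.emptyList()));
    }
}

// Hypothetical service layer: validates the request, then delegates to the store.
class TimelineService {
    private final TimelineStore store;

    TimelineService(TimelineStore store) {
        this.store = store;
    }

    void commit(String table, String instantTime, String action) {
        if (!action.equals("commit") && !action.equals("deltacommit")) {
            throw new IllegalArgumentException("unsupported action: " + action);
        }
        store.saveInstant(table, instantTime + "." + action);
    }

    List<String> timeline(String table) {
        return store.listInstants(table);
    }
}

// Hypothetical client. In the PR the client reaches the server over Thrift;
// this sketch calls the service directly to stay self-contained.
class MetaserverClientSketch {
    private final TimelineService service;

    MetaserverClientSketch(TimelineService service) {
        this.service = service;
    }

    void commit(String table, String instantTime, String action) {
        service.commit(table, instantTime, action);
    }

    List<String> timeline(String table) {
        return service.timeline(table);
    }

    public static void main(String[] args) {
        MetaserverClientSketch client =
            new MetaserverClientSketch(new TimelineService(new InMemoryTimelineStore()));
        client.commit("trips", "20220101000000", "commit");
        System.out.println(client.timeline("trips"));
    }
}
```

Keeping the store behind an interface is what would let a test swap in an embedded or in-memory backend without touching the service or client code.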
Verify this pull request
For tests, the metastore starts up in embedded mode.
This change added tests and can be verified as follows:
- Add metastore client unit tests
- Add metastore store unit tests
- Add a case of COW writing based on the metastore
Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
Add catalogName parameter to MetadataStore interface method
great feature !!!
@xiarixiaoyao: can you review this when you get a chance? I have assigned it to myself as well, so I will try to review it in a week's time.
@minihippo could you pls rebase the code and run azure again, thanks
The CI environment lacks thrift under /usr/local/bin/thrift, so hudi-metastore can't be compiled. I pushed the compiled Thrift classes as a temporary workaround to pass CI.
@xiarixiaoyao It seems the code review isn't finished yet, right?
@minihippo Just commented on the deps and code changes on existing files. Could you please share how this is being tested, and maybe add a small README for building this PR locally and working through an example end-to-end?
Trying to understand what's in scope for this PR: the entire metaserver module, or is this just the start of a series of PRs?
Sorry for missing the question. I will add a README, and this is just the start of a series of PRs. The next one will support snapshot creation.
@hudi-bot run azure
Why do we implement a thrift RPC between the client and server?
Hi @prasannarajaperumal, thanks for reviewing.
gRPC and HTTP were also on my list. Between HTTP and RPC, RPC is more efficient: it has a compact encoding and avoids the redundant protocol overhead of HTTP. The client and server both have to conform to the interfaces and entities that the RPC definition declares, so it is more controllable, and developers don't need to fully understand the underlying data transfer details. For metadata transfer, RPC is much better. Between gRPC and Thrift, Thrift is stable, common, and widely used in the main open source frameworks. It supports multiple languages and has better performance than gRPC.
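To make the "compact encoding" point concrete, here is a minimal sketch of zigzag plus variable-length integer encoding, the scheme Thrift's compact protocol uses for integers. It assumes the compact protocol is what is meant; the PR does not say which Thrift protocol it configures, and `VarintSketch` is a hypothetical name, not Thrift's own code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the Thrift library implementation.
class VarintSketch {
    // Zigzag maps signed values to unsigned ones so small negatives stay small.
    static long zigzag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Emit 7 bits per byte, setting the high bit while more bytes follow.
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long v = zigzag(value);
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    // Reassemble the 7-bit groups, then undo the zigzag mapping.
    static long decodeVarint(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return (result >>> 1) ^ -(result & 1);
    }

    public static void main(String[] args) {
        long fileSize = 134217728L; // a 128 MiB file size, as an example field
        int binaryBytes = encodeVarint(fileSize).length;
        int textBytes = Long.toString(fileSize).getBytes(StandardCharsets.UTF_8).length;
        System.out.println("varint bytes: " + binaryBytes + ", decimal text bytes: " + textBytes);
    }
}
```

A text-based payload also carries field names and delimiters on top of the digits, so the gap per field is usually wider than the raw number comparison suggests.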
We have state stored in file system / relational tables and we can have 2 clear implementations of interfaces that enable both.
Actually, we store all information in the metaserver, i.e. the timeline and file info (file name, size, the partition it belongs to, etc.). The storage behind the server is flexible: it can be relational tables, the file system, or both, depending on the characteristics of the metadata and how we use it.
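One way to picture "relational tables, file system, or both" is a router that sends each kind of metadata to its own backend. This is a sketch under assumptions: `MetadataStorage`, `MetadataKind`, `MapBackedStorage` and `RoutingStorage` are hypothetical names, and both backends are in-memory stand-ins rather than a real database or file system.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical abstraction over where the metaserver persists metadata.
interface MetadataStorage {
    void put(String key, String value);
    Optional<String> get(String key);
}

// In-memory stand-in used for both the "relational" and "file system" backends.
class MapBackedStorage implements MetadataStorage {
    private final Map<String, String> data = new HashMap<>();

    @Override
    public void put(String key, String value) {
        data.put(key, value);
    }

    @Override
    public Optional<String> get(String key) {
        return Optional.ofNullable(data.get(key));
    }
}

// Hypothetical metadata categories, following the timeline / file info split.
enum MetadataKind { TIMELINE, FILE_INFO }

// Routes each kind of metadata to the backend registered for it.
class RoutingStorage {
    private final Map<MetadataKind, MetadataStorage> routes = new EnumMap<>(MetadataKind.class);

    void route(MetadataKind kind, MetadataStorage storage) {
        routes.put(kind, storage);
    }

    void put(MetadataKind kind, String key, String value) {
        backendFor(kind).put(key, value);
    }

    Optional<String> get(MetadataKind kind, String key) {
        return backendFor(kind).get(key);
    }

    private MetadataStorage backendFor(MetadataKind kind) {
        MetadataStorage storage = routes.get(kind);
        if (storage == null) {
            throw new IllegalStateException("no backend registered for " + kind);
        }
        return storage;
    }
}
```

Registering the same backend for every kind gives the single-store setup, while registering different ones gives the "both" case, so callers never need to know which layout is in use.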
Why not just use plain JDBC - client/server is built into the JDBC protocol?
Did you mean whether it was necessary to bring in MyBatis? It eliminates almost all of the JDBC code and the manual setting of parameters and retrieval of results.
- developers only need to focus on the SQL logic
- it's convenient for code maintenance
SQL is stored outside of the code, which makes it more reusable and easier to maintain. MyBatis has good support for dynamic SQL and for preventing SQL injection.
@prasannarajaperumal Other than the RPC protocol consideration that @minihippo mentioned, with the generated models we'll gain flexibility in adapting to different metastores / catalogs like AWS Glue, DataHub, etc. for sync purposes. I discussed with @minihippo separately having something like hudi-metastore-proxy-bundle.jar to sync to those catalogs, which can consolidate the existing sync tools via common standardized models. This hasn't been added to the current RFC doc. @minihippo is working on publishing an updated RFC including all the planned future capabilities.
CI report:
- 53aa21bf23d2f8b0404743e6d016cfb2fac444f7 UNKNOWN
- 049b3baf0decd49a29dd96b73acbba6acb4d7997 Azure: FAILURE
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build
Is there a conflict between using metaserver and Hudi metadata, for example when using them at the same time? Hudi's metadata now supports not only FILES, but also COLUMN_STATS and BLOOM_FILTERS. Currently, metaserver only supports listing files and listing partitions. Will it cover the others as well?
Hi @prasannarajaperumal, following the comments on interface abstraction, I added the Hudi catalog design to the RFC-36 design doc https://github.com/apache/hudi/pull/4718.
To speed up landing the initial PR, and considering that a complete refactor would touch many basic entities and require detailed code review and discussion, this PR will only do a partial refactor, as follows. What do you think?

Hi @minihippo Picking this up again. Would appreciate a quick overview of the current status and what functionality is working as of this PR.
@hudi-bot run azure
CI passed