
[HUDI-3345][RFC-36] Hudi metastore server

Open minihippo opened this issue 3 years ago • 15 comments

What is the purpose of the pull request

A new RFC for the Hudi metastore server.

Committer checklist

  • [ ] Has a corresponding JIRA in PR title & commit

  • [ ] Commit message is descriptive of the change

  • [ ] CI is green

  • [ ] Necessary doc changes done or have another open PR

  • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

minihippo avatar Jan 29 '22 14:01 minihippo

CI report:

  • 3208c9fe7de1c45e12a07debdeaa30239aff23aa Azure: FAILURE

Bot commands: @hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Jan 29 '22 16:01 hudi-bot

@minihippo Picking this back up again. What are the next steps in our plan here?

vinothchandar avatar Mar 10 '22 19:03 vinothchandar

> @minihippo Picking this back up again. What are the next steps in our plan here?

@vinothchandar Thanks for the review,

  1. More details for the RFC
  2. I will submit a PR with the initial hudi-metastore module, supporting the basic functions, next week

minihippo avatar Mar 12 '22 02:03 minihippo

@minihippo Sounds good! We can revisit once you have the basic PR out

vinothchandar avatar Mar 30 '22 23:03 vinothchandar

@minihippo This is great work 👍. I think it can also solve a problem I recently hit, HUDI-3634, since commit instants are kept consistent in the Hudi metastore server.

But I'm curious: how does the Spark side get metadata for a Hudi table (stored in the Hudi metastore server) and a Hive table (stored in the HMS) in one query (e.g. a Hudi table joined with a Hive table)? Will we handle this in the HudiCatalog, fetching Hudi table metadata from the Hudi metastore server and Hive table metadata from the HMS, or will we provide a unified view in the Hudi metastore server and have it forward requests to the HMS when the table is a Hive table?

boneanxs avatar Mar 31 '22 11:03 boneanxs

Very valuable idea!

Further, maybe we can do more interesting things on top of this very valuable Hudi metastore server. It could help realize a Hudi Lake Manager that decouples Hudi ingestion from Hudi table services, including cleaner, archival, clustering, compaction, and any table service added in the future.

This lake manager could unify and automatically run services such as cleaner/clustering/compaction/archival (multi-writer and async) based on the metastore server.

Users would only need to care about their own ingestion pipeline and could leave all the table services to the manager, which automatically discovers and manages Hudi tables, thereby greatly reducing the operation and maintenance burden and the cost of onboarding.

Maybe we could expand this RFC, or raise a new RFC and take this metastore server as an input?

CC @yihua and @nsivabalan

zhangyue19921010 avatar Apr 18 '22 06:04 zhangyue19921010

> @minihippo This is great work 👍. I think it can also solve a problem I recently hit, HUDI-3634, since commit instants are kept consistent in the Hudi metastore server.
>
> But I'm curious: how does the Spark side get metadata for a Hudi table (stored in the Hudi metastore server) and a Hive table (stored in the HMS) in one query (e.g. a Hudi table joined with a Hive table)? Will we handle this in the HudiCatalog, fetching Hudi table metadata from the Hudi metastore server and Hive table metadata from the HMS, or will we provide a unified view in the Hudi metastore server and have it forward requests to the HMS when the table is a Hive table?

@boneanxs In ByteDance's in-house implementation, we do something closer to the second way. There is a proxy over the Hudi metastore server and the Hive metastore server, and the proxy routes each request to the corresponding server according to the table type.
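That routing proxy could be sketched roughly as below. All class and method names here are hypothetical illustrations, not Hudi or Hive APIs, and a simple in-memory map stands in for the real table-type lookup:

```java
// Hypothetical proxy that fronts both metastores and routes each metadata
// request by table type, as described above. Names are illustrative only.
import java.util.Map;

class MetastoreProxy {
    enum TableType { HUDI, HIVE }

    // Illustrative registry mapping "db.table" to its table type; in practice
    // this would be looked up from the catalog, not held in memory.
    private final Map<String, TableType> tableTypes;

    MetastoreProxy(Map<String, TableType> tableTypes) {
        this.tableTypes = tableTypes;
    }

    // Decide which backend serves metadata for this table: Hudi tables go to
    // the Hudi metastore server, everything else falls back to the HMS.
    String route(String dbTable) {
        return tableTypes.getOrDefault(dbTable, TableType.HIVE) == TableType.HUDI
                ? "hudi-metastore"
                : "hive-metastore";
    }
}
```

With this shape, query engines talk to a single endpoint and never need to know which store holds a given table.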

minihippo avatar Apr 25 '22 17:04 minihippo

> Very valuable idea!
>
> Further, maybe we can do more interesting things on top of this very valuable Hudi metastore server. It could help realize a Hudi Lake Manager that decouples Hudi ingestion from Hudi table services, including cleaner, archival, clustering, compaction, and any table service added in the future.
>
> This lake manager could unify and automatically run services such as cleaner/clustering/compaction/archival (multi-writer and async) based on the metastore server.
>
> Users would only need to care about their own ingestion pipeline and could leave all the table services to the manager, which automatically discovers and manages Hudi tables, thereby greatly reducing the operation and maintenance burden and the cost of onboarding.
>
> Maybe we could expand this RFC, or raise a new RFC and take this metastore server as an input?
>
> CC @yihua and @nsivabalan

@zhangyue19921010 Here it is: https://github.com/apache/hudi/pull/4309

minihippo avatar Apr 25 '22 17:04 minihippo

Yep, I read the RFC in https://github.com/apache/hudi/pull/4309. What I'm thinking is: could we expand its scope? Maybe make it a more common infrastructure, covering not only clustering/compaction but also clean, archive, and any other service in the future :)

zhangyue19921010 avatar Apr 26 '22 06:04 zhangyue19921010

@zhangyue19921010 Yes, it's on the list. Hi @yuzhaojing, could you cover this part in the RFC?

minihippo avatar Apr 26 '22 12:04 minihippo

On this RFC, I think the main thing is to decide the scope of the first phase. IMO, it can be limited to just Hudi tables for now, and depending on whether hudi.metastore.uris is configured or not, queries will either use this metaserver or not.
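A minimal sketch of that configured-or-not switch, assuming only that the key is named hudi.metastore.uris as mentioned above (the class itself is illustrative, not a Hudi API):

```java
// Sketch of the toggle: if hudi.metastore.uris is set to a non-empty value,
// queries go through the metaserver; otherwise they use the existing
// file-listing path. Only the key name comes from the discussion above.
import java.util.Properties;

class MetaserverToggle {
    static boolean useMetaserver(Properties conf) {
        String uris = conf.getProperty("hudi.metastore.uris");
        return uris != null && !uris.trim().isEmpty();
    }
}
```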

Does the RFC address high availability/sharding of metadata? Have you thought about these? If the metastore will also deal with locks, then the servers will become stateful. Maybe we can phase those as well? @minihippo, thoughts?

vinothchandar avatar Apr 26 '22 23:04 vinothchandar

@vinothchandar Sorry for the late reply. When designing the storage schema of the metadata store, tbl_id was included in every storage table so that metadata can be sharded by tbl_id, with all metadata of a table living in one shard. There are no problems with joins across shards.
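The tbl_id-based sharding could look roughly like this; the hash scheme and shard count are assumptions for illustration, not part of the design:

```java
// Illustration of tbl_id-based sharding: because every storage table carries
// tbl_id, all metadata rows of one table hash to the same shard, so queries
// over a single table never need cross-shard joins.
class MetadataSharding {
    static int shardFor(long tblId, int numShards) {
        // Math.floorMod keeps the shard index non-negative for any tbl_id.
        return Math.floorMod(Long.hashCode(tblId), numShards);
    }
}
```

Every request for a table's timeline, partitions, or snapshots resolves its shard from tbl_id alone, which keeps the servers horizontally scalable for reads.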

minihippo avatar Jun 07 '22 14:06 minihippo

Short-term plan (target 1.0)

Phase 1

Implement the basic functions

  1. Database and table store
  2. All actions (e.g. commit, compaction) and operations (e.g. upsert, compact, cluster)
  3. Timeline and instant meta store
  4. Partition and snapshot store
  5. Spark/Flink read/write based on the metastore
  6. Persistence of table/partition-level parameters (e.g. table config)

Phase 2

Extensions

  1. Schema store and support for schema evolution
  2. Concurrency support (will submit a new rfc)
  3. Hudi catalog
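To make the Phase 1 surface concrete, here is a purely hypothetical in-memory sketch of the database/table store plus the timeline/instant store; none of these names are real Hudi APIs, and a map stands in for the real backing storage:

```java
// Hypothetical in-memory stand-in for part of the Phase 1 scope: creating
// tables and recording/listing timeline instants per table.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InMemoryMetastore {
    // Keyed by "db.table"; the value is that table's ordered list of instants.
    private final Map<String, List<String>> timelines = new HashMap<>();

    void createTable(String db, String table) {
        timelines.putIfAbsent(db + "." + table, new ArrayList<>());
    }

    // Record a completed action (e.g. a commit) on the table's timeline.
    void addInstant(String db, String table, String instantTime) {
        timelines.computeIfAbsent(db + "." + table, k -> new ArrayList<>())
                 .add(instantTime);
    }

    List<String> listInstants(String db, String table) {
        return timelines.getOrDefault(db + "." + table, List.of());
    }
}
```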

minihippo avatar Jun 07 '22 14:06 minihippo