
[HUDI-3016][RFC-43] Proposal to implement Table Service Manager

Open · yuzhaojing opened this issue · 13 comments



yuzhaojing · Dec 14 '21

cc @prashantwason @nbalajee and folks at uber, who are looking into similar things.

vinothchandar · Mar 10 '22

@yuzhaojing Can we fold this under the #4718 proposal? It would be awesome to move

  • Metadata
  • Table service management
  • Locking

into a single, highly available metadata layer! This is very exciting for many reasons, which I can elaborate on.

vinothchandar · Mar 10 '22

cc @minihippo what do you think?

vinothchandar · Mar 10 '22

@prashantwason @nsivabalan Thanks for the review, I'll be updating the RFC next week, looking forward to more comments from you.

yuzhaojing · Mar 16 '22

Unless I am missing something here, can't the "storage" + "scheduler" + "execute" pieces directly reuse what we already have?

Just writing down the ideas in my head to see where we differ.

a) Introduce a config hoodie.skip.table.services to HoodieWriteConfig which will make all writers skip any scheduling + execution of table services, if true. Writers throw an error if a lock provider is not configured and hoodie.skip.table.services=true

b) The TableManagement server takes as input the URIs of the Hudi metastore (so we have a clean dependency: writers talk to the metastore, and the table management service picks up tables from the metastore). Or, if we don't want to require the metastore, we need to register tables as described here already. This can be a CLI command or an idempotent call from a write client. (I am fine either way.)

c) Then, for every write/commit on the table, the table management server is notified. We probably still need some polling, since pushes can be lost (or we could do a smarter hybrid?). In response, the table management server will schedule the relevant table services directly onto the table's timeline and notify a separate execution component/thread to start executing them.

d) We still need to find a way to do HA here. Unlike the metastore, we need to shard by table here, since we probably want just one server to do the scheduling/execution for every table?
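Point (a) above could be expressed as writer configs. A minimal sketch, assuming the (hypothetical) skip key proposed in this comment; the lock-provider key and class are existing Hudi config names:

```properties
# Hypothetical flag from this proposal -- not a final Hudi config name.
hoodie.skip.table.services=true

# A lock provider must be configured when table services are skipped;
# otherwise the writer should fail fast at startup.
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
```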

Let me know if that makes sense!
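The per-table sharding question in (d) can be sketched with stable hashing of the table id over the set of live TMS instances, so exactly one server owns the scheduling/execution for a given table. This is purely an illustrative sketch; none of these names exist in Hudi:

```python
import hashlib

def owner_of(table_id: str, servers: list) -> str:
    """Deterministically pick the one TMS instance that schedules/executes
    for a given table, so no two servers act on the same table."""
    digest = int(hashlib.md5(table_id.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

servers = ["tms-0", "tms-1", "tms-2"]
# Every caller computes the same owner for the same table.
assert owner_of("db.trips", servers) == owner_of("db.trips", servers)
```

A production version would use consistent hashing so that adding or removing an instance only moves a small fraction of tables.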

I think we need to implement the storage and the scheduling part, but the execution part can directly reuse what we already have, such as HoodieCompactor and HoodieClusteringJob.

a) Totally agree, that's exactly what I think.

b) When using the metastore, my idea is to detect table activity through a callback: when a hoodie table commits an instant, the TableManagement hook in the metastore triggers the scheduling of table services for that table. This avoids the pressure of TableManagement listing every table. If we don't want to use the metastore, then we need to support CLI commands or idempotent calls from the write client, since it should be possible to change registration information (such as the queue) without restarting the task.

c) Totally agree. We can use push only when the metastore is not enabled, and list the corresponding hoodie table when TableManagement has not received a notification for a long time.
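The push-with-polling-fallback idea in (b)/(c) can be sketched as: record the last push heard per table, and explicitly list any table that has been silent too long. All names here are hypothetical illustration:

```python
class HybridNotifier:
    """Hypothetical hybrid from this discussion: rely on push notifications,
    but fall back to listing a table's timeline if nothing has been heard
    from it for too long (pushes can be lost)."""

    def __init__(self, silence_threshold_secs: float):
        self.silence_threshold = silence_threshold_secs
        self.last_heard = {}  # table_id -> timestamp of last push

    def on_push(self, table_id: str, now: float):
        self.last_heard[table_id] = now

    def tables_to_poll(self, now: float):
        # Tables silent for longer than the threshold get an explicit listing.
        return [t for t, ts in self.last_heard.items()
                if now - ts > self.silence_threshold]

n = HybridNotifier(silence_threshold_secs=600)
n.on_push("db.trips", now=0)
assert n.tables_to_poll(now=1000) == ["db.trips"]
```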

d) I think we need stateless multiple instances to handle requests for all tables; each instance only handles incoming requests, and we can use an optimistic mechanism to prevent concurrent scheduling or execution. We can put this in phase 2.
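The optimistic mechanism in (d) can be sketched as a compare-and-swap on a per-table version: an instance may commit a schedule only if the version it read is still current, so stateless concurrent instances cannot double-schedule. Again, a hypothetical sketch, not Hudi code:

```python
class OptimisticScheduler:
    """Hypothetical optimistic-concurrency check: each table carries a
    version; scheduling succeeds only if the version read beforehand is
    still current when the schedule is committed."""

    def __init__(self):
        self.versions = {}  # table_id -> int

    def read_version(self, table_id: str) -> int:
        return self.versions.get(table_id, 0)

    def try_schedule(self, table_id: str, read_version: int) -> bool:
        # Compare-and-swap: succeed only if nobody scheduled in between.
        if self.versions.get(table_id, 0) != read_version:
            return False  # another instance won the race
        self.versions[table_id] = read_version + 1
        return True

s = OptimisticScheduler()
v = s.read_version("db.trips")
assert s.try_schedule("db.trips", v) is True   # first instance wins
assert s.try_schedule("db.trips", v) is False  # stale read is rejected
```

In a real deployment the version check and increment would need to be atomic in the shared backing store (e.g. a conditional update), not in process memory.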

yuzhaojing · May 07 '22

In a production environment with limited resources, the concept of priority is needed: different tables and different actions require different priorities. Should we provide a scheduling-strategy interface and use FIFO as the default?
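Such a pluggable scheduling-strategy interface, with FIFO as the default, might be sketched like this (hypothetical names; a priority queue with an insertion counter so equal priorities preserve FIFO order):

```python
import heapq
import itertools

class SchedulingStrategy:
    """Hypothetical pluggable interface: decides which pending table-service
    request runs next. Subclass and override priority() to change policy."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserves FIFO order

    def priority(self, request) -> int:
        return 0  # default: all requests equal, so ordering is pure FIFO

    def submit(self, request):
        heapq.heappush(self._heap, (self.priority(request), next(self._seq), request))

    def next_request(self):
        return heapq.heappop(self._heap)[2]

class CompactionFirst(SchedulingStrategy):
    """Example policy: compactions jump ahead of other actions."""
    def priority(self, request):
        return 0 if request["action"] == "compaction" else 1

s = CompactionFirst()
s.submit({"table": "a", "action": "clustering"})
s.submit({"table": "b", "action": "compaction"})
assert s.next_request()["table"] == "b"  # compaction jumps the queue
```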

yuzhaojing · May 11 '22

@vinothchandar @prashantwason Would like to hear your thoughts on this!

yuzhaojing · May 11 '22

CI report:

  • fbe27691b5d9de58128cc58158047a4df2b53750 UNKNOWN
  • 5125110c970ec0e22d3497e4bb3b65a8216a9f8d Azure: FAILURE

hudi-bot · May 18 '22

Taking this over, given @prashantwason is now on a break.

vinothchandar · Jun 08 '22

One high-level comment/ask (not sure if this is already doable): can regular writers take care of scheduling table services by themselves and only delegate execution to TMS? I am thinking from a deltastreamer standpoint. As of now, Hudi can achieve async non-blocking compaction/clustering because the regular writer takes care of scheduling (while the table is frozen, i.e. no other writes in flight) and delegates the execution to a separate thread. If we delegate the scheduling to TMS as well, we can't guarantee that there won't be any other inflight operation while scheduling compaction (in which case scheduling could fail). So, instead of a separate thread executing (as on master), we can let TMS handle the execution alone. This keeps the deltastreamer job lean (fewer resources), while TMS takes up the heavy compute of executing the table service.

From what I glean, only if scheduling is done by TMS does the plan go into its backing storage, which the scheduler polls. If scheduling is done by the regular writer, I am not sure how this would be handled. Sorry, I haven't taken a look at the implementation yet.

Is this functionality supported? If not, we should look to support it.
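The two delegation models being debated here can be sketched side by side: the writer scheduling and handing only the plan to TMS, versus TMS doing both. All classes and methods below are hypothetical illustration, not the Hudi API:

```python
class Writer:
    """Hypothetical regular writer (e.g. deltastreamer)."""
    def schedule_compaction(self, table):
        # Scheduling is safe here: the writer knows no other delta commit
        # is inflight on this table at this moment.
        return {"table": table, "action": "compaction"}

class TMS:
    """Hypothetical table management service."""
    def __init__(self):
        self.executed = []

    def execute(self, plan):
        self.executed.append(plan)  # heavy compute happens inside TMS

    def schedule_and_execute(self, table):
        # Full delegation: TMS would consult the metaserver for inflight
        # commits before generating the plan (not modeled here).
        self.execute({"table": table, "action": "compaction"})

# Execute-only delegation: writer schedules, TMS executes the plan.
writer, tms = Writer(), TMS()
tms.execute(writer.schedule_compaction("db.trips"))
assert tms.executed[0]["action"] == "compaction"
```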

nsivabalan · Oct 05 '22

In fact, this could be a problem with other writers as well. When scheduling compaction, hudi expects no other delta commits to be in flight. So it is better to let one of the writers take care of scheduling and delegate the execution to TMS. This is probably specific to compaction; wanted to flag it just in case.

nsivabalan · Oct 05 '22

if we delegate the scheduling also to TMS, we can't guarantee that there won't be any other inflight operation while scheduling compaction

@nsivabalan TMS is a downstream listener to the metaserver (or to whatever timeline server is used if there is no metaserver), so TMS is aware of all inflight commits on the registered tables, and we should use that info to generate plans.

can regular writers take care of scheduling table services by themselves and only delegate execution to TMS?

Let's clarify: TMS is responsible for plan generation and scheduling, while execution is sent to a separate cluster, and TMS monitors the execution. You're basically proposing to delegate only execution; that's possible, but it doesn't really make use of TMS's capabilities. I'd suggest full delegation to TMS: since we're starting a server, making full use of it will justify the design and the cost.

xushiyan · Oct 27 '22

@xushiyan @yuzhaojing @danny0405 Hi, can version 1.0 support this feature? It is very necessary. Please push the progress forward.

zyclove · Dec 24 '23