flink WIP: [FLINK-XXXXX] Task local recovery for the reactive mode.

WIP

Dec 29 '21 16:12 dmvk

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit ba54b091609f7e51792a911192e89e0eb7b6d1d7 (Wed Dec 29 16:50:01 UTC 2021)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!
Invalid pull request title: No valid Jira ID provided

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description: approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all: approve all aspects
@flinkbot approve-until architecture: approve everything until architecture
@flinkbot attention @username1 [@username2 ..]: require somebody's attention
@flinkbot disapprove architecture: remove an approval you gave earlier

Dec 29 '21 16:12 flinkbot

CI report:

62dfe5d08014b63db77e75e21d17cb22d1692848 Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

Dec 29 '21 16:12 flinkbot

@tillrohrmann Thanks for the first pass on the overall direction!

I was thinking about the algorithmic complexity, and I think we can do some optimizations by leveraging a few facts:

For assigning slots to "execution groups": We treat each vertex in a slot equally. In the simple case where all vertices have the same parallelism, we can simply count the "weight" (the number of key groups) of a single vertex. 🤔
For splitting slots among slot sharing groups: We can leverage the fact that the groups had disjoint sets of slots in the previous run.

I'll probably finish the test suite first, and then we can focus on performance. But something along these lines might do the trick.
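
A minimal sketch of the first idea, assuming all vertices in a slot share the same parallelism so that the key-group range of a single vertex is enough to weigh a slot. None of this is taken from the PR or from Flink's scheduler: `LocalRecoverySketch`, `ranges`, `overlap`, `assign` and the `slot-N` ids are hypothetical names, the even range split is an assumption, and the greedy matching is just one simple way to prefer slots that already hold most of the required state.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class LocalRecoverySketch {

    /** Number of key groups shared by two inclusive ranges {start, end}. */
    public static int overlap(int[] a, int[] b) {
        return Math.max(0, Math.min(a[1], b[1]) - Math.max(a[0], b[0]) + 1);
    }

    /** Assumed even split of key groups [0, maxParallelism) into inclusive ranges. */
    public static int[][] ranges(int maxParallelism, int parallelism) {
        int[][] result = new int[parallelism][];
        for (int i = 0; i < parallelism; i++) {
            int start = (i * maxParallelism + parallelism - 1) / parallelism;
            int end = ((i + 1) * maxParallelism - 1) / parallelism;
            result[i] = new int[] {start, end};
        }
        return result;
    }

    /**
     * Greedily maps each new subtask to a surviving slot, preferring pairs with
     * the largest key-group overlap ("weight"). Returns subtask index -> slot id.
     */
    public static Map<Integer, String> assign(Map<String, int[]> previous, int[][] next) {
        // Sort slot ids so that ties are broken deterministically.
        List<String> slots = new ArrayList<>(new TreeMap<>(previous).keySet());

        // Each candidate is {overlap, subtask index, slot index}.
        List<int[]> candidates = new ArrayList<>();
        for (int subtask = 0; subtask < next.length; subtask++) {
            for (int s = 0; s < slots.size(); s++) {
                int weight = overlap(next[subtask], previous.get(slots.get(s)));
                candidates.add(new int[] {weight, subtask, s});
            }
        }
        // Highest overlap first, then lowest subtask index, then lowest slot index.
        candidates.sort(
                Comparator.<int[]>comparingInt(c -> -c[0])
                        .thenComparingInt(c -> c[1])
                        .thenComparingInt(c -> c[2]));

        Map<Integer, String> result = new TreeMap<>();
        Set<Integer> usedSlots = new HashSet<>();
        for (int[] c : candidates) {
            if (result.containsKey(c[1]) || usedSlots.contains(c[2])) {
                continue; // subtask already placed or slot already taken
            }
            usedSlots.add(c[2]);
            result.put(c[1], slots.get(c[2]));
        }
        return result;
    }
}
```

For example, with 8 key groups, a previous parallelism of 4 and one lost slot, the surviving slots might hold ranges `[0,1]`, `[4,5]` and `[6,7]`; calling `assign` with those and `ranges(8, 3)` pairs each new subtask with the slot it overlaps most. The sketch builds every subtask/slot pair, so it is quadratic and says nothing yet about the second point (slot sharing groups) or TM locality.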

I'm also wondering whether it would make a difference if we leveraged TM locality as well 🤔 (this could also be a follow-up).

Dec 30 '21 08:12 dmvk

Interesting feature! Thanks for your great work, @dmvk.

Do we have a Jira issue to track this, or any detailed description or FLIP? I'm interested in this, particularly the task recovery and scheduling aspects.

Jan 11 '22 14:01 zuston

Hi @zuston, this has been roughly outlined in https://issues.apache.org/jira/browse/FLINK-21450, but there is currently no detailed description / FLIP. This was meant as a PoC, but I think I already have a better idea of how this should be implemented, so I can try to derive an outline from that.

Capacity-wise, I probably won't be able to finish this for 1.15, but it will most likely be really high on the priority list for the next release cycle.

Jan 11 '22 14:01 dmvk
