WIP: [FLINK-XXXXX] Task local recovery for the reactive mode.
WIP
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.
Automated Checks
Last check on commit ba54b091609f7e51792a911192e89e0eb7b6d1d7 (Wed Dec 29 16:50:01 UTC 2021)
Warnings:
- No documentation files were touched! Remember to keep the Flink docs up to date!
- Invalid pull request title: No valid Jira ID provided
Mention the bot in a comment to re-run the automated checks.
Review Progress
- ❓ 1. The [description] looks good.
- ❓ 2. There is [consensus] that the contribution should go into Flink.
- ❓ 3. Needs [attention] from.
- ❓ 4. The change fits into the overall [architecture].
- ❓ 5. Overall code [quality] is good.
Please see the Pull Request Review Guide for a full explanation of the review process.
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
- @flinkbot approve all to approve all aspects
- @flinkbot approve-until architecture to approve everything until architecture
- @flinkbot attention @username1 [@username2 ..] to require somebody's attention
- @flinkbot disapprove architecture to remove an approval you gave earlier
CI report:
- 62dfe5d08014b63db77e75e21d17cb22d1692848 Azure: FAILURE
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot run azure to re-run the last Azure build
@tillrohrmann Thanks for the first pass on the overall direction!
I was thinking about the algorithmic complexity and I think we can do some optimizations by leveraging a few facts:
- For assigning slots to "execution groups": We treat each vertex in a slot equally. Considering the simple case where all vertices have the same parallelism, we can simply count the "weight" / "# of key groups" of a single vertex. 🤔
- For splitting slots between slot sharing groups: We can leverage the fact that they had a disjoint set of slots in the previous run.
I'll probably continue with finishing the test suite first and then we can focus on the performance. But something along these lines might do the trick.
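To make the first point concrete, here is a minimal sketch of weighting by the key groups of a single vertex. It is not the code from this PR: the class `KeyGroupOverlapSketch`, its methods, and the greedy matching are hypothetical, and the contiguous key-group split in `range` is an assumption made for illustration. Because all vertices in a slot are assumed to share the same parallelism, the overlap is computed once per (new subtask, previous subtask) pair instead of once per vertex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: match new subtasks to previous slots by key-group overlap. */
public class KeyGroupOverlapSketch {

    /** Inclusive key-group range [start, end] of one subtask (assumed contiguous split). */
    static int[] range(int maxParallelism, int parallelism, int subtaskIndex) {
        int start = (subtaskIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((subtaskIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    /** Number of key groups shared by two inclusive ranges: the "weight" of a match. */
    static int overlap(int[] a, int[] b) {
        return Math.max(0, Math.min(a[1], b[1]) - Math.max(a[0], b[0]) + 1);
    }

    /**
     * Greedily maps each new subtask index to the previous subtask index (and hence its
     * slot) with the largest key-group overlap. Each previous slot is used at most once;
     * -1 means no previous slot was left for that subtask.
     */
    static int[] assign(int maxParallelism, int oldParallelism, int newParallelism) {
        // Candidate triples: {newIndex, oldIndex, weight}, only for non-empty overlaps.
        List<int[]> candidates = new ArrayList<>();
        for (int n = 0; n < newParallelism; n++) {
            int[] newRange = range(maxParallelism, newParallelism, n);
            for (int o = 0; o < oldParallelism; o++) {
                int weight = overlap(newRange, range(maxParallelism, oldParallelism, o));
                if (weight > 0) {
                    candidates.add(new int[] {n, o, weight});
                }
            }
        }
        // Heaviest overlap first; ties broken by new index, then old index.
        candidates.sort((a, b) -> {
            if (a[2] != b[2]) return Integer.compare(b[2], a[2]);
            if (a[0] != b[0]) return Integer.compare(a[0], b[0]);
            return Integer.compare(a[1], b[1]);
        });
        int[] newToOld = new int[newParallelism];
        Arrays.fill(newToOld, -1);
        boolean[] oldUsed = new boolean[oldParallelism];
        for (int[] c : candidates) {
            if (newToOld[c[0]] == -1 && !oldUsed[c[1]]) {
                newToOld[c[0]] = c[1];
                oldUsed[c[1]] = true;
            }
        }
        return newToOld;
    }
}
```

With 128 key groups, `assign(128, 4, 2)` returns `[0, 2]` (scale down: each new subtask reuses one of the two previous slots it overlaps) and `assign(128, 2, 3)` returns `[0, -1, 1]` (scale up: the middle subtask gets a fresh slot). The pairwise loop is quadratic; since the ranges are contiguous and ordered, a linear sweep should be possible, but that is left out of this sketch.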
I'm also wondering whether it would make a difference if we also leveraged TM locality 🤔 (this could also be a follow-up)
Interesting feature! Thanks for your great work, @dmvk.
Do we have a Jira issue to track this, or any detailed description or FLIP? I’m interested in it, particularly in task recovery and scheduling.
Hi @zuston, this has been roughly outlined in https://issues.apache.org/jira/browse/FLINK-21450, but there is currently no detailed description / FLIP. This was meant as a PoC, but I think I already have a better idea of how this should be implemented, so I can try to derive some outlines from that.
Capacity-wise, I probably won't be able to finish this for 1.15, but this will most likely be really high on the priority list for the next release cycle.