flink-kubernetes-operator [FLINK-31215] [autoscaler] Backpropagate processing rate limits from non-scalable bottlenecks to upstream operators

Open aplyusnin opened this issue 7 months ago • 12 comments

What is the purpose of the change

This pull request adds logic for backpropagating processing rate limits from non-scalable bottlenecks to upstream operators, which can reduce the parallelism of backpressured vertices after scaling.

Brief change log

Introduce an option that enables backpropagation checks during autoscaling
Update the scaling functions to detect potential bottlenecks
Scale the target capacity of each vertex by a coefficient
The coefficient is chosen so that the job's bottlenecks are scaled up as far as possible without exceeding their max parallelism.
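
To make the coefficient idea concrete, here is a minimal sketch of one way such a factor could be computed. It is not taken from the PR: the class `BackpropagationSketch`, the method `scaleFactors`, and its parameters are hypothetical, and it assumes a simplified model in which a vertex that needs more parallelism than its max parallelism limits every vertex upstream of it by the same ratio.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; these are not the operator's actual classes. */
final class BackpropagationSketch {

    /**
     * Computes a factor in (0, 1] per vertex by which its target capacity could be multiplied.
     *
     * @param topoOrder vertex ids in topological order (sources first)
     * @param downstream vertex id mapped to the ids of its direct downstream vertices
     * @param requiredParallelism parallelism each vertex would need to reach its target capacity
     * @param maxParallelism upper bound on parallelism for each vertex
     */
    static Map<String, Double> scaleFactors(
            List<String> topoOrder,
            Map<String, List<String>> downstream,
            Map<String, Integer> requiredParallelism,
            Map<String, Integer> maxParallelism) {
        Map<String, Double> factors = new HashMap<>();
        // Walk sinks first so that every downstream factor is known before its upstream vertex.
        for (int i = topoOrder.size() - 1; i >= 0; i--) {
            String vertex = topoOrder.get(i);
            // A vertex that cannot reach its required parallelism is a non-scalable bottleneck.
            double factor =
                    Math.min(1.0, (double) maxParallelism.get(vertex) / requiredParallelism.get(vertex));
            // Assumption: an upstream vertex is limited by its most constrained downstream vertex.
            for (String next : downstream.getOrDefault(vertex, List.of())) {
                factor = Math.min(factor, factors.get(next));
            }
            factors.put(vertex, factor);
        }
        return factors;
    }

    public static void main(String[] args) {
        // source -> map -> sink, where the sink is capped at parallelism 8 but would need 16.
        List<String> order = List.of("source", "map", "sink");
        Map<String, Double> factors =
                scaleFactors(
                        order,
                        Map.of("source", List.of("map"), "map", List.of("sink")),
                        Map.of("source", 4, "map", 8, "sink", 16),
                        Map.of("source", 128, "map", 128, "sink", 8));
        for (String vertex : order) {
            System.out.println(vertex + " -> " + factors.get(vertex));
        }
    }
}
```

In this example the sink can only reach half of its required parallelism, so all three vertices receive a factor of 0.5, and the map vertex would be scaled to parallelism 4 instead of 8. The actual change may weight the propagation differently, for example by per-edge output ratios.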

Verifying this change

This change added tests and can be verified as follows:

Extended the existing tests in JobVertexScalerTest to check the updated vertex exclusion logic and the effect of the backpropagation scale factor
Extended ScalingExecutorTest with tests covering backpropagation on different jobs and vertex exclusion
Manually verified on different jobs, using max parallelism configurations that cause bottlenecks to appear and different sets of excluded vertices

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., are there any changes to the CustomResourceDescriptors: no
Core observer or reconciler logic that is regularly executed: no

Documentation

Does this pull request introduce a new feature? yes
If yes, how is the feature documented? yes, the brief explanation is here: https://docs.google.com/document/d/1CWT4Q_rv0_adba0nUoSFTpzvz1mzb7diUnGd2w4S074/edit?usp=sharing

Jun 30 '24 19:06 aplyusnin
