[FLINK-31215] [autoscaler] Backpropagate processing rate limits from non-scalable bottlenecks to upstream operators

Open aplyusnin opened this issue 7 months ago • 12 comments

What is the purpose of the change

This pull request adds logic for backpropagating processing rate limits from non-scalable bottlenecks to upstream operators, which can reduce the parallelism of backpressured vertices after scaling.

Brief change log

  • Introduce an option for enabling backpropagation checks during autoscaling
  • Update scaling functions to detect potential bottlenecks
  • Scale the target capacity of each vertex by a coefficient
  • Choose this coefficient so that the job's bottlenecks are scaled up as far as possible without exceeding their max parallelism
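
The coefficient described above can be illustrated with a minimal sketch. This is not the code from this pull request: the class `BackpropagationSketch`, the `VertexLoad` type, and both method names are hypothetical, and the sketch assumes a single job-wide coefficient rather than propagation along individual edges of the job graph. It only shows the idea that a vertex capped by max parallelism limits the rate that upstream vertices need to be sized for.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the autoscaler's actual data model or API.
class BackpropagationSketch {

    /** Assumed per-vertex inputs: desired rate, measured rate per subtask, parallelism cap. */
    static final class VertexLoad {
        final double targetRate;
        final double ratePerSubtask;
        final int maxParallelism;

        VertexLoad(double targetRate, double ratePerSubtask, int maxParallelism) {
            this.targetRate = targetRate;
            this.ratePerSubtask = ratePerSubtask;
            this.maxParallelism = maxParallelism;
        }
    }

    /** Largest fraction (at most 1.0) of the target load that every vertex can sustain at its max parallelism. */
    static double backpropagationFactor(Map<String, VertexLoad> vertices) {
        double factor = 1.0;
        for (VertexLoad v : vertices.values()) {
            double achievable = v.maxParallelism * v.ratePerSubtask;
            if (achievable < v.targetRate) {
                // This vertex is a non-scalable bottleneck: it limits the whole job.
                factor = Math.min(factor, achievable / v.targetRate);
            }
        }
        return factor;
    }

    /** Parallelism needed for the scaled-down target, clamped to [1, maxParallelism]. */
    static int scaledParallelism(VertexLoad v, double factor) {
        int needed = (int) Math.ceil(v.targetRate * factor / v.ratePerSubtask);
        return Math.max(1, Math.min(needed, v.maxParallelism));
    }

    public static void main(String[] args) {
        Map<String, VertexLoad> job = new LinkedHashMap<>();
        job.put("source", new VertexLoad(1000.0, 100.0, 20));
        job.put("sink", new VertexLoad(1000.0, 50.0, 10)); // can reach only 500 records/s

        double factor = backpropagationFactor(job);
        System.out.println("factor = " + factor);
        for (Map.Entry<String, VertexLoad> e : job.entrySet()) {
            System.out.println(e.getKey() + " -> " + scaledParallelism(e.getValue(), factor));
        }
    }
}
```

In this example the sink can sustain only half of the target rate at its max parallelism, so the coefficient is 0.5 and the source is sized for 5 subtasks instead of the 10 it would get without backpropagation, while the sink stays at its cap of 10.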

Verifying this change

This change added tests and can be verified as follows:

  • Extended existing tests in JobVertexScalerTest to check the updated logic for vertex exclusion and the effect of the backpropagation scale factor
  • Extended ScalingExecutorTest with tests for backpropagation on different jobs and for vertex exclusion
  • Manually verified on several jobs whose max parallelism settings produce bottlenecks, using different sets of excluded vertices

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? A brief explanation is available here: https://docs.google.com/document/d/1CWT4Q_rv0_adba0nUoSFTpzvz1mzb7diUnGd2w4S074/edit?usp=sharing

aplyusnin, Jun 30 '24 19:06