flinkk8soperator
Scaling Orchestration
Hi,
I have a question regarding scaling the parallelism value.
For now, we are doing the manual way, whereby we edit the FlinkApplication YAML to change the parallelism value, apply the YAML, and let the operator transition from the old job cluster to the newly created job cluster.
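As a rough illustration of the manual step described above, the parallelism change can be expressed as a JSON merge patch and handed to kubectl. This is only a sketch: it assumes the parallelism lives at spec.parallelism in the FlinkApplication resource, and the helper names, application name, and namespace are hypothetical.

```python
import json
from typing import List


def parallelism_patch(new_parallelism: int) -> str:
    """Build a JSON merge patch that changes only the parallelism.

    Assumption: the FlinkApplication resource keeps the value at spec.parallelism.
    """
    if new_parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return json.dumps({"spec": {"parallelism": new_parallelism}})


def kubectl_patch_command(name: str, namespace: str, new_parallelism: int) -> List[str]:
    """Return the kubectl argument list; nothing is executed here."""
    return [
        "kubectl", "patch", "flinkapplication", name,
        "-n", namespace,
        "--type", "merge",
        "-p", parallelism_patch(new_parallelism),
    ]


if __name__ == "__main__":
    # Hypothetical application name and namespace, for illustration only.
    print(" ".join(kubectl_patch_command("my-flink-app", "flink", 8)))
```

Running the resulting command has the same effect as editing the YAML and applying it: the operator sees the changed spec and performs its usual transition to a new job cluster.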
Is there any recommended way to automatically orchestrate the above, similar to how HPA works in K8S, whereby HPA will kick in once the metrics exceed a certain value?
I am not aware of anything available (or straightforward) that achieves this. We have wanted to build functionality like this into the operator, but we have yet to design it.
Hi, we would also be interested in an auto-scaling feature for the operator, since we often run into insufficient resources, which results in high back-pressure and requires manual action.
Do you have some design yet? Or would you prefer to collaborate on this design?
One idea is to compare the Kafka message timestamp (or a custom message payload timestamp) with the current time. This has some obvious drawbacks, of course, but as an MVP I think it's excellent, as it would be optional.
- the flink application would have to publish a metric, which computes the time difference
- the k8s operator would just fetch this metric and would up-scale the deployment if needed
- another issue is that it would have no way of knowing when to down-scale, but as an MVP, it's good (note that in our environment, the amount of processed messages consistently increases, so in our case, an up-scale-only auto-scaler would be much appreciated)
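The up-scale-only idea in the list above could be sketched roughly as follows. This is not the operator's implementation; the function names, the 60-second threshold, the step size, and the parallelism cap are all assumptions chosen for illustration.

```python
import time
from typing import Optional


def lag_seconds(message_timestamp_ms: int, now_ms: Optional[int] = None) -> float:
    """Difference between the current time and the message timestamp, in seconds.

    This is the value the Flink application would publish as a metric.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero so clock skew never yields a negative lag.
    return max(0.0, (now_ms - message_timestamp_ms) / 1000.0)


def next_parallelism(
    current: int,
    lag_s: float,
    lag_threshold_s: float = 60.0,  # assumed threshold
    step: int = 2,                  # assumed increment per scaling decision
    max_parallelism: int = 64,      # assumed upper bound
) -> int:
    """Up-scale only: raise parallelism when lag exceeds the threshold.

    The result is never lower than `current`, matching the MVP's lack of
    a down-scale signal.
    """
    if lag_s <= lag_threshold_s:
        return current
    return max(current, min(max_parallelism, current + step))
```

For example, with the defaults a job at parallelism 4 stays at 4 for a 30-second lag and moves to 6 for a 120-second lag. A real controller would also need a cool-down between decisions, since each parallelism change restarts the job.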
It seems this thread is out of date. In May 2021, the Flink team introduced Reactive Mode with Flink v1.13. This supports horizontal pod autoscaling as long as HPA is supported within your Kubernetes cluster (with metrics-server deployed as well).
If I'm thinking through this correctly, we already have access to the Flink configuration updates to enable reactive mode:
scheduler-mode: reactive
and by deploying a container with Flink v1.13+. The only additional need would be for the operator to include an HPA when it creates the necessary resources, along with any RBAC rules required to create it.
I can test attaching an HPA after the fact, though I haven't yet. In theory this should be enough to get it working, unless there are other operator considerations for managing the deployment with HPA.
More info on Reactive Mode.
@JRemitz Sounds good. I guess there is one scenario: the deployment object created by the operator will need min/max, no?
Yes, really the entire HPA definition for CPU/memory rules. I'm not overly familiar with the CRD structure. Can you reference existing API objects as complex structures for validation?
Check this PR for reference: https://github.com/lyft/flinkk8soperator/pull/233/files
https://github.com/lyft/flinkk8soperator/blob/master/deploy/crd.yaml
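For reference, the kind of HPA definition discussed above (min/max replicas plus a CPU rule) might look roughly like the manifest built below. This is a sketch under assumptions: it uses the autoscaling/v2 API (older clusters may only offer autoscaling/v2beta2), it targets a TaskManager Deployment whose name is hypothetical, and the 70% CPU target is an arbitrary example value. It is not how the operator creates resources.

```python
import json


def hpa_manifest(
    deployment_name: str,
    namespace: str,
    min_replicas: int,
    max_replicas: int,
    cpu_utilization: int = 70,  # assumed target, percent of requested CPU
) -> dict:
    """Build a HorizontalPodAutoscaler manifest for an existing Deployment."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": deployment_name + "-hpa", "namespace": namespace},
        "spec": {
            # The Deployment to scale; in Reactive Mode this would be
            # the TaskManager deployment.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment_name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_utilization,
                        },
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical deployment name and namespace.
    print(json.dumps(hpa_manifest("my-flink-app-tm", "flink", 2, 10), indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, which is one way to try attaching an HPA after the fact. A memory rule would be a second entry in the metrics list with the resource name memory.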