
Improve architecture for horizontal scaling

Open spender0 opened this issue 2 years ago • 5 comments

Hello!

  • Vote on this issue by adding a 👍 reaction
  • If you want to implement this feature, comment to let us know (we'll work with you on design, scheduling, etc.)

Issue details

Hello Pulumi team! I've been using Pulumi for years and recently started using the Pulumi Kubernetes Operator. With 40+ stacks based on the same TypeScript npm project, all managed by a single operator installation, I ran into some design problems in the operator.

When it runs `npm install` and the Pulumi program for several stacks, it consumes a lot of CPU and memory, but this happens only after git changes. Most of the time there are no git changes and the operator pod is idle. Still, I need to set CPU and memory requests high enough to avoid an OOM kill, so the pod's resources sit underutilized. It is burning money most of the time.

[Screenshot attached: 2023-11-03 at 12:50:09 PM]

The problem is partly related to https://github.com/pulumi/pulumi-kubernetes-operator/issues/368. When I set small resource requests, the operator got OOMKilled during infrastructure provisioning, and the stack state file was left locked by the interrupted update.

In addition to the resource problem, it is not possible to scale the operator deployment horizontally to speed up syncing a large number of stacks. Only one pod can work on stacks at any given moment; this is enforced by Kubernetes lease locking.

As a solution, I would decouple the `npm install` and `pulumi up` functionality from the operator pod into worker pods. The operator could assign a worker pod to an individual stack to provision it, and once the stack is done, the worker would terminate to save costs. The operator pod would act purely as a controller for stacks and worker pods. This would make the Pulumi Operator scalable enough for big platforms with hundreds or thousands of stacks.
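A minimal sketch of what such a per-stack worker could look like as a Kubernetes Job. Everything here is hypothetical (the Job name, image, command, and sizing are illustrative, not part of the operator's API):

```yaml
# Hypothetical per-stack worker Job: the operator would create one of these
# per stack that needs reconciling; the pod dies when the update finishes.
apiVersion: batch/v1
kind: Job
metadata:
  name: stack-worker-my-stack            # one Job per stack (name is illustrative)
spec:
  ttlSecondsAfterFinished: 60            # clean up finished pods to save costs
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: pulumi/pulumi:latest    # image is illustrative
          command: ["sh", "-c", "npm install && pulumi up --yes --stack my-stack"]
          resources:
            requests: { cpu: "1", memory: "2Gi" }   # sized per stack, not per operator
            limits:   { memory: "2Gi" }
```

With this shape, resource requests are paid only while a stack is actually being provisioned, and the scheduler can spread concurrent workers across the cluster.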

I would be glad to provide additional information, just let me know.

Affected area/feature

spender0 avatar Nov 03 '23 11:11 spender0

cc @rquitales @EronWright for your awareness.

mikhailshilkov avatar Nov 03 '23 12:11 mikhailshilkov

Note that this is similar to (or potentially ultimately the same as) what’s discussed in https://github.com/pulumi/pulumi-kubernetes-operator/issues/78 (run the deployments as Jobs)

https://github.com/pulumi/pulumi-kubernetes-operator/issues/434 Is another even more extreme option for separating the deployments from the operator compute (running them in Pulumi Deployments instead of directly inside the cluster).

lukehoban avatar Feb 07 '24 14:02 lukehoban

How is concurrency limited when handling stacks all changing at the same time? If OP has 40+ stacks and they're all being refreshed/updated at the same time, would some simple concurrency controls smooth the spike out over a longer time?

danielloader avatar Feb 10 '24 11:02 danielloader

> How is concurrency limited when handling stacks all changing at the same time? If OP has 40+ stacks and they're all being refreshed/updated at the same time, would some simple concurrency controls smooth the spike out over a longer time?

I set the MAX_CONCURRENT_RECONCILES variable on the operator pod to 4. If I set a higher value, e.g. 10, the operator consumes far more resources and gets OOM-killed unless I dedicate even more memory to the pod. That leads to burning money, because most of the time the pod is doing nothing while there are no changes in the stacks.

If I leave MAX_CONCURRENT_RECONCILES=4, updates are too slow when all stacks receive a change.
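For reference, here is roughly how that tuning looks on the operator Deployment. This is a minimal fragment; the container name, resource sizing, and the value of 4 are just this thread's example, not recommended settings:

```yaml
# Fragment of the operator Deployment: caps concurrent stack reconciles.
spec:
  template:
    spec:
      containers:
        - name: pulumi-kubernetes-operator
          env:
            - name: MAX_CONCURRENT_RECONCILES
              value: "4"   # higher risks OOM kills; lower slows mass updates
          resources:
            requests: { cpu: "2", memory: "4Gi" }   # sizing is illustrative
```

The tension described above falls out of this directly: the env var and the resource requests must be tuned together, and both are provisioned for the worst-case burst rather than the idle steady state.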

spender0 avatar Feb 10 '24 11:02 spender0

Good to know for someone new to the operator; I was just thinking out loud about the concurrency, but it makes sense that the update is too slow as well. Given those requirements, it does feel like pushing those sessions out to Job pods, so they can be distributed on demand across the wider cluster, makes sense.

danielloader avatar Feb 10 '24 11:02 danielloader

Added to epic https://github.com/pulumi/pulumi-kubernetes-operator/issues/586

cleverguy25 avatar Aug 02 '24 18:08 cleverguy25

Good news everyone, we just released a preview of Pulumi Kubernetes Operator v2. This new release has an all-new architecture that provides much better horizontal scalability.

Please read the announcement blog post for more information: https://www.pulumi.com/blog/pulumi-kubernetes-operator-2-0/

Would love to hear your feedback! Feel free to engage with us on the #kubernetes channel of the Pulumi Slack workspace. cc @spender0 @Recrout @hghtwr @miguelteixeiraa @lunacrafts

EronWright avatar Oct 23 '24 20:10 EronWright