
Different deployment mode to support good performance for 10k Apps

Open showpune opened this issue 3 years ago • 3 comments

Describe the problem/challenge you have: I need to use kapp-controller in one cluster to serve deployments for 200 customers, each with about 50 Apps. In total we have more than 10,000 Apps in the cluster.

  1. The reconcile queue backs up because there are so many Apps.
  2. Adding more pods does not help: all pods share the same queue, so they conflict.
  3. If one customer starts a lot of Apps, the other customers have to wait. I would like each customer to have its own queue and quota.
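The third point can be illustrated with a toy model. This is only a sketch, not kapp-controller's actual work queue: it assumes a single worker per queue and a fixed, made-up cost of one second per reconcile, and the customer and App names are hypothetical.

```python
from collections import defaultdict


def start_times_shared(jobs, seconds_per_job):
    """One FIFO queue, one worker: a job starts once every earlier job is done."""
    return {job: index * seconds_per_job for index, job in enumerate(jobs)}


def start_times_per_customer(jobs, seconds_per_job):
    """One FIFO queue and one worker per customer: a job waits only for
    earlier jobs that belong to the same customer."""
    position = defaultdict(int)
    starts = {}
    for customer, app in jobs:
        starts[(customer, app)] = position[customer] * seconds_per_job
        position[customer] += 1
    return starts


# Hypothetical workload: customer-a enqueues 300 Apps, then customer-b enqueues one.
jobs = [("customer-a", f"app-{i}") for i in range(300)] + [("customer-b", "app-0")]

shared = start_times_shared(jobs, seconds_per_job=1.0)
isolated = start_times_per_customer(jobs, seconds_per_job=1.0)

print(shared[("customer-b", "app-0")])    # 300.0 seconds behind customer-a
print(isolated[("customer-b", "app-0")])  # 0.0, starts immediately
```

In the shared model the wait for customer-b grows with customer-a's backlog; with per-customer queues it depends only on customer-b's own Apps.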

Describe the solution you'd like: Add deployment modes to kapp-controller. Three modes would be supported: normal, cluster, and namespace.

  1. Normal: the current implementation.
  2. Cluster: reconciles only cluster resources, like Package and PackageRepository.
  3. Namespace: each deployment defines its work namespace through an environment parameter in the deployment, and reconciles only the resources in the given namespace, like App and …

In this way, each namespace deployment processes the CRDs in its own namespace, and the cluster deployment processes the cluster CRDs.
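The three modes could be sketched as a filter that decides whether a given controller instance should reconcile a resource. This is a rough illustration under assumptions, not code from kapp-controller or the fork: the environment variable names `DEPLOYMENT_MODE` and `WORK_NAMESPACE` are hypothetical, and which kinds count as cluster resources simply follows the list above.

```python
import os
from dataclasses import dataclass

# Kinds the proposal treats as cluster resources (taken from the issue text).
CLUSTER_KINDS = {"Package", "PackageRepository"}


@dataclass(frozen=True)
class Resource:
    kind: str
    namespace: str


def should_reconcile(mode, resource, work_namespace=None):
    """Return True if a controller running in `mode` should handle `resource`."""
    if mode == "normal":
        # Current behaviour: one deployment reconciles everything.
        return True
    if mode == "cluster":
        return resource.kind in CLUSTER_KINDS
    if mode == "namespace":
        if not work_namespace:
            raise ValueError("namespace mode needs a work namespace")
        return (
            resource.kind not in CLUSTER_KINDS
            and resource.namespace == work_namespace
        )
    raise ValueError(f"unknown mode: {mode!r}")


def mode_from_env(env=os.environ):
    """Read the mode and work namespace from hypothetical environment variables."""
    return env.get("DEPLOYMENT_MODE", "normal"), env.get("WORK_NAMESPACE")


app = Resource(kind="App", namespace="customer-a")
print(should_reconcile("namespace", app, "customer-a"))  # True
print(should_reconcile("namespace", app, "customer-b"))  # False
print(should_reconcile("cluster", app))                  # False
```

With one namespace-mode deployment per customer plus a single cluster-mode deployment, each resource is owned by exactly one instance, which is what lets the instances run side by side without sharing a queue.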

Anything else you would like to add: Here are some performance test results:

  1. In Normal deployment, with 300 Apps deployed, reconciliation becomes very slow and a new App is only picked up after more than 5 minutes. The deployment is to deploy to another Kubernetes cluster.
  2. I made some changes so that the App CRD supports namespace deployment (https://github.com/showpune/carvel-kapp-controller/tree/cloud-version). With them we still get good performance even with 5000 Apps in 100 namespaces. (Images: current situation; namespace deployment after the change.)

There are similar issues for kpack and cartographer.


Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.

👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help working on this issue.

showpune, Jul 15 '22 09:07

hey @showpune, quick question,

"The deployment is to deploy to another Kubernetes cluster."

are most of your Apps using spec.cluster to deploy to another cluster? (which would also mean it's not a primary concern that X number of kapp-controllers would be hitting the same Kubernetes cluster)

cppforlife, Jul 15 '22 17:07

Hi,

Yes, I use spec.cluster to deploy in my pressure test, and all the Apps hit the same Kubernetes cluster.
I don't think hitting the same cluster is the problem: the TPS to the remote cluster was low when I deployed 250 Apps, yet reconciliation was already blocked, and it normally took more than 5 minutes to reconcile an App. But if I use one kapp-controller deployment per namespace, Apps are reconciled quickly, even when they are deployed into the same target cluster.

I think the root cause is that kapp-controller can't be scaled out because everything shares the same queue. I scaled the CPU up to 2G, but it helped little. And if all customers share the same queue, it is easily blocked by Apps from a namespace with a large number of Apps. The resource quota is not isolated.

showpune, Jul 17 '22 05:07

👍. your assessment aligns with mine.

would you be interested in carrying the work that you did in your fork into a PR to this repo?

cppforlife, Jul 18 '22 17:07