installation of 10 packages concurrently forces GKE control plane autoscaling

Open aaronshurley opened this issue 3 years ago • 3 comments

What steps did you take: Reported by other users: Installed kapp-controller (as part of a larger product, TAP) on a new GKE cluster.

What happened: During the installation, the Kubernetes control plane became unavailable for several minutes. This caused package installs to enter a ReconcileFailed state. Eventually, when the API server became available, packages reconciled again to completion.

What did you expect: The installation works without any control plane unavailability.

Anything else you would like to add:

  • This behavior may occur on newly provisioned clusters that have not yet gone through GKE API server autoscaling (API server sizing in GKE is automatic, non-configurable, and not determined by the size or number of nodes). Once GKE scales up the API server, the current install continues and any subsequent installs succeed without interruption. Could this be improved by adjusting kapp-controller's concurrency config (default is 10; what if we reduced it to 5)?
  • If it's difficult to replicate with minimal components (such as kapp-controller on its own), try larger distributions (such as TAP).

Environment:

  • kapp Controller version: v0.30.0 (latest)

aaronshurley avatar Dec 09 '21 16:12 aaronshurley

I think a logical next step here would be to try to reproduce this with kapp-controller by itself on GKE.

We should also test with TAP to verify any fix works as expected.

Could this be improved by adjusting the kapp-controller's concurrency config (default is 10, what if we reduced it to 5)?

Note that this is currently configurable via a kapp-controller flag: https://github.com/vmware-tanzu/carvel-kapp-controller/blob/1ad808d8909d49f0dff35ee49d285fb5f0e4693f/cmd/main.go#L25
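To make the concurrency question concrete, here is a rough Python sketch of what a concurrency cap does in general: it bounds how many reconcile-like tasks are in flight at once. This is only an illustration, not kapp-controller's implementation; `run_with_limit` and the task callables are hypothetical names made up for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(tasks, concurrency):
    """Run callables with at most `concurrency` in flight.

    Hypothetical helper for illustration only.
    Returns (results in task order, peak number of tasks in flight).
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def wrapped(task):
        nonlocal in_flight, peak
        # Record that one more task is running and track the high-water mark.
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            return task()
        finally:
            with lock:
                in_flight -= 1

    # The pool size is the cap: no more than `concurrency` tasks run at once.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(wrapped, tasks))
    return results, peak


if __name__ == "__main__":
    # Ten stand-in "package installs", capped at 5 concurrent workers.
    installs = [lambda i=i: f"package-{i} reconciled" for i in range(10)]
    results, peak = run_with_limit(installs, concurrency=5)
    print(len(results), "tasks finished; peak in flight:", peak)
```

With a cap of 5, the ten tasks still all complete, but never more than five run at the same time. Whether halving the burst of API requests this way is enough to stay under whatever threshold triggers GKE's control plane scaling is exactly what would need to be tested.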

danielhelfand avatar Dec 10 '21 17:12 danielhelfand

@cppforlife filed a support ticket with GCP and this was the response:

GKE autoscales the control plane; this is not configurable. So the question becomes whether we can find the threshold after which GKE decides to scale (and potentially avoid hitting it).

One potential solution would be to document kapp-controller installation on GKE and advise setting concurrency to a lower value (e.g. 5). This way we could still keep the current default of 10.

danielhelfand avatar Dec 13 '21 17:12 danielhelfand

^ do we know that 5 for example, "fixes" the behaviour?

cppforlife avatar Dec 15 '21 16:12 cppforlife

I believe this is just the accepted behaviour of GKE. We cannot change it, as the control plane nodes are scaled and managed by Google. Anything further, @cppforlife?

neil-hickey avatar Feb 22 '23 20:02 neil-hickey