Tenant Kubernetes control planes depend on user webhooks and API services
This is more of a question than a bug report, although it looks like an upstream issue. You might have run into it already, because the problem is especially painful when you’re running managed control-planes.
Steps to reproduce
- Create a new cluster that uses the Kamaji control-plane.
- Install cert-manager and metrics-server in that cluster.
- Delete all worker nodes in the cluster.
After step 3, the kube-apiserver containers begin to restart continuously, showing errors similar to:
```
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Fri, 27 Jun 2025 03:45:19 +0200
Finished: Fri, 27 Jun 2025 03:45:56 +0200
```
Logs:

```
I0627 01:51:13.519903 1 storage_scheduling.go:111] all system priority classes are created successfully or already exist.
W0627 01:51:13.598365 1 handler_proxy.go:99] no RequestInfo found in the context
E0627 01:51:13.598470 1 controller.go:102] "Unhandled Error" err=<
loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
> logger="UnhandledError"
I0627 01:51:13.639692 1 controller.go:109] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0627 01:51:13.770139 1 handler_proxy.go:99] no RequestInfo found in the context
E0627 01:51:13.778185 1 controller.go:113] "Unhandled Error" err="loading OpenAPI spec for \"v1beta1.metrics.k8s.io\" failed with: Error, could not get list of group versions for APIService" logger="UnhandledError"
I0627 01:51:13.780788 1 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
I0627 01:51:36.743669 1 controller.go:615] quota admission added evaluator for: roles.rbac.authorization.k8s.io
I0627 01:51:37.045507 1 controller.go:615] quota admission added evaluator for: rolebindings.rbac.authorization.k8s.io
I0627 01:52:02.718675 1 controller.go:615] quota admission added evaluator for: serviceaccounts
I0627 01:52:08.733639 1 controller.go:615] quota admission added evaluator for: daemonsets.apps
I0627 01:52:08.748588 1 controller.go:615] quota admission added evaluator for: deployments.apps
W0627 01:52:13.640042 1 handler_proxy.go:99] no RequestInfo found in the context
E0627 01:52:13.640116 1 controller.go:102] "Unhandled Error" err=<
loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
> logger="UnhandledError"
I0627 01:52:13.641696 1 controller.go:109] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0627 01:52:13.787279 1 handler_proxy.go:99] no RequestInfo found in the context
E0627 01:52:13.787336 1 controller.go:113] "Unhandled Error" err="loading OpenAPI spec for \"v1beta1.metrics.k8s.io\" failed with: Error, could not get list of group versions for APIService" logger="UnhandledError"
I0627 01:52:13.788502 1 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
I0627 01:52:18.045205 1 controller.go:615] quota admission added evaluator for: ciliumendpoints.cilium.io
E0627 01:52:38.795461 1 writers.go:123] "Unhandled Error" err="apiserver was unable to write a JSON response: http: Handler timeout" logger="UnhandledError"
W0627 01:52:38.795464 1 dispatcher.go:217] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cozy-cert-manager.svc:443/validate?timeout=30s": context canceled
E0627 01:52:38.795993 1 finisher.go:175] "Unhandled Error" err="FinishRequest: post-timeout activity - time-elapsed: 579.927µs, panicked: false, err: context canceled, panic-reason: <nil>" logger="UnhandledError"
E0627 01:52:38.796591 1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &errors.errorString{s:\"http: Handler timeout\"}: http: Handler timeout" logger="UnhandledError"
E0627 01:52:38.798136 1 writers.go:136] "Unhandled Error" err="apiserver was unable to write a fallback JSON response: http: Handler timeout" logger="UnhandledError"
E0627 01:52:38.799501 1 timeout.go:140] "Post-timeout activity" logger="UnhandledError" timeElapsed="4.545858ms" method="PATCH" path="/apis/cert-manager.io/v1/namespaces/cozy-ingress-nginx/certificates/ingress-nginx-root-cert" result=null
W0627 01:52:38.810270 1 dispatcher.go:217] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cozy-cert-manager.svc:443/validate?timeout=30s": dial tcp 10.95.75.223:443: connect: operation not permitted
```
This puts us in a deadlock situation again:
- The main kube-apiserver depends on aggregated API servers (metrics-server) and admission webhooks (cert-manager).
- Those workloads can’t start without worker nodes.
- Worker nodes can’t be provisioned because the control plane keeps restarting.
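The two kinds of dependency objects involved look roughly like this. The service names and namespaces below are taken from the logs above; the remaining fields are illustrative, not the actual manifests from the cluster:

```yaml
# APIService registered by metrics-server: the aggregation layer proxies
# /apis/metrics.k8s.io/v1beta1 to a pod that can only run on a worker node,
# so the main API server logs 503s while no workers exist.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  service:
    name: metrics-server
    namespace: kube-system   # illustrative; the real namespace may differ
  groupPriorityMinimum: 100
  versionPriority: 100
---
# Validating webhook registered by cert-manager: failurePolicy: Fail means
# the API server rejects matching requests while the webhook pod is down.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: cert-manager-webhook
webhooks:
  - name: webhook.cert-manager.io
    failurePolicy: Fail
    clientConfig:
      service:
        name: cert-manager-webhook
        namespace: cozy-cert-manager
        path: /validate
    rules:
      - apiGroups: ["cert-manager.io"]
        apiVersions: ["v1"]
        operations: ["*"]
        resources: ["*"]
    sideEffects: None
    admissionReviewVersions: ["v1"]
```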
Eventually everything does recover, but until the workers come up the Kubernetes API keeps flapping because kube-apiserver restarts repeatedly.
Questions
- Have you encountered this behaviour before?
- Is there a generic way to avoid or mitigate it?
- Is there a hidden flag or configuration option for kube-apiserver that could help?
I think we hit a chicken-and-egg issue in Kubernetes. I’d like to help figure out how this could be mitigated by engaging with the Kubernetes community.
At first sight, I think the webhook failure policies should be changed to Ignore, but once the cluster can no longer start you can’t apply that change. A temporary workaround could be bypassing dynamic admission with the API server flag `--disable-admission-plugins=MutatingAdmissionWebhook,ValidatingAdmissionWebhook`.
Once the API server is up and running, you could bring up a new node and finally revert the API server to its desired configuration.
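As a sketch, the temporary bypass could look like this in the tenant kube-apiserver pod spec. This is a hypothetical fragment: where exactly extra flags are set depends on how Kamaji renders the control-plane pods:

```yaml
# Temporarily disable dynamic admission so the API server can start
# without the (not yet schedulable) webhook backends.
command:
  - kube-apiserver
  - --disable-admission-plugins=MutatingAdmissionWebhook,ValidatingAdmissionWebhook
  # ...remaining flags unchanged
```

Note that this only bypasses the webhook half of the deadlock: aggregated APIServices such as `v1beta1.metrics.k8s.io` will presumably keep returning 503 until a worker node runs metrics-server, but requests blocked by the unreachable webhooks would go through.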
CLASTIX has customers with autoscaling enabled, meaning control-plane pods are launched first and worker nodes are then scaled up via the CAPI Auto Scaler. So far nobody has hit this issue, but maybe that’s just luck.
> At first sight, I think the webhook failure policies should be changed to Ignore, but once the cluster can no longer start you can’t apply that change.
I think this is wrong, because it would break the user expectation that certain requests are always validated.
One idea is to have two separate API servers: one for user requests and a second for Cluster API and Kamaji needs.
> One idea is to have two separate API servers: one for user requests and a second for Cluster API and Kamaji needs.
I don’t understand how to separate these two, especially considering that if the CAPI Deployment/Pod restarts, you’d end up with the same issue.
It seems to me to be a problem at the upstream level.