kapp-controller
kapp-controller pod crashes with CrashLoopBackOff error and timed out error in its logs
What steps did you take: Deployed kapp-controller with the command below on our Kubernetes cluster running version 1.20.6:
kubectl apply -f https://github.com/vmware-tanzu/carvel-kapp-controller/releases/latest/download/release.yml
What happened: The kapp-controller pod crashes and restarts continuously with reason CrashLoopBackOff. The pod's full log is shared below:
{"level":"info","ts":1642515666.2918875,"logger":"kc.main","msg":"kapp-controller","version":"v0.31.0"}
{"level":"info","ts":1642515666.2919252,"logger":"kc.controller","msg":"start controller"}
{"level":"info","ts":1642515666.2919319,"logger":"kc.controller","msg":"setting up manager"}
I0118 14:21:07.343016 15 request.go:665] Waited for 1.030614398s due to client-side throttling, not priority and fairness, request: GET:https://10.x.x.x:443/apis/rbac.authorization.k8s.io/v1?timeout=32s
{"level":"info","ts":1642515669.9956732,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1642515669.9991837,"logger":"kc.controller","msg":"setting up controller"}
{"level":"info","ts":1642515670.0135179,"logger":"kc.controller","msg":"setting up metrics"}
I0118 14:21:10.622679 15 serving.go:341] Generated self-signed cert (/home/kapp-controller/kc-agg-api-selfsigned-certs/kapp-controller.crt, /home/kapp-controller/kc-agg-api-selfsigned-certs/kapp-controller.key)
I0118 14:21:10.622939 15 apiserver.go:211] Syncing CA certificate with APIServices
I0118 14:21:11.074417 15 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0118 14:21:11.074495 15 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0118 14:21:11.110261 15 secure_serving.go:266] Serving securely on 0.0.0.0:10350
I0118 14:21:11.110342 15 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0118 14:21:11.110355 15 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0118 14:21:11.110396 15 dynamic_serving_content.go:129] "Starting controller" name="serving-cert::/home/kapp-controller/kc-agg-api-selfsigned-certs/kapp-controller.crt::/home/kapp-controller/kc-agg-api-selfsigned-certs/kapp-controller.key"
I0118 14:21:11.110452 15 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0118 14:21:11.110459 15 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0118 14:21:11.110567 15 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0118 14:21:11.110576 15 apf_controller.go:312] Starting API Priority and Fairness config controller
I0118 14:21:11.110593 15 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0118 14:21:11.111424 15 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0118 14:21:11.210814 15 apf_controller.go:317] Running API Priority and Fairness config worker
I0118 14:21:11.210826 15 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0118 14:21:11.210842 15 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0118 14:21:11.211015 15 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
{"level":"error","ts":1642515731.5785398,"logger":"kc.main","msg":"Exited run with error","error":"Starting API server: timed out after 1m0s waiting for api server to become healthy. Check the status by running `kubectl get apiservices v1alpha1.data.packaging.carvel.dev -o yaml`"}
{"level":"error","ts":1642515731.5832264,"logger":"kc.init","msg":"Could not start controller","error":"exit status 1"}
What did you expect: The pod to reach the ready state without any error.
Anything else you would like to add: A similar issue (#328) is mentioned here. To double-check that this is not caused by a disallowed container port in kapp-controller, we changed the targetPort of the kapp-controller pod to one that works for another controller, but we hit the same timeout error. So that root cause, as described in issue #388, is ruled out.
We also raised the resource limits on the controller deployment to 2 CPU and 2Gi of memory, but that did not help either.
Environment:
- kapp Controller version (execute kubectl get deployment -n kapp-controller kapp-controller -o yaml and the annotation is kbld.k14s.io/images): kbld.k14s.io/images
- Kubernetes version (use kubectl version): 1.20.6
Vote on this request
This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.
👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"
We are also happy to receive and review Pull Requests if you want to help working on this issue.
@nayan-mistry - thanks for the bug report!
When I run kubectl apply -f https://github.com/vmware-tanzu/carvel-kapp-controller/releases/latest/download/release.yml against my local minikube it works, so I'm not able to immediately reproduce.
Can you describe the platform provider and kubernetes distribution/flavor (aws, gcp, ...?)
Also per the output you provided, can you share the output of:
kubectl get apiservices v1alpha1.data.packaging.carvel.dev -o yaml
This error often suggests that the kapp-controller process is unable to contact the main Kubernetes control plane server due to network restrictions, or that the Kubernetes control plane is unable to contact kapp-controller's apiserver.
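A rough sketch for narrowing this down, assuming kubectl access to the cluster. The function names are hypothetical, the reason strings are the ones the Kubernetes aggregation layer sets on an APIService's Available condition, and the hints are only approximate guidance, not a definitive diagnosis.

```shell
# Hypothetical helpers; the APIService name is taken from the error message.

# Print the reason of the Available condition on the kapp-controller APIService.
apiservice_reason() {
  kubectl get apiservices v1alpha1.data.packaging.carvel.dev \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].reason}'
}

# Map a reason to a rough hint about where to look next (hints are assumptions).
explain_reason() {
  case "$1" in
    Passed) echo "available" ;;
    MissingEndpoints|EndpointsNotFound) echo "no ready kapp-controller pod behind the service" ;;
    ServiceNotFound) echo "packaging-api service is missing" ;;
    FailedDiscoveryCheck) echo "control plane cannot reach the kapp-controller apiserver" ;;
    *) echo "unrecognized reason: $1" ;;
  esac
}
```

Against a live cluster this would be called as explain_reason "$(apiservice_reason)".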
I have a similar problem (I assume any differences are only because the version is newer in the meantime).
The error does not happen for me when I install tanzu-cluster-essentials on a PUBLIC gcloud gke cluster, it only happens on PRIVATE ones.
For reference, this reproduces the error:
- Create a vpc network "tanzu" with at least one subnet that is also called "tanzu"
- Create a private GKE cluster
gcloud container clusters create tanzu \
--network "projects/wgs-tekton-tools/global/networks/tanzu" \
--no-enable-master-authorized-networks \
--enable-ip-alias \
--enable-private-nodes \
--master-ipv4-cidr 172.16.0.32/28 \
--zone "europe-west3-a" \
--machine-type "c2-standard-4" \
--enable-master-global-access
- Grant the vpc network "tanzu" internet access by creating a Cloud NAT gateway.
--> Verify internet access via k run -i --tty --rm test --image=busybox --restart=Never and then executing e.g.: wget https://github.com
- Follow https://network.tanzu.vmware.com/products/tanzu-cluster-essentials
- The install.sh then produces the following output:
5:42:15PM: ---- waiting on 1 changes [16/17 done] ----
5:42:50PM: ongoing: reconcile apiservice/v1alpha1.data.packaging.carvel.dev (apiregistration.k8s.io/v1) cluster
5:42:50PM: ^ Condition Available is not True (False)
5:43:16PM: ---- waiting on 1 changes [16/17 done] ----
5:43:51PM: ongoing: reconcile apiservice/v1alpha1.data.packaging.carvel.dev (apiregistration.k8s.io/v1) cluster
5:43:51PM: ^ Condition Available is not True (False)
kapp: Error: Timed out waiting after 15m0s
Around the same time, the logs from the kapp-controller pod:
{"level":"info","ts":1648913835.9361649,"logger":"kc.main","msg":"kapp-controller","version":"0.30.0"}
{"level":"info","ts":1648913835.936192,"logger":"kc.init","msg":"start init"}
{"level":"info","ts":1648913835.936211,"logger":"kc.init","msg":"starting zombie reaper"}
{"level":"info","ts":1648913836.005206,"logger":"kc.main","msg":"kapp-controller","version":"0.30.0"}
{"level":"info","ts":1648913836.0052266,"logger":"kc.controller","msg":"start controller"}
{"level":"info","ts":1648913836.0052297,"logger":"kc.controller","msg":"setting up manager"}
I0402 15:37:17.055809 15 request.go:645] Throttling request took 1.035127713s, request: GET:https://10.64.0.1:443/apis/coordination.k8s.io/v1?timeout=32s
{"level":"info","ts":1648913840.008808,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1648913840.0089943,"logger":"kc.controller","msg":"setting up controller"}
{"level":"info","ts":1648913840.0168705,"logger":"kc.controller","msg":"setting up metrics"}
I0402 15:37:20.270452 15 serving.go:325] Generated self-signed cert (/home/kapp-controller/kc-agg-api-selfsigned-certs/kapp-controller.crt, /home/kapp-controller/kc-agg-api-selfsigned-certs/kapp-controller.key)
I0402 15:37:20.270616 15 apiserver.go:190] Syncing CA certificate with APIServices
I0402 15:37:20.701007 15 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0402 15:37:20.701022 15 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0402 15:37:20.736361 15 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0402 15:37:20.736377 15 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0402 15:37:20.736392 15 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0402 15:37:20.736396 15 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0402 15:37:20.736405 15 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0402 15:37:20.736409 15 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0402 15:37:20.736665 15 secure_serving.go:202] Serving securely on [::]:10350
I0402 15:37:20.736674 15 dynamic_serving_content.go:130] Starting serving-cert::/home/kapp-controller/kc-agg-api-selfsigned-certs/kapp-controller.crt::/home/kapp-controller/kc-agg-api-selfsigned-certs/kapp-controller.key
I0402 15:37:20.736774 15 tlsconfig.go:240] Starting DynamicServingCertificateController
I0402 15:37:20.836734 15 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0402 15:37:20.836736 15 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0402 15:37:20.836746 15 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
{"level":"error","ts":1648913901.1874132,"logger":"kc.main","msg":"Exited run with error","error":"Starting API server: timed out after 1m0s waiting for api server to become healthy","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tgithub.com/go-logr/[email protected]/zapr.go:132\nmain.main\n\t./main.go:44\nruntime.main\n\truntime/proc.go:255"}
{"level":"error","ts":1648913901.1902664,"logger":"kc.init","msg":"Could not start controller","error":"exit status 1","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tgithub.com/go-logr/[email protected]/zapr.go:132\ngithub.com/vmware-tanzu/carvel-kapp-controller/cmd/controllerinit.Run\n\tgithub.com/vmware-tanzu/carvel-kapp-controller/cmd/controllerinit/run.go:38\nmain.main\n\t./main.go:51\nruntime.main\n\truntime/proc.go:255"}
kubectl get apiservices v1alpha1.data.packaging.carvel.dev -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    kapp.k14s.io/identity: v1;/apiregistration.k8s.io/APIService/v1alpha1.data.packaging.carvel.dev;apiregistration.k8s.io/v1
    kapp.k14s.io/original: '{"apiVersion":"apiregistration.k8s.io/v1","kind":"APIService","metadata":{"labels":{"kapp.k14s.io/app":"1648913314565128660","kapp.k14s.io/association":"v1.8f4f97a034b82fceb38c3ec8217955e0"},"name":"v1alpha1.data.packaging.carvel.dev"},"spec":{"group":"data.packaging.carvel.dev","groupPriorityMinimum":100,"service":{"name":"packaging-api","namespace":"kapp-controller"},"version":"v1alpha1","versionPriority":100}}'
    kapp.k14s.io/original-diff-md5: f3adcd1607ee7b415a869b7180ed4358
  creationTimestamp: "2022-04-02T15:28:37Z"
  labels:
    kapp.k14s.io/app: "1648913314565128660"
    kapp.k14s.io/association: v1.8f4f97a034b82fceb38c3ec8217955e0
  name: v1alpha1.data.packaging.carvel.dev
  resourceVersion: "5441"
  uid: 1617fc19-7e4a-44b8-8bf2-d2fab6111b40
spec:
caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURjRENDQWxpZ0F3SUJBZ0lCQWpBTkJna3Foa2lHOXcwQkFRc0ZBREFvTVNZd0pBWURWUVFEREIxcllYQncKTFdOdmJuUnliMnhzWlhJdFkyRkFNVFkwT0RreE16ZzBNREFlRncweU1qQTBNREl4TkRNM01qQmFGdzB5TXpBMApNREl4TkRNM01qQmFNQ1V4SXpBaEJnTlZCQU1NR210aGNIQXRZMjl1ZEhKdmJHeGxja0F4TmpRNE9URXpPRFF3Ck1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0NBUThBTUlJQkNnS0NBUUVBdGxGVTA0Sis1U0llc3NvTEUreTkKckhHRVFOSFA3elFRdlVsL2pLd2sxUlkzSW5GVGdkdXhIdUQrOEZrVjN1NHc2VitYNkRyK2d0YlRuNk56VDMvMgpuVHNWeU5KQ05DVk1ualFiYmM1Vk9wTmdiLyt5WWpsSGJMdGpVU2pTZzhKcG1Jd1RXNmpBVUxqdngvclA3aUljCnFGV0Rua1FFVlRHZmpiRWxtV0F6Yk11emxNTXE0OWgvdDJpUk9kQkV6ekhEcS9VZ2lnc3h2L2tDdXRpd2hyOFcKbWIrTlE1bWZOWFVpSUhXR2Q5Z2VYbDE0WkdKV1o3WXFFaGRXWjN5Ui9GbWJ5TnpzTE1nVEFCdEZnenNBVnM1RwpKR0I5RStLRFdLOE5sZ21GTkJmZHVQRGV4cC8wNkFUOW9ZeXpGWDlIK0gvWTNMMkFFck04UW1IeUdZY2tXN0s2ClJ3SURBUUFCbzRHbk1JR2tNQTRHQTFVZER3RUIvd1FFQXdJRm9EQVRCZ05WSFNVRUREQUtCZ2dyQmdFRkJRY0QKQVRBTUJnTlZIUk1CQWY4RUFqQUFNQjhHQTFVZEl3UVlNQmFBRkN3bVBOT2UyWm4zc0ExQjkrN1IreUMxSDVtOQpNRTRHQTFVZEVRUkhNRVdDRDJ0aGNIQXRZMjl1ZEhKdmJHeGxjb0loY0dGamEyRm5hVzVuTFdGd2FTNXJZWEJ3CkxXTnZiblJ5YjJ4c1pYSXVjM1pqZ2dsc2IyTmhiR2h2YzNTSEJIOEFBQUV3RFFZSktvWklodmNOQVFFTEJRQUQKZ2dFQkFLemQxK0NiYTJtU3lTUnZEeCs4TGpPOUtYZXBDV0NTeWdWMEUwWHgyK1ZVamNPS21YdlhsN0l4UzMxcgpZQUpzMTI1VjlRRTB3ZHk3MVJuSlNEVmZsdk9XV3l4YnJWT0dtZUdrbmtWVTNKdk4xN1JRZ0pQcGJxbXVJbjJwCmE0WFp1TCt2RTFoVjAxdXVmRExvU0dIR2VqNkRoQkZqRjBtc2JsOGJDcUtQYzFBeUdHdm90bDZrQklpMmY0YkIKck5CSWgzL1pCc0d3WHR0RmZvUStyN0NHaXRwL3dnN24wT2RORmwwazUyWkdyYm5wdDRkY1JJdXhkVklqMGNmRApBNFZHSi91VktLVHdQN0FqYUhEOHBTNXNyeDF3RCtrd1lua1RXUlJrdDdmdkpsby9wdjhsRzFadThLR2pLNExkCmFUQzlUWWxSY0dUY00xWGI1b2h3d29HSFl5bz0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQotLS0tLUJFR0lOIENFUlRJRklDQVRFLS0tLS0KTUlJRERUQ0NBZldnQXdJQkFnSUJBVEFOQmdrcWhraUc5dzBCQVFzRkFEQW9NU1l3SkFZRFZRUUREQjFyWVhCdwpMV052Ym5SeWIyeHNaWEl0WTJGQU1UWTBPRGt4TXpnME1EQWVGdzB5TWpBME1ESXhORE0zTWpCYUZ3MHlNekEwCk1ESXhORE0zTWpCYU1DZ3hKakFrQmdOVkJBTU1IV3RoY0hBdFkyOXVkSEp2Ykd4bGNpMWpZVUF4TmpRNE9URXoKT0RRd01JSUJJakFOQmdrcW
hraUc5dzBCQVFFRkFBT0NBUThBTUlJQkNnS0NBUUVBM2RJTVplL3d4WGFOVE9NRQpINnRScFNiczNPazZ6SVVKbnF6NDhjOVcwak84angzR2tYa0x1aitVbUZ2NGIvcmFDWjBIMktYaFNPYnY1T0RkCmdhK3k2T2lyUFZ4bDBIRm1IaTI5SS9WQ1l0bkZkcUt2U2ZCdVM1ZGRmUDl2L1lSQXcrS3ZGSG92ZUxkVUkrYnYKQmt4clZnUGQ5SWo1Wk9hWHN6Y3cycS94eVBNb2tlMEdzbEFOTmxRdWdnY0pTM0J1Uy9wYnkwQ1hqS2E0dEpleQpRTHRwc0plWDVFVnJDVlMrOE9uSjlmSzVSdDZBOGNWWXErdDJUUlJtU0lkeUVzRUp5czhZTGZEa1RDRzFWVVJ3CnM5SXNMQlh5YUE1RnhtSExnMCs1Q0xMNk1YZWIyaTRINmEwVVNBODVxeWIzNE1YUDRjQkhjaDhhbFJibnk5eWgKOHRCeVp3SURBUUFCbzBJd1FEQU9CZ05WSFE4QkFmOEVCQU1DQXFRd0R3WURWUjBUQVFIL0JBVXdBd0VCL3pBZApCZ05WSFE0RUZnUVVMQ1k4MDU3Wm1mZXdEVUgzN3RIN0lMVWZtYjB3RFFZSktvWklodmNOQVFFTEJRQURnZ0VCCkFEdlc5MlE5bTlhU2RXcFpLMDI1REVwbjRDZFNLcHdrbHorVlNDRGczYTlrRVBEb2RIaVFQbUgvNnlleTIzMEsKczBmQ3pmdEJPSVlsaTVYVzJteE9lTzUyRjY1M1AwRi8rdHN0NENjSkdjZ1RqWEJJSWVPbFRsNlpVcHhQdFVBNQo4ZkFOdmppd2pSZE91ZmJJaXpiTGVTRVhpaFNZOHdhOU5oZmJZcDdGWnRpLzYyRnFrLzMyR1NSMmE2a1h4N2xGCmpySHE5azNUMSs2a0ppcHJ0T1ZBOUpCWkdxSjBwa2VDZzRleVkyL1p0d3NCZXViVE5XUjRPOE9VM3lLZmFvL3EKbXp3ZWw3WHNkMkRibHZtRVd5NXhITm04enpJN1dPUjJ0Q05sR2dxRnVTMVdISjVsVnpOeDN1NFNuVDdscEhwRQpYUTlrZWFHdmNncUxwL2NkekEvZnA0WT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
group: data.packaging.carvel.dev
groupPriorityMinimum: 100
service:
name: packaging-api
namespace: kapp-controller
port: 443
version: v1alpha1
versionPriority: 100
status:
  conditions:
  - lastTransitionTime: "2022-04-02T15:28:37Z"
    message: endpoints for service/packaging-api in "kapp-controller" have no addresses
      with port name ""
    reason: MissingEndpoints
    status: "False"
    type: Available
Running kubectl apply -f https://github.com/vmware-tanzu/carvel-kapp-controller/releases/latest/download/release.yml after the failed installation produces similar output:
{"level":"info","ts":1648914482.3227222,"logger":"kc","msg":"Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file\n"}
{"level":"info","ts":1648914482.3231158,"logger":"kc","msg":"Serving securely on [::]:10350\n"}
{"level":"info","ts":1648914482.323177,"logger":"kc","msg":"Starting DynamicServingCertificateController"}
{"level":"info","ts":1648914482.3232036,"logger":"kc","msg":"Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file\n"}
{"level":"info","ts":1648914482.3232286,"logger":"kc","msg":"Starting API Priority and Fairness config controller\n"}
{"level":"info","ts":1648914482.422962,"logger":"kc","msg":"Caches are synced for RequestHeaderAuthRequestController \n"}
{"level":"info","ts":1648914482.4240189,"logger":"kc","msg":"Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file \n"}
{"level":"info","ts":1648914482.4240532,"logger":"kc","msg":"Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file \n"}
{"level":"info","ts":1648914482.424103,"logger":"kc","msg":"Running API Priority and Fairness config worker\n"}
{"level":"error","ts":1648914542.7522597,"logger":"kc.main","msg":"Exited run with error","error":"Starting API server: timed out after 1m0s waiting for api server to become healthy. Check the status by running `kubectl get apiservices v1alpha1.data.packaging.carvel.dev -o yaml`","stacktrace":"runtime.main\n\truntime/proc.go:255"}
{"level":"error","ts":1648914542.7551491,"logger":"kc.init","msg":"Could not start controller","error":"exit status 1","stacktrace":"main.main\n\tgithub.com/vmware-tanzu/carvel-kapp-controller/cmd/main.go:53\nruntime.main\n\truntime/proc.go:255"}
which also results in a CrashLoopBackOff.
It would be really great if tanzu could be installed on a private GKE cluster, since otherwise each Kubernetes node consumes a public IP (and there is only a very limited set of public IPs available)!
@gerrnot thanks for the repro steps, I'll take a look
@gerrnot your issue turns out to be very similar to https://github.com/cert-manager/cert-manager/issues/2109
If you create a firewall rule allowing ingress from the control plane to the workers on 10350 this will be fixed:
gcloud compute --project=wgs-tekton-tools firewall-rules create wgs-tekton-tools-kctrl --direction=INGRESS --priority=1000 --network=tanzu --action=ALLOW --rules=tcp:10350 --source-ranges=172.16.0.32/28 --target-tags=gke-tanzu-cf831963-node
You'll need to set gke-tanzu-cf831963-node to whatever the network tag for your node pool VMs is.
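To look up the values to plug into that command, something like the following may help. It is only a sketch: it assumes the cluster name tanzu and zone europe-west3-a from the repro steps, and that the node VM names start with gke-tanzu. The function names and the rule name allow-kapp-controller-10350 are made up for illustration. The last function prints the command rather than running it.

```shell
# List the network tags on the node VMs (assumes instance names start with gke-tanzu).
list_node_tags() {
  gcloud compute instances list --filter='name~^gke-tanzu' --format='value(tags.items)'
}

# Read the control plane CIDR of the private cluster (cluster name and zone are assumptions).
master_cidr() {
  gcloud container clusters describe tanzu --zone europe-west3-a \
    --format='value(privateClusterConfig.masterIpv4CidrBlock)'
}

# Print (do not run) the firewall rule command for a given network, CIDR and node tag.
# The rule name is hypothetical.
firewall_rule_cmd() {
  network="$1"; cidr="$2"; tag="$3"
  echo "gcloud compute firewall-rules create allow-kapp-controller-10350 --direction=INGRESS --priority=1000 --network=${network} --action=ALLOW --rules=tcp:10350 --source-ranges=${cidr} --target-tags=${tag}"
}
```

For example, firewall_rule_cmd tanzu 172.16.0.32/28 gke-tanzu-cf831963-node prints a command equivalent to the one above, apart from the rule name and the omitted --project flag.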
@benmoss : Thanks a lot! As the OP also clicked the celebrate button, I think this can be closed.
I don't think that was @nayan-mistry unfortunately 😄 . We'll leave it open and see if they respond with more details.