istio-csr
istio-csr pods hung and were unable to handle requests, causing cluster-wide downtime for new pods and pods with expired certificates.
We are running cert-manager-istio-csr with 2 pods.
One pod was hung at:
2022-03-24T18:59:47.311227Z info klog tls-provider "msg"="renewing serving certificate"
2022-03-24T18:59:47.556644Z info klog tls-provider "msg"="serving certificate ready"
2022-03-24T18:59:47.556869Z info klog tls-provider "msg"="fetched new serving certificate" "expiry-time"="2022-03-25T18:59:47Z"
2022-03-24T18:59:47.556722Z info spiffe Added 1 certs to trust domain cluster.local in peer cert verifier
2022-03-24T18:59:47.556896Z info klog tls-provider "msg"="waiting to renew certificate" "renewal-time"="2022-03-25T10:59:47.185630849Z"
2022-03-25T06:54:05.842512Z info klog manager/events "msg"="Normal" "message"="cert-manager-istio-csr-689dfdb989-9k9wj_113e526f-7327-43f1-a2c2-81715b0a5573 became leader" "object"={"kind":"ConfigMap","namespace":"istio-system","name":"istio-csr","uid":"8771468c-c206-4b9d-bf89-11f4f55e75b6","apiVersion":"v1","resourceVersion":"732068782"} "reason"="LeaderElection"
2022-03-25T06:54:05.842772Z info klog manager/controller/configmap "msg"="Starting EventSource" "reconciler group"="" "reconciler kind"="ConfigMap" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{}}}
2022-03-25T06:54:05.842630Z info klog manager/controller/configmap "msg"="Starting EventSource" "reconciler group"="" "reconciler kind"="ConfigMap" "source"={"Type":{"kind":"ConfigMap","apiVersion":"v1","metadata":{"creationTimestamp":null}}}
2022-03-25T06:54:05.842545Z info klog manager/events "msg"="Normal" "message"="cert-manager-istio-csr-689dfdb989-9k9wj_113e526f-7327-43f1-a2c2-81715b0a5573 became leader" "object"={"kind":"Lease","namespace":"istio-system","name":"istio-csr","uid":"b6ba9c83-a619-43bb-8799-05a45f92781a","apiVersion":"coordination.k8s.io/v1","resourceVersion":"732068783"} "reason"="LeaderElection"
2022-03-25T06:54:05.842346Z info klog successfully acquired lease istio-system/istio-csr
And the other pod, the actual leader, hung at:
2022-03-24T18:59:47.312223Z info klog tls-provider "msg"="renewing serving certificate"
2022-03-24T18:59:47.760825Z info klog tls-provider "msg"="fetched new serving certificate" "expiry-time"="2022-03-25T18:59:47Z"
2022-03-24T18:59:47.760845Z info klog tls-provider "msg"="waiting to renew certificate" "renewal-time"="2022-03-25T10:59:47.253613531Z"
2022-03-24T18:59:47.760693Z info spiffe Added 1 certs to trust domain cluster.local in peer cert verifier
2022-03-24T18:59:47.760627Z info klog tls-provider "msg"="serving certificate ready"
2022-03-25T03:03:09.770135Z error klog error retrieving resource lock istio-system/istio-csr: etcdserver: request timed out
Because of this, no CSR requests were handled, so all the pods were unable to come up. Pod error:
{"level":"warn","time":"2022-03-25T06:50:28.724445Z","scope":"ca","msg":"ca request failed, starting attempt 1 in 101.313121ms"}
{"level":"warn","time":"2022-03-25T06:50:28.826778Z","scope":"ca","msg":"ca request failed, starting attempt 2 in 218.728544ms"}
{"level":"warn","time":"2022-03-25T06:50:29.046226Z","scope":"ca","msg":"ca request failed, starting attempt 3 in 365.315547ms"}
{"level":"warn","time":"2022-03-25T06:50:29.411679Z","scope":"ca","msg":"ca request failed, starting attempt 4 in 834.988896ms"}
{"level":"warn","time":"2022-03-25T06:50:30.247763Z","scope":"sds","msg":"failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2022-03-25T06:50:28Z is after 2022-03-25T03:59:47Z\""}
So for almost 4 hours istio-csr was not serving any requests. A manual rolling restart of the deployment resolved the issue.
After the rolling restart, new error messages started to appear:
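For reference, the rolling restart was just a plain rollout restart against the deployment (a sketch of the command; the deployment name is inferred from the pod names above):
kubectl -n istio-system rollout restart deployment cert-manager-istio-csr
kubectl -n istio-system rollout status deployment cert-manager-istio-csr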
k logs -n istio-system cert-manager-istio-csr-5858ff84b-hpvwd
2022-03-25T06:53:56.765755Z info klog "msg"="-----------------------------------------------------------------"
2022-03-25T06:53:56.765787Z info klog "msg"="Using root CAs from file: /var/run/secrets/istio-csr/tls.crt"
2022-03-25T06:53:56.765799Z info klog "msg"="-----------------------------------------------------------------"
2022-03-25T06:53:56.795849Z info klog attempting to acquire leader lease istio-system/istio-csr...
2022-03-25T06:53:56.795927Z info klog tls-provider/root-ca-watcher "msg"="loading root CAs bundle" "file"="/var/run/secrets/istio-csr/tls.crt"
2022-03-25T06:53:56.795987Z info klog manager "msg"="starting metrics server" "path"="/metrics"
2022-03-25T06:53:56.796043Z info klog tls-provider/root-ca-watcher "msg"="updating root CAs from file" "file"="/var/run/secrets/istio-csr/tls.crt"
2022-03-25T06:53:57.180152Z info klog tls-provider "msg"="serving certificate ready"
2022-03-25T06:53:57.180217Z info spiffe Added 1 certs to trust domain cluster.local in peer cert verifier
2022-03-25T06:53:57.180369Z info klog tls-provider "msg"="fetched initial serving certificate"
2022-03-25T06:53:57.180411Z info klog tls-provider "msg"="waiting to renew certificate" "renewal-time"="2022-03-25T22:53:57.060128122Z"
2022-03-25T06:53:57.297049Z info klog grpc-server "msg"="grpc serving" "serving-addr"="0.0.0.0:6443" "address"="[::]:6443"
2022-03-25T06:54:19.315736Z error klog Failed to update lock: Operation cannot be fulfilled on configmaps "istio-csr": the object has been modified; please apply your changes to the latest version and try again
2022-03-25T09:23:41.509654Z info klog Waited for 1.183671823s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/cert-manager.io/v1/namespaces/istio-system/certificaterequests/istio-csr-sdtmf
2022-03-25T10:10:02.886473Z error klog grpc-server "msg"="failed to authenticate request" "error"="failed to validate the JWT from cluster \"loadus1-eks\": the service account authentication returns an error: [invalid bearer token, rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /etc/kubernetes/key-server/pipe.sock: connect: no such file or directory\", Post \"https://127.0.0.1:21362/authenticate?timeout=30s\": dial tcp 127.0.0.1:21362: connect: connection refused]" "serving-addr"="0.0.0.0:6443"
2022-03-25T10:10:02.990037Z error klog grpc-server "msg"="failed to authenticate request" "error"="failed to validate the JWT from cluster \"loadus1-eks\": the service account authentication returns an error: [invalid bearer token, rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /etc/kubernetes/key-server/pipe.sock: connect: no such file or directory\", Post \"https://127.0.0.1:21362/authenticate?timeout=30s\": dial tcp 127.0.0.1:21362: connect: connection refused]" "serving-addr"="0.0.0.0:6443"
Hi @anannaya,
Can you please share the istio-csr version and configuration you used? Sharing your kube and Istio version will also help.
one pod was hung at
Can you please set the logging to debug (--v=5). It is not immediately clear to me that istio-csr is actually hanging here.
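For example (a sketch; the exact way to set this depends on how you deployed istio-csr, and the Helm chart should also expose a log-level value), you can add the flag directly to the container args:
kubectl -n istio-system edit deployment cert-manager-istio-csr
# then add "--v=5" to the cert-manager-istio-csr container's args and let the pods roll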
etcdserver: request timed out
This suggests that your etcd is under-resourced. Are there other system components experiencing the same errors? What are the logs of the API server?
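On a managed control plane you cannot look at etcd directly, but the API server's readiness endpoint includes an etcd check, which is a quick way to confirm whether the API server itself is seeing etcd trouble (sketch):
kubectl get --raw='/readyz?verbose'
# a failing "etcd" entry in the output points at control-plane/etcd problems rather than at istio-csr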
istio-csr version --> 0.3.0, Kubernetes version --> v1.19.13-eks-8df270
This suggests that your etcd is under-resourced. Are there other system components experiencing the same errors? What are the logs of the API server? --> We use AWS EKS. I totally agree there might have been intermittent control-plane network issues, but shouldn't there be a retry? Or the other istio-csr pod should have taken over. kubectl commands were working, and metrics and everything else were working as usual.
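For what it's worth, checking which pod holds the leader-election lock is straightforward (a sketch; the resource names are taken from the logs above):
kubectl -n istio-system get lease istio-csr -o yaml
kubectl -n istio-system get configmap istio-csr -o yaml
# holderIdentity (Lease) or the control-plane.alpha.kubernetes.io/leader annotation (ConfigMap) shows the current owner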
Now I have upgraded istio-csr to 0.4.0, with log level 5.
Is istio-csr used in production by anyone yet?
Well, we were thinking about putting it in production until we saw this post. Was there any further resolution since March?