consul-k8s
x509: Certificate expired issue when resuming paused k8s clusters
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
- Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
- If you are interested in working on this issue or have submitted a pull request, please leave a comment.
Overview of the Issue
After installing Consul with the Helm chart and upgrading it via Helm, the following error starts to appear: x509: certificate has expired or is not yet valid: current time 2021-11-09T05:37:48Z is after 2021-11-08T19:40:44Z
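The two timestamps in that message are the verifier's current time and the certificate's expiry time, so the message itself shows how stale the certificate was. A minimal sketch, assuming the message keeps the wording quoted above (`expired_for` is a hypothetical helper, not part of Consul or consul-k8s):

```python
import re
from datetime import datetime, timedelta

# Timestamp pattern as it appears in the error quoted in this issue.
_TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"
_EXPIRED_RE = re.compile(
    rf"current time (?P<now>{_TS}) is after (?P<not_after>{_TS})"
)


def expired_for(message: str) -> timedelta:
    """Return how long the certificate had been expired when the error was raised."""
    match = _EXPIRED_RE.search(message)
    if match is None:
        raise ValueError("message does not look like an x509 expiry error")
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    now = datetime.strptime(match["now"], fmt)
    not_after = datetime.strptime(match["not_after"], fmt)
    return now - not_after


error = (
    "x509: certificate has expired or is not yet valid: "
    "current time 2021-11-09T05:37:48Z is after 2021-11-08T19:40:44Z"
)
print(expired_for(error))  # 9:57:04
```

For the error reported here, the certificate had been expired for almost ten hours, which points at a missed renewal rather than a small clock skew.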
Reproduction Steps
Consul Version : 1.10.2
Install Consul, then upgrade the Helm release. The error shows up after some time.
Expected behavior
Consul works without showing this error.
Additional context
This error is related to the earlier issue #808, which @lkysow fixed. The two issues pop up regularly, one after the other.
Hi, in our environment we seem to have similar issues. As a workaround, we recreate the consul-webhook-cert-manager pod. This is quite important to fix: if the API server fails to authenticate to the Consul admission controllers (consul-connect-injector or consul-controller), it can impact entire deployments, including ones that are not meshed and are completely independent of Consul. Consider the "consul-connect-injector.consul.hashicorp.com" mutating webhook config:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    meta.helm.sh/release-name: consul
    meta.helm.sh/release-namespace: consul
  creationTimestamp: "2021-08-06T08:42:51Z"
  generation: 21
  labels:
    app: consul
    app.kubernetes.io/managed-by: Helm
    chart: consul-helm
    heritage: Helm
    release: consul
  name: consul-connect-injector-cfg
  resourceVersion: "112996741"
  uid: 6571f50e-ea63-45a3-a1e0-6b6b1ea6a21f
webhooks:
- admissionReviewVersions:
  - v1beta1
  - v1
  clientConfig:
    caBundle: redacted
    service:
      name: consul-connect-injector-svc
      namespace: consul
      path: /mutate
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: consul-connect-injector.consul.hashicorp.com
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kube-system
      - local-path-storage
    - key: control-plane
      operator: DoesNotExist
  objectSelector:
    matchExpressions:
    - key: app
      operator: NotIn
      values:
      - consul
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 10
Each request to create pods is sent to the consul-connect-injector admission controller, and if the API server fails to authenticate because of a bad SSL certificate, the whole operation fails and the pod is not created.
Currently we are running 1.10.3+ent, but the same issues occurred in earlier versions as well.
Hi @andriktr or @amit106679, thanks for the feedback. Would it be possible to provide the confl.yaml so we can understand how you are deploying Consul K8s? Also, I assume you are able to consistently reproduce the behavior?
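To make the blast radius concrete, here is a simplified model of how the selectors and `failurePolicy: Fail` in the configuration above combine. It is only a sketch of the matching rules, not the API server's implementation; `Expr`, `matches`, `admit_pod_create`, and the `payments` namespace are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class Expr:
    key: str
    operator: str  # only "NotIn" and "DoesNotExist" appear in the config above
    values: tuple = ()


def matches(labels: dict, exprs: list) -> bool:
    """True if every expression holds for the labels (expressions are ANDed)."""
    for e in exprs:
        if e.operator == "NotIn":
            # A missing key also satisfies NotIn.
            if labels.get(e.key) in e.values:
                return False
        elif e.operator == "DoesNotExist":
            if e.key in labels:
                return False
        else:
            raise ValueError(f"operator not modelled: {e.operator}")
    return True


# Selectors copied from the webhook configuration above.
NAMESPACE_SELECTOR = [
    Expr("kubernetes.io/metadata.name", "NotIn", ("kube-system", "local-path-storage")),
    Expr("control-plane", "DoesNotExist"),
]
OBJECT_SELECTOR = [Expr("app", "NotIn", ("consul",))]


def admit_pod_create(ns_labels, pod_labels, webhook_call_ok, failure_policy="Fail"):
    """Simplified outcome of a pod CREATE request with respect to this one webhook."""
    selected = matches(ns_labels, NAMESPACE_SELECTOR) and matches(pod_labels, OBJECT_SELECTOR)
    if not selected:
        return "allowed (webhook not called)"
    if webhook_call_ok:
        return "decided by webhook"
    # A TLS failure means the webhook call itself failed.
    return "rejected" if failure_policy == "Fail" else "allowed (failure ignored)"


# An unrelated, non-meshed app while the webhook certificate is invalid:
print(admit_pod_create({"kubernetes.io/metadata.name": "payments"}, {"app": "billing"}, False))
# A pod in kube-system never reaches the webhook:
print(admit_pod_create({"kubernetes.io/metadata.name": "kube-system"}, {}, False))
```

Under this model, only pods in the excluded namespaces or pods labelled `app: consul` are created regardless of the webhook's certificate state; every other pod creation depends on the call to the injector succeeding.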
Could you also get us the logs from the webhook cert manager?
Hi,
I'm not sure how easy it is to reproduce the behavior, because it's still unclear when this happens. Here is how the 3-day log of the webhook cert manager looks:
The last time we saw the issue was at 11.15 11:46 (according to the log timestamp), and at about 11.15 11:50 the webhook cert manager was restarted (you can see in the log above that it starts to rotate/update certs).
Here is the log of the consul-connect-injector admission controller at about the same time:
As you can see, until I restarted the webhook cert manager pod, consul-connect-injector drops connections:
2021/11/15 11:46:45 http: TLS handshake error from 10.162.216.119:45722: remote error: tls: bad certificate
2021/11/15 11:46:45 http: TLS handshake error from 10.162.216.119:45738: remote error: tls: bad certificate
2021/11/15 11:46:50 http: TLS handshake error from 10.162.216.119:46030: remote error: tls: bad certificate
2021/11/15 11:46:51 http: TLS handshake error from 10.162.216.119:46104: remote error: tls: bad certificate
2021/11/15 11:46:52 http: TLS handshake error from 10.162.216.119:46126: remote error: tls: bad certificate
2021/11/15 11:47:36 http: TLS handshake error from 10.162.216.119:48430: remote error: tls: bad certificate
During this time (11:46 - 11:50) we created a deployment (not related to Consul), and the pods for this deployment were not created because of the consul-connect-injector SSL error.
A few days earlier we saw similar SSL issues on our dev cluster when we tried to patch a Consul ServiceDefaults resource, and the operation failed with:
cannot patch "k8s-test-app-dev" with kind ServiceDefaults: Internal error occurred: failed calling webhook "mutate-servicedefaults.consul.hashicorp.com": Post "https://consul-controller-webhook.consul.svc:443/mutate-v1alpha1-servicedefaults?timeout=10s": x509: certificate signed by unknown authority
Also attaching our helm values.yaml
values.txt
I saw this on a minikube cluster I restarted. Different logs though:
2021-12-01T06:46:58.601Z ERROR controller.serviceintentions Reconciler error {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceIntentions", "name": "backend", "namespace": "default", "error": "Internal error occurred: failed calling webhook \"mutate-serviceintentions.consul.hashicorp.com\": failed to call webhook: Post \"https://consul-controller-webhook.default.svc:443/mutate-v1alpha1-serviceintentions?timeout=10s\": x509: certificate has expired or is not yet valid: current time 2021-12-01T06:46:58Z is after 2021-12-01T02:05:16Z"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2021-11-30T02:05:16.511Z [INFO] Updated certificate bundle received for consul-connect-injector-cfg; Updating webhook certs.
2021-11-30T02:05:16.595Z [INFO] Updated certificate bundle received for consul-controller-mutating-webhook-configuration; Updating webhook certs.
2021-11-30T02:05:16.910Z [INFO] Updating secret with new certificate: mutatingwebhookconfig=consul-controller-mutating-webhook-configuration secret=consul-controller-webhook-cert secretNS=default
2021-11-30T02:05:16.911Z [INFO] Updating secret with new certificate: mutatingwebhookconfig=consul-connect-injector-cfg secret=consul-connect-inject-webhook-cert secretNS=default
2021-11-30T02:05:16.933Z [INFO] Updating webhook configuration with new CA: mutatingwebhookconfig=consul-controller-mutating-webhook-configuration secret=consul-controller-webhook-cert secretNS=default
2021-11-30T02:05:16.933Z [INFO] Updating webhook configuration with new CA: mutatingwebhookconfig=consul-connect-injector-cfg secret=consul-connect-inject-webhook-cert secretNS=default
2021-12-01T03:10:28.001Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:10:28.004Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:11:29.160Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:11:29.161Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:12:30.289Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:12:30.372Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:13:31.322Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:13:31.409Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:14:32.426Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:14:32.442Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:15:33.467Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:15:33.471Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:16:34.489Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:16:34.506Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:17:35.444Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:17:35.631Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:18:36.462Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:18:36.672Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)
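Putting the timestamps from these logs side by side gives a rough timeline. The sketch below only does the arithmetic; the timestamps are copied from the log lines above with milliseconds dropped, and the 24-hour lifetime is inferred from these logs alone, not read from the certificate or from the consul-k8s documentation.

```python
from datetime import datetime, timedelta


def ts(value: str) -> datetime:
    """Parse a UTC timestamp in the format used by the logs above."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


last_rotation = ts("2021-11-30T02:05:16Z")          # "Updating secret with new certificate"
not_after = ts("2021-12-01T02:05:16Z")              # expiry time quoted in the x509 error
first_reconcile_error = ts("2021-12-01T03:10:28Z")  # first "failed to reconcile certificates"
webhook_failure = ts("2021-12-01T06:46:58Z")        # ServiceIntentions reconciler error

apparent_lifetime = not_after - last_rotation
expired_before_first_error = first_reconcile_error - not_after
expired_at_webhook_failure = webhook_failure - not_after

print("apparent certificate lifetime:", apparent_lifetime)
print("expired before first reconcile error:", expired_before_first_error)
print("expired at webhook failure:", expired_at_webhook_failure)
```

In this excerpt no rotation is logged between the update on 2021-11-30 and the expiry a day later, and the reconcile errors only begin about an hour after the certificate had already expired.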
Clocks are in sync:
k exec -it consul-webhook-cert-manager-fd75d65cc-jzbh2 -- date
Wed Dec 1 06:50:55 UTC 2021
k exec -it consul-controller-5596567966-ff4m2 -- date
Wed Dec 1 06:51:15 UTC 2021
I believe this might be related to node restarts, as we saw similar behaviour right after maintenance in which worker nodes were restarted one by one.
Actually, I think I figured out the issue in my case. It was because I had paused and then unpaused Docker Desktop.
Will close, as it's related to pausing and unpausing nodes, which is infra-related rather than Consul K8s-related.