consul-k8s

x509: Certificate expired issue when resuming paused k8s clusters

Open amit106679 opened this issue 3 years ago • 7 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

After installing Consul with Helm and upgrading it via Helm, this error starts to appear: x509: certificate has expired or is not yet valid: current time 2021-11-09T05:37:48Z is after 2021-11-08T19:40:44Z

Reproduction Steps

Consul Version : 1.10.2

Install Consul and try upgrading it with Helm. The error shows up after some time.

Expected behavior

Consul works without showing this error.

Additional context

This error is related to the earlier issue #808, which @lkysow fixed. These two issues pop up regularly, one after another.

amit106679 • Nov 11 '21 03:11

Hi, we seem to have similar issues in our environment. As a workaround, we recreate the consul-webhook-cert-manager pod. This is quite important to fix: if the API server fails to authenticate to the Consul admission controllers (consul-connect-injector or consul-controller), it can affect all deployments, including ones that are not in the mesh and are completely independent of Consul. This follows from the "consul-connect-injector.consul.hashicorp.com" mutating webhook config:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    meta.helm.sh/release-name: consul      
    meta.helm.sh/release-namespace: consul 
  creationTimestamp: "2021-08-06T08:42:51Z"
  generation: 21
  labels:
    app: consul
    app.kubernetes.io/managed-by: Helm     
    chart: consul-helm
    heritage: Helm
    release: consul
  name: consul-connect-injector-cfg
  resourceVersion: "112996741"
  uid: 6571f50e-ea63-45a3-a1e0-6b6b1ea6a21f
webhooks:
- admissionReviewVersions:
  - v1beta1
  - v1
  clientConfig:
    caBundle: redacted
    service:
      name: consul-connect-injector-svc
      namespace: consul
      path: /mutate
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: consul-connect-injector.consul.hashicorp.com
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kube-system
      - local-path-storage
    - key: control-plane
      operator: DoesNotExist
  objectSelector:
    matchExpressions:
    - key: app
      operator: NotIn
      values:
      - consul
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 10

Every request to create a pod is sent to the consul-connect-injector admission controller (the rule matches CREATE on pods). Because failurePolicy is Fail, if the API server cannot authenticate the webhook due to a bad SSL certificate, the whole operation fails and the pod is not created. We are currently running 1.10.3+ent, but the same issue occurred in earlier versions as well.
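
To make the blast radius concrete, here is a minimal sketch that evaluates the namespaceSelector and objectSelector from the config above against a few example workloads. The selector values are copied from the YAML; the namespace names and pod labels in the examples are hypothetical, and the sketch assumes the request is already a pod CREATE (it does not model the rules section).

```python
# Sketch: would the consul-connect-injector webhook intercept a pod CREATE?
# Selectors are copied from the MutatingWebhookConfiguration above.
# The example namespaces and pod labels below are hypothetical.

NAMESPACE_SELECTOR = {
    "matchExpressions": [
        {"key": "kubernetes.io/metadata.name", "operator": "NotIn",
         "values": ["kube-system", "local-path-storage"]},
        {"key": "control-plane", "operator": "DoesNotExist"},
    ]
}
OBJECT_SELECTOR = {
    "matchExpressions": [
        {"key": "app", "operator": "NotIn", "values": ["consul"]},
    ]
}


def expr_matches(expr, labels):
    """Evaluate one label-selector matchExpression against a label dict."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # NotIn also matches objects that do not carry the key at all.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unknown operator: {op}")


def selector_matches(selector, labels):
    """All matchExpressions of a selector are ANDed together."""
    return all(expr_matches(e, labels) for e in selector["matchExpressions"])


def webhook_intercepts(namespace_labels, pod_labels):
    """True if both the namespace and the pod pass the webhook's selectors."""
    return (selector_matches(NAMESPACE_SELECTOR, namespace_labels)
            and selector_matches(OBJECT_SELECTOR, pod_labels))


# A workload with no relation to Consul (hypothetical labels).
unrelated = webhook_intercepts(
    {"kubernetes.io/metadata.name": "payments"}, {"app": "billing-api"})
# A pod in kube-system is excluded by the namespace selector.
kube_system = webhook_intercepts(
    {"kubernetes.io/metadata.name": "kube-system"}, {"k8s-app": "kube-dns"})
# Consul's own pods are excluded by the object selector.
consul_pod = webhook_intercepts(
    {"kubernetes.io/metadata.name": "consul"}, {"app": "consul"})

print(unrelated, kube_system, consul_pod)  # True False False
```

Only kube-system, local-path-storage, namespaces labeled control-plane, and pods labeled app=consul are skipped; every other pod creation goes through the webhook and, with failurePolicy: Fail, is rejected when the TLS handshake fails.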

andriktr • Nov 15 '21 21:11

Hi @andriktr or @amit106679, thanks for the feedback. Would it be possible to provide the config.yaml so we can understand how you are deploying Consul K8s? Also, I assume you are able to consistently reproduce the behavior?

david-yu • Nov 15 '21 21:11

Could you also get us the logs from the webhook cert manager?

lkysow • Nov 15 '21 21:11

Hi, I'm not sure how easy it is to reproduce the behavior, because it is still unclear when it happens. Here is how the 3-day log of the webhook cert manager looks: [image]

The last time we saw the issue was on 11.15 at 11:46 (according to the log timestamp), and at about 11:50 on 11.15 the webhook cert manager was restarted (you can see in the log above that it starts to rotate/update the certs). Here is the log of the consul-connect-injector admission controller at about the same time: [image]

As you can see, until I restarted the webhook cert manager pod, consul-connect-injector was logging:

2021/11/15 11:46:45 http: TLS handshake error from 10.162.216.119:45722: remote error: tls: bad certificate
2021/11/15 11:46:45 http: TLS handshake error from 10.162.216.119:45738: remote error: tls: bad certificate
2021/11/15 11:46:50 http: TLS handshake error from 10.162.216.119:46030: remote error: tls: bad certificate
2021/11/15 11:46:51 http: TLS handshake error from 10.162.216.119:46104: remote error: tls: bad certificate
2021/11/15 11:46:52 http: TLS handshake error from 10.162.216.119:46126: remote error: tls: bad certificate
2021/11/15 11:47:36 http: TLS handshake error from 10.162.216.119:48430: remote error: tls: bad certificate

During this window (11:46 to 11:50) we created a deployment (not related to Consul), and its pods were not created because of the consul-connect-injector SSL error.

A few days earlier we saw similar SSL issues on our dev cluster: when we tried to patch a Consul ServiceDefaults resource, the operation failed with:

cannot patch "k8s-test-app-dev" with kind ServiceDefaults: Internal error occurred: failed calling webhook "mutate-servicedefaults.consul.hashicorp.com": Post "https://consul-controller-webhook.consul.svc:443/mutate-v1alpha1-servicedefaults?timeout=10s": x509: certificate signed by unknown authority

Also attaching our Helm values.yaml: values.txt

andriktr • Nov 16 '21 07:11

I saw this on a minikube cluster I restarted. Different logs though:

2021-12-01T06:46:58.601Z	ERROR	controller.serviceintentions	Reconciler error	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceIntentions", "name": "backend", "namespace": "default", "error": "Internal error occurred: failed calling webhook \"mutate-serviceintentions.consul.hashicorp.com\": failed to call webhook: Post \"https://consul-controller-webhook.default.svc:443/mutate-v1alpha1-serviceintentions?timeout=10s\": x509: certificate has expired or is not yet valid: current time 2021-12-01T06:46:58Z is after 2021-12-01T02:05:16Z"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2021-11-30T02:05:16.511Z [INFO]  Updated certificate bundle received for consul-connect-injector-cfg; Updating webhook certs.
2021-11-30T02:05:16.595Z [INFO]  Updated certificate bundle received for consul-controller-mutating-webhook-configuration; Updating webhook certs.
2021-11-30T02:05:16.910Z [INFO]  Updating secret with new certificate: mutatingwebhookconfig=consul-controller-mutating-webhook-configuration secret=consul-controller-webhook-cert secretNS=default
2021-11-30T02:05:16.911Z [INFO]  Updating secret with new certificate: mutatingwebhookconfig=consul-connect-injector-cfg secret=consul-connect-inject-webhook-cert secretNS=default
2021-11-30T02:05:16.933Z [INFO]  Updating webhook configuration with new CA: mutatingwebhookconfig=consul-controller-mutating-webhook-configuration secret=consul-controller-webhook-cert secretNS=default
2021-11-30T02:05:16.933Z [INFO]  Updating webhook configuration with new CA: mutatingwebhookconfig=consul-connect-injector-cfg secret=consul-connect-inject-webhook-cert secretNS=default
2021-12-01T03:10:28.001Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:10:28.004Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:11:29.160Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:11:29.161Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:12:30.289Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:12:30.372Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:13:31.322Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:13:31.409Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:14:32.426Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:14:32.442Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:15:33.467Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:15:33.471Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:16:34.489Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:16:34.506Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:17:35.444Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:17:35.631Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:18:36.462Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"
2021-12-01T03:18:36.672Z [ERROR] failed to reconcile certificates: err="the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps consul-webhook-cert-manager)"

Clocks are in sync:

k exec -it consul-webhook-cert-manager-fd75d65cc-jzbh2 -- date
Wed Dec  1 06:50:55 UTC 2021
k exec -it consul-controller-5596567966-ff4m2 -- date
Wed Dec  1 06:51:15 UTC 2021
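
The timestamps in the logs above can be lined up with a small script. This is only a sketch: the error text and the rotation timestamp are copied from the logs, the helper names are made up for illustration, and the roughly 24-hour gap it prints is an observation from these two log lines, not a confirmed certificate lifetime.

```python
import re
from datetime import datetime, timezone

# Error text copied from the controller log above.
ERROR = ("x509: certificate has expired or is not yet valid: "
         "current time 2021-12-01T06:46:58Z is after 2021-12-01T02:05:16Z")
# Timestamp of the first "Updating secret with new certificate" entry above.
LAST_ROTATION = "2021-11-30T02:05:16.910Z"

TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"


def parse_ts(text):
    """Parse a UTC timestamp like the ones in the logs, with or without fractions."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in text else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def parse_expiry_error(message):
    """Return (current_time, not_after) from the x509 expiry error text."""
    m = re.search(rf"current time ({TS}) is after ({TS})", message)
    if m is None:
        raise ValueError("not an x509 expiry error")
    return parse_ts(m.group(1)), parse_ts(m.group(2))


now, not_after = parse_expiry_error(ERROR)
expired_for = now - not_after
since_rotation = not_after - parse_ts(LAST_ROTATION)

print("expired for:", expired_for)                     # 4:41:42
print("expiry minus last rotation:", since_rotation)   # 23:59:59.090000
```

The certificate expired almost exactly one day after the last logged rotation, and the first "failed to reconcile certificates" entry (03:10:28) comes after that expiry, which is consistent with a rotation that never ran rather than with clock skew between pods.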

lkysow • Dec 01 '21 06:12

I believe this might be related to node restarts, as we saw similar behaviour right after a maintenance window in which the worker nodes were restarted one by one.

andriktr • Dec 01 '21 08:12

Actually, I think I figured out the issue in my case. It was because I had paused and then unpaused Docker Desktop.

lkysow • Dec 01 '21 16:12

Closing, as this is related to pausing and unpausing nodes, which is an infrastructure issue rather than a Consul K8s issue.

david-yu • Aug 30 '22 05:08