
KCP should persist etcd client certificate

randomvariable opened this issue 3 years ago • 15 comments

What steps did you take and what happened: [A clear and concise description on how to REPRODUCE the bug.]

  1. Deploy a vSphere workload cluster from another vSphere management cluster. All ESXi hosts are NTP synced.
  2. Only one control plane instance appears when 3 replicas were requested. KCP reports etcd as unhealthy; in the meantime, a worker is created and successfully joins.
  3. After a very long time, another CP instance is created.

In the etcd logs, we see:

2021-03-11 07:59:48.743560 I | embed: rejected connection from "127.0.0.1:39236" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T07:59:48Z is before 2021-03-11T08:09:21Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 07:59:49.773047 I | embed: rejected connection from "127.0.0.1:39244" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T07:59:49Z is before 2021-03-11T08:09:21Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:06.849986 I | embed: rejected connection from "127.0.0.1:39278" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:06Z is before 2021-03-11T08:09:39Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:07.882861 I | embed: rejected connection from "127.0.0.1:39282" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:07Z is before 2021-03-11T08:09:39Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:24.663342 I | embed: rejected connection from "127.0.0.1:39318" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:24Z is before 2021-03-11T08:09:57Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:26.286498 I | embed: rejected connection from "127.0.0.1:39322" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:26Z is before 2021-03-11T08:09:57Z", ServerName "etcd-work176-master-control-plane-5xrrc")
...

KCP logs show:

I0312 00:06:13.548340       1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="work176" "kubeadmControlPlane"="work176-master-control-plane" "namespace"="work176" "failures"="machine work176-master-control-plane-5xrrc reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the work176-master-control-plane-5xrrc node)"

What did you expect to happen:

KCP is able to connect to etcd in < 20 minutes and create new replicas.

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

The etcd logs show that we have a sliding window of certificate validity that keeps moving into the future. For stacked etcd, KCP generates a new etcd client certificate on every connect. This certificate should be persisted and reused, which would cut down on CPU cycles.

In this instance, it appears that for some period the control plane instance's clock was running about 15 minutes slow, but it did eventually synchronise. If KCP generated a single certificate and persisted it, we would still see the same problem, but only for five minutes (KCP sets the certificate's NotBefore to time.Now() - 5 minutes).
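
As a rough illustration of the failure mode (not the actual KCP code; the CN, key size, and validity period below are assumptions, while the backdated NotBefore matches the five-minute offset mentioned above), the per-connection generation looks roughly like this:

package etcdcert

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

// newEtcdClientCert issues a fresh client certificate signed by the etcd CA.
// Because a new certificate is generated for every connection, a node whose
// clock is more than ~5 minutes slow keeps seeing a NotBefore in its future,
// which is exactly the "certificate ... is not yet valid" rejection in the
// etcd logs above.
func newEtcdClientCert(caCert *x509.Certificate, caKey *rsa.PrivateKey) (*x509.Certificate, *rsa.PrivateKey, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: "cluster-api-etcd-client"}, // illustrative CN
		NotBefore:    time.Now().Add(-5 * time.Minute),                 // backdated by 5 minutes, as noted above
		NotAfter:     time.Now().Add(1 * time.Hour),                    // validity period is an assumption
		KeyUsage:     x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}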

Environment:

  • Cluster-api version: 0.3.14
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug [One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

randomvariable avatar Mar 12 '21 18:03 randomvariable

In this instance, it appears that for some period the control plane instance's clock was running about 15 minutes slow, but it did eventually synchronise.

This seems an NTP problem...

fabriziopandini avatar Mar 12 '21 20:03 fabriziopandini

/milestone v0.4.0

fabriziopandini avatar Mar 12 '21 20:03 fabriziopandini

/milestone Next /priority awaiting-more-evidence

@randomvariable While persisting the certificate could be a good improvement (although that seems mostly on the kubeadm side), I'd agree with Fabrizio above that it looks more like an NTP clock skew issue.

vincepri avatar Mar 12 '21 22:03 vincepri

This seems an NTP problem...

Yes, we're aware; the VM did eventually reach NTP synchronisation, it just took longer than expected. We don't expect to see this very often, but certificate persistence would still be useful.

Note that it's not a kubeadm issue, it's KCP's client connection to etcd in https://github.com/kubernetes-sigs/cluster-api/blob/master/controlplane/kubeadm/internal/workload_cluster.go#L389
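
As a hypothetical sketch of the "persist and reuse" idea (not what the linked workload_cluster.go does today), the generated client certificate could be cached per cluster and only regenerated near expiry, so a single backdated NotBefore is reused instead of a sliding one:

package etcdcert

import (
	"crypto/tls"
	"crypto/x509"
	"sync"
	"time"
)

// certCache holds one etcd client certificate per workload cluster
// (keyed by "namespace/cluster-name") so it can be reused across connections.
type certCache struct {
	mu    sync.Mutex
	certs map[string]tls.Certificate
}

// get returns the cached certificate if it is still comfortably within its
// validity window; otherwise it calls generate and caches the result.
func (c *certCache) get(clusterKey string, generate func() (tls.Certificate, error)) (tls.Certificate, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cached, ok := c.certs[clusterKey]; ok && len(cached.Certificate) > 0 {
		if leaf, err := x509.ParseCertificate(cached.Certificate[0]); err == nil &&
			time.Now().Before(leaf.NotAfter.Add(-5*time.Minute)) {
			return cached, nil // reuse: the NotBefore no longer slides forward
		}
	}
	fresh, err := generate()
	if err != nil {
		return tls.Certificate{}, err
	}
	if c.certs == nil {
		c.certs = map[string]tls.Certificate{}
	}
	c.certs[clusterKey] = fresh
	return fresh, nil
}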

randomvariable avatar Mar 15 '21 12:03 randomvariable

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Jun 13 '21 12:06 fejta-bot

I encountered a similar problem: when I changed the control plane replicas from 1 to 3, it failed. KCP logs show:

I0707 00:32:51.445862       1 controller.go:350] controller-runtime/manager/controller/kubeadmcontrolplane "msg"="Scaling up control plane" "cluster"="innercluster" "name"="innercluster-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane" "Desired"=3 "Existing"=1
I0707 00:32:51.445992       1 scale.go:205]  "msg"="Waiting for control plane to pass preflight checks" "cluster-name"="innercluster" "name"="innercluster-control-plane" "namespace"="default" "failures"="machine innercluster-control-plane-jmtjk reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the innercluster-control-plane-twpm8 node)"

KCP status:

...
status:
    conditions:
    - lastTransitionTime: "2021-07-06T12:07:15Z"
      message: Scaling up control plane to 3 replicas (actual 1)
      reason: ScalingUp
      severity: Warning
      status: "False"
      type: Ready
    - lastTransitionTime: "2021-07-06T08:14:11Z"
      status: "True"
      type: Available
    - lastTransitionTime: "2021-07-06T08:08:21Z"
      status: "True"
      type: CertificatesAvailable
    - lastTransitionTime: "2021-07-06T08:14:12Z"
      status: "True"
      type: ControlPlaneComponentsHealthy
    - lastTransitionTime: "2021-07-06T08:14:14Z"
      message: 'Following machines are reporting unknown etcd member status: innercluster-control-plane-jmtjk'
      reason: EtcdClusterUnknown
      status: Unknown
      type: EtcdClusterHealthyCondition
    - lastTransitionTime: "2021-07-06T08:08:23Z"
      status: "True"
      type: MachinesCreated
    - lastTransitionTime: "2021-07-06T08:09:28Z"
      status: "True"
      type: MachinesReady
    - lastTransitionTime: "2021-07-06T12:07:15Z"
      message: Scaling up control plane to 3 replicas (actual 1)
      reason: ScalingUp
      severity: Warning
      status: "False"
      type: Resized
    initialized: true
    observedGeneration: 2
    ready: true
    readyReplicas: 1
    replicas: 1
    selector: cluster.x-k8s.io/cluster-name=innercluster,cluster.x-k8s.io/control-plane
    updatedReplicas: 1
...

In the control plane etcd logs:

2021-07-07 00:18:22.919594 I | embed: rejected connection from "127.0.0.1:48141" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:23.969817 I | embed: rejected connection from "127.0.0.1:48155" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:26.301411 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-07-07 00:18:36.301451 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-07-07 00:18:41.867121 I | embed: rejected connection from "127.0.0.1:48275" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:43.017143 I | embed: rejected connection from "127.0.0.1:48287" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:44.369349 I | mvcc: store.index: compact 132110
2021-07-07 00:18:44.371512 I | mvcc: finished scheduled compaction at 132110 (took 1.901392ms)
2021-07-07 00:18:44.377872 I | embed: rejected connection from "127.0.0.1:48303" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:45.426647 I | embed: rejected connection from "127.0.0.1:48313" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")

Is there any solution?

/remove-lifecycle stale

chymy avatar Jul 07 '21 00:07 chymy

/lifecycle frozen

randomvariable avatar Jul 07 '21 15:07 randomvariable

@chymy you should ensure that there's a good NTP time source in your environment, which should prevent the problem from occurring. The fix proposed here is an optimisation of sorts.

Two things to ensure in your environment: first, that your OpenStack hosts are time-synced, so that when VMs boot they are already relatively well synced; and second, that each VM guest has a time source configured, either via DHCP Option 42 or by setting NTP servers in the KubeadmConfig and KCP resources.
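
For the second point, a sketch of what that could look like with the Go API types (field names from the kubeadm bootstrap API; the import paths assume the v1beta1 packages, so adjust for the API version you run, and the NTP server address is a placeholder):

package example

import (
	bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1beta1"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
)

// withNTP configures NTP servers for control plane machines via the
// KubeadmControlPlane's embedded KubeadmConfigSpec; the same ntp block can be
// set on KubeadmConfig/KubeadmConfigTemplate resources for workers.
func withNTP(kcp *controlplanev1.KubeadmControlPlane) {
	enabled := true
	kcp.Spec.KubeadmConfigSpec.NTP = &bootstrapv1.NTP{
		Enabled: &enabled,
		Servers: []string{"ntp.example.com"}, // placeholder NTP server
	}
}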

randomvariable avatar Jul 07 '21 15:07 randomvariable

I made sure NTP is good. However, in the same environment and after testing, alpha3 (should be v0.3.11) is OK, but alpha4 (v0.3.16) is not.

chymy avatar Jul 08 '21 00:07 chymy

v0.3.16 is still using v1alpha3; v1alpha4 is on v0.4.x

vincepri avatar Jul 20 '21 15:07 vincepri

/area control-plane

randomvariable avatar Nov 02 '21 14:11 randomvariable

/assign @yastij to take a closer look

sbueringer avatar Feb 11 '22 18:02 sbueringer

Any updates on this issue?

vincepri avatar Jul 06 '22 22:07 vincepri

/triage accepted /unassign @yastij

/help

fabriziopandini avatar Oct 03 '22 17:10 fabriziopandini

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/triage accepted /unassign @yastij

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 03 '22 17:10 k8s-ci-robot

(doing some cleanup on old issues without updates)

/close

There is still a probability that this issue might happen, but we don't have recent reports about this happening, so I'm closing this for now.

fabriziopandini avatar Mar 24 '23 16:03 fabriziopandini

@fabriziopandini: Closing this issue.

In response to this:

(doing some cleanup on old issues without updates)

/close

There is still a probability that this issue might happen, but we don't have recent reports about this happening, so I'm closing this for now.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 24 '23 16:03 k8s-ci-robot