cluster-api
KCP should persist etcd client certificate
What steps did you take and what happened:
- Deploy a vSphere workload cluster from another vSphere management cluster. All ESXi hosts are NTP synced.
- Only one control plane instance appears even though 3 replicas were requested. KCP reports etcd as unhealthy; in the meantime, a worker is created and joins successfully.
- After a very long time, another CP instance is created.
In the etcd logs, we see:
2021-03-11 07:59:48.743560 I | embed: rejected connection from "127.0.0.1:39236" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T07:59:48Z is before 2021-03-11T08:09:21Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 07:59:49.773047 I | embed: rejected connection from "127.0.0.1:39244" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T07:59:49Z is before 2021-03-11T08:09:21Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:06.849986 I | embed: rejected connection from "127.0.0.1:39278" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:06Z is before 2021-03-11T08:09:39Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:07.882861 I | embed: rejected connection from "127.0.0.1:39282" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:07Z is before 2021-03-11T08:09:39Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:24.663342 I | embed: rejected connection from "127.0.0.1:39318" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:24Z is before 2021-03-11T08:09:57Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:26.286498 I | embed: rejected connection from "127.0.0.1:39322" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:26Z is before 2021-03-11T08:09:57Z", ServerName "etcd-work176-master-control-plane-5xrrc")
...
KCP logs show:
I0312 00:06:13.548340 1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="work176" "kubeadmControlPlane"="work176-master-control-plane" "namespace"="work176" "failures"="machine work176-master-control-plane-5xrrc reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the work176-master-control-plane-5xrrc node)"
What did you expect to happen:
KCP is able to connect to etcd in < 20 minutes and create new replicas.
Anything else you would like to add:
The etcd logs show that we have a sliding window of certificate validity that keeps moving into the future. For stacked etcd, KCP generates a new etcd client certificate on every connect. This certificate should be persisted and reused, which would cut down on CPU cycles.
In this instance, it appears the control plane instance's clock was running about 15 minutes behind for some period, but it eventually did synchronise. If KCP had generated a single certificate and persisted it, we would still have seen the same problem, but only for five minutes (KCP sets the certificate's NotBefore to time.Now() - 5 minutes).
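For illustration, here is a minimal, self-contained Go sketch (not KCP's actual code; the common name and validity period are made up) of why a freshly minted client certificate trips over clock skew: with NotBefore backdated only five minutes, an etcd server whose clock lags by roughly 15 minutes still rejects the certificate as not yet valid, matching the log lines above.

// Minimal sketch, not KCP's actual code: mint a client certificate with a
// NotBefore backdated five minutes and check it against a clock that is
// 15 minutes slow, reproducing the "not yet valid" rejection seen above.
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

func main() {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}

	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "etcd-client"}, // illustrative CN only
		NotBefore:    now.Add(-5 * time.Minute),            // the small backdate described above
		NotAfter:     now.AddDate(1, 0, 0),                 // validity length is illustrative
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}

	// Self-signed to keep the sketch self-contained; KCP signs with the cluster's etcd CA.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}

	// Simulate an etcd host whose clock is 15 minutes slow: from its point of
	// view the certificate is not yet valid.
	etcdClock := now.Add(-15 * time.Minute)
	if etcdClock.Before(cert.NotBefore) {
		fmt.Printf("rejected: current time %s is before %s\n",
			etcdClock.Format(time.RFC3339), cert.NotBefore.Format(time.RFC3339))
	}
}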
Environment:
- Cluster-api version: 0.3.14
- Minikube/KIND version:
- Kubernetes version: (use kubectl version):
- OS (e.g. from /etc/os-release):
/kind bug
In this instance, it appears the control plane instance's clock was running about 15 minutes behind for some period, but it eventually did synchronise.
This seems an NTP problem...
/milestone v0.4.0
/milestone Next /priority awaiting-more-evidence
@randomvariable While persisting the certificate could be a good improvement (although that seems mostly on the kubeadm side), I'd agree with Fabrizio above that this looks more like an NTP clock skew issue.
This seems an NTP problem...
Yes, we're aware. The VM did eventually reach NTP synchronisation; it just took longer than expected. We don't expect to see this very often, but certificate persistence would still be useful.
Note that it's not a kubeadm issue; it's KCP's client connection to etcd in https://github.com/kubernetes-sigs/cluster-api/blob/master/controlplane/kubeadm/internal/workload_cluster.go#L389
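For context, a hedged sketch of what persisting and reusing that client certificate could look like. This is not a patch against workload_cluster.go; the package, type names, and the generate hook are hypothetical, and the 10-minute expiry margin is an arbitrary choice.

// Possible shape of the fix, sketched outside the real KCP code base:
// cache the generated etcd client certificate per cluster and reuse it
// until it is close to expiry, instead of minting a new one per connection.
package etcdclientcert

import (
	"crypto/tls"
	"crypto/x509"
	"sync"
	"time"
)

type cachedCert struct {
	cert     tls.Certificate
	notAfter time.Time
}

// Cache maps a cluster key (e.g. namespace/name) to a reusable client certificate.
type Cache struct {
	mu       sync.Mutex
	certs    map[string]cachedCert
	generate func(clusterKey string) (tls.Certificate, error) // signs against the etcd CA; assumed to exist
}

func NewCache(generate func(string) (tls.Certificate, error)) *Cache {
	return &Cache{certs: map[string]cachedCert{}, generate: generate}
}

// Get returns the cached certificate for the cluster, generating a new one
// only when none is cached or the cached one is within 10 minutes of expiry.
func (c *Cache) Get(clusterKey string) (tls.Certificate, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if cached, ok := c.certs[clusterKey]; ok && time.Now().Add(10*time.Minute).Before(cached.notAfter) {
		return cached.cert, nil
	}

	cert, err := c.generate(clusterKey)
	if err != nil {
		return tls.Certificate{}, err
	}
	notAfter := time.Time{}
	if leaf, parseErr := x509.ParseCertificate(cert.Certificate[0]); parseErr == nil {
		notAfter = leaf.NotAfter
	}
	c.certs[clusterKey] = cachedCert{cert: cert, notAfter: notAfter}
	return cert, nil
}

Persisting the certificate in a Secret rather than in memory would additionally survive controller restarts, which is closer to what the issue title asks for.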
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I encountered a similar problem: when I scaled the control plane replicas from 1 to 3, it failed. KCP logs show:
I0707 00:32:51.445862 1 controller.go:350] controller-runtime/manager/controller/kubeadmcontrolplane "msg"="Scaling up control plane" "cluster"="innercluster" "name"="innercluster-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane" "Desired"=3 "Existing"=1
I0707 00:32:51.445992 1 scale.go:205] "msg"="Waiting for control plane to pass preflight checks" "cluster-name"="innercluster" "name"="innercluster-control-plane" "namespace"="default" "failures"="machine innercluster-control-plane-jmtjk reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the innercluster-control-plane-twpm8 node)"
kcp status:
...
status:
conditions:
- lastTransitionTime: "2021-07-06T12:07:15Z"
message: Scaling up control plane to 3 replicas (actual 1)
reason: ScalingUp
severity: Warning
status: "False"
type: Ready
- lastTransitionTime: "2021-07-06T08:14:11Z"
status: "True"
type: Available
- lastTransitionTime: "2021-07-06T08:08:21Z"
status: "True"
type: CertificatesAvailable
- lastTransitionTime: "2021-07-06T08:14:12Z"
status: "True"
type: ControlPlaneComponentsHealthy
- lastTransitionTime: "2021-07-06T08:14:14Z"
message: 'Following machines are reporting unknown etcd member status: innercluster-control-plane-jmtjk'
reason: EtcdClusterUnknown
status: Unknown
type: EtcdClusterHealthyCondition
- lastTransitionTime: "2021-07-06T08:08:23Z"
status: "True"
type: MachinesCreated
- lastTransitionTime: "2021-07-06T08:09:28Z"
status: "True"
type: MachinesReady
- lastTransitionTime: "2021-07-06T12:07:15Z"
message: Scaling up control plane to 3 replicas (actual 1)
reason: ScalingUp
severity: Warning
status: "False"
type: Resized
initialized: true
observedGeneration: 2
ready: true
readyReplicas: 1
replicas: 1
selector: cluster.x-k8s.io/cluster-name=innercluster,cluster.x-k8s.io/control-plane
updatedReplicas: 1
...
In the control plane etcd logs:
2021-07-07 00:18:22.919594 I | embed: rejected connection from "127.0.0.1:48141" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:23.969817 I | embed: rejected connection from "127.0.0.1:48155" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:26.301411 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-07-07 00:18:36.301451 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-07-07 00:18:41.867121 I | embed: rejected connection from "127.0.0.1:48275" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:43.017143 I | embed: rejected connection from "127.0.0.1:48287" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:44.369349 I | mvcc: store.index: compact 132110
2021-07-07 00:18:44.371512 I | mvcc: finished scheduled compaction at 132110 (took 1.901392ms)
2021-07-07 00:18:44.377872 I | embed: rejected connection from "127.0.0.1:48303" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:45.426647 I | embed: rejected connection from "127.0.0.1:48313" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
Is there any solution?
/remove-lifecycle stale
/lifecycle frozen
@chymy you should ensure that there's a good NTP time source in your environment, which should prevent the problem from occurring. The fix tracked here is an optimisation of sorts.
Two things to ensure in your environment: first, that your OpenStack hosts are time synced, so that when VMs boot they are already reasonably well synced; and second, that each VM guest has a time source configured, either via DHCP Option 42 or by setting NTP servers in the KubeadmConfig and KCP resources.
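A hedged sketch of that second option, setting NTP servers through the KubeadmControlPlane's embedded KubeadmConfigSpec. The server addresses are placeholders and the import paths are assumed to match cluster-api v0.3.x; the same ntp stanza also applies to standalone KubeadmConfig resources.

// Sketch of configuring NTP on control plane machines via the
// KubeadmControlPlane's kubeadmConfigSpec (rendered into cloud-init).
package main

import (
	"fmt"

	bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1alpha3"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1alpha3"
)

func main() {
	enabled := true
	kcp := controlplanev1.KubeadmControlPlane{
		Spec: controlplanev1.KubeadmControlPlaneSpec{
			KubeadmConfigSpec: bootstrapv1.KubeadmConfigSpec{
				NTP: &bootstrapv1.NTP{
					Enabled: &enabled,
					Servers: []string{"0.pool.ntp.org", "1.pool.ntp.org"}, // placeholder servers
				},
			},
		},
	}
	fmt.Println(kcp.Spec.KubeadmConfigSpec.NTP.Servers)
}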
I made sure NTP is good. However, in the same environment, after testing, alpha3 (should be v0.3.11) is OK, but alpha4 (v0.3.16) is not.
v0.3.16 is still using v1alpha3; v1alpha4 is on v0.4.x.
/area control-plane
/assign @yastij to take a closer look
Any updates on this issue?
/triage accepted /unassign @yastij
/help
@fabriziopandini: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
In response to this:
/triage accepted /unassign @yastij
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(doing some cleanup on old issues without updates) /close There is still a probability that this issue might happen, but we don't have recent reports of it happening, so I'm closing this for now.
@fabriziopandini: Closing this issue.
In response to this:
(doing some cleanup on old issues without updates) /close There is still a probability that this issue might happen, but we don't have recent reports of it happening, so I'm closing this for now.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.