Proposal: Add Network Latency Pre-check to kubeadm join Process
What keywords did you search in kubeadm issues before filing this one?
Searched for "kubeadm join failure", "node join failure", "pre-checks for kubeadm join".
Is this a BUG REPORT or FEATURE REQUEST?
It's a FEATURE REQUEST.
What happened?
In our current setup, we use Cluster API for programmatic management of Kubernetes management and workload clusters. During an upgrade of a workload cluster, we programmatically patch some resources in the controllers so that Cluster API calls into the infrastructure provider to create a new VM with the updated image and initiates a kubeadm join of the new node (Node2) to the initial node (Node1). Once the join is complete, Node1 is drained and then deleted. We encountered an issue where a node brought up by Cluster API failed to join the existing node, leaving the cluster in a bad state.
While debugging the issue, we found that some operations were taking longer than expected and saw context deadline errors in the logs. We then ran etcdctl check perf from the failing node (Node2) and it failed due to low throughput. The same command passed when executed from the first node (Node1) against the Node1 etcd endpoint, but failed when executed from Node2 against that same endpoint. This discrepancy suggests that network latency could be the root cause, since the first result indicates that the etcd server on Node1 itself is functioning properly.
What you expected to happen?
While it is possible to perform manual checks before attempting a kubeadm join, in our use case the process is automated through Cluster API, so manual checks are not feasible. We do not know what the networking from Node2 to Node1 will look like until the new node is brought up. Therefore, we propose adding a pre-check to the kubeadm join process itself.
Proposal
We propose to add a pre-check before attempting kubeadm join to ensure the network latencies between the joining node and the existing nodes are within acceptable limits. This can help fail fast if the network conditions are not suitable for adding a new node to the cluster.
This could be a simple check that attempts to connect to the etcd server on the existing nodes and performs a simple Put followed by a Get operation. If the operation fails or takes longer than a specified timeout, the check returns an error. This could be performed as part of the kubeadm join pre-flight checks.
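For illustration, a minimal sketch of what such a check might look like using the etcd Go client; the function name, probe key, and threshold below are hypothetical and are not part of kubeadm today:

```go
package preflight

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// checkEtcdRoundTrip is a hypothetical pre-flight check: it connects to one
// existing etcd endpoint, performs a Put followed by a Get on a probe key,
// and returns an error if the round trip does not finish within threshold.
func checkEtcdRoundTrip(endpoint string, tlsCfg *tls.Config, threshold time.Duration) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: threshold,
		TLS:         tlsCfg, // client certs signed by the etcd CA would be needed here
	})
	if err != nil {
		return fmt.Errorf("cannot connect to etcd at %s: %w", endpoint, err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), threshold)
	defer cancel()

	const probeKey = "/kubeadm-preflight/latency-probe" // illustrative key
	start := time.Now()
	if _, err := cli.Put(ctx, probeKey, "ping"); err != nil {
		return fmt.Errorf("etcd Put did not complete within %v: %w", threshold, err)
	}
	if _, err := cli.Get(ctx, probeKey); err != nil {
		return fmt.Errorf("etcd Get did not complete within %v: %w", threshold, err)
	}
	_, _ = cli.Delete(ctx, probeKey) // best-effort cleanup of the probe key

	if elapsed := time.Since(start); elapsed > threshold {
		return fmt.Errorf("etcd Put+Get round trip took %v, above the %v threshold", elapsed, threshold)
	}
	return nil
}
```

Where the joining node would get the client certificates from at pre-flight time is one of the open questions discussed below.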
Motivation
The motivation behind this proposal is to improve the reliability of the kubeadm join process and prevent clusters from entering a bad state due to failed node joins. By adding such a pre-check, we can get some insight into the network conditions and determine whether they are suitable for a new node to join the cluster. This allows failing fast instead of leaving the cluster in a bad state and improves the overall stability of our automated processes.
thanks for the detailed information.
This could be a simple check that attempts to connect to the etcd server on the existing nodes and performs a simple Put followed by a Get operation. If the operation fails or takes longer than a specified timeout, the check returns an error. This could be performed as part of the kubeadm join pre-flight checks.
there is a bit of a problem for setups that don't have the etcd image yet. the etcd image is downloaded on preflight, so a potential etcd cluster check must be executed only after we have downloaded the etcd image. we want to use etcdctl from the etcd image and not embed its code in kubeadm, but note that there was also a plan to delete etcdctl from the official etcd image from k8s, so using etcdctl might not be a good idea for us long term.
can you show us an example of a full etcdctl check perf command? i assume it needs mTLS to talk to the existing cluster?
if it needs mTLS this would be a problem in terms of cert preparation.
another problem to think of: the etcd cluster logic is formed much later in the join process - i.e. when the etcd member to join the cluster is prepared, so preflight is too early and this might require an undesired rewrite of kubeadm phases.
given these complexities, my initial reaction is to leave it to the user to perform any network / etcd checks manually. speaking of network checks, what alternative solutions do you have in mind, other than etcdctl?
@pacoxu @chrischdi PTAL for comments.
IMHO, running etcdctl check perf is not suitable for every (production?) cluster when joining a control-plane node.
- It causes load to the cluster and may impact the running (production) applications.
- This command also writes data to the production etcd cluster (and cleans it up afterwards), which is the same cluster used by the kube-apiserver.
I would not recommend doing this, especially for a running production cluster.
Cluster API's preKubeadmCommands could be used to run any kind of custom script or binary to perform customized pre-checks.
@neolit123, @chrischdi: Thank you for your comments. We used etcdctl only for debugging, by generating some load from the node that failed to join the existing node. The check kept failing from that node, while the same command run from the first node against its own etcd endpoint showed good latency and throughput.
In our case, the bottleneck might be the network rather than the health of the etcd cluster. We also used traceroute to examine the requests from Node2 to Node1's etcd endpoint, and the results were inconsistent, showing long response times. I agree with the concerns about mTLS, but that only comes into play when using commands that interact with the etcd cluster, right? In our case, a simple traceroute from the node-to-be-joined to the existing etcd endpoint could be helpful. This could also be achieved programmatically using client libraries.
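To make the "programmatically using client libraries" idea concrete, here is a rough sketch in Go of a connection-latency probe that needs no mTLS at all, because it never issues an etcd request; the endpoint, attempt count, and timeout are placeholders:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeConnectLatency opens a TCP connection to addr a few times and returns
// the average time to establish it. No etcd request is made, so no client
// certificates are required; only network reachability and latency are observed.
func probeConnectLatency(addr string, attempts int, dialTimeout time.Duration) (time.Duration, error) {
	var total time.Duration
	for i := 0; i < attempts; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, dialTimeout)
		if err != nil {
			return 0, fmt.Errorf("attempt %d: dialing %s failed: %w", i+1, addr, err)
		}
		total += time.Since(start)
		conn.Close()
	}
	return total / time.Duration(attempts), nil
}

func main() {
	// "10.0.0.10:2379" stands in for the existing node's etcd endpoint.
	avg, err := probeConnectLatency("10.0.0.10:2379", 5, 2*time.Second)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("average connect latency:", avg)
}
```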
@chrischdi I thought a similar thing could be achieved with preKubeadmCommands on the KubeadmControlPlane resource as well. I was considering a simple traceroute to the etcd endpoint in the pre-kubeadm commands, to check whether the latency is within an acceptable level and fail early. We could add some retries and use the mean time of this operation as well. And any failed pre-kubeadm commands would result in the machine being handled by the MachineHealthChecks we define in the infrastructure provider, right?
However, I believe it would also be valid to request that kubeadm perform some basic latency checks, from the node-to-be-joined to the existing node, as part of its pre-flight checks to guard against this kind of problem.
However, I believe it would also be valid to request that kubeadm perform some basic latency checks, from the node-to-be-joined to the existing node, as part of its pre-flight checks to guard against this kind of problem.
can you provide some examples?
I am not a network expert, but some things that can be done are:
- traceroute: can be used to trace the route that packets take from the source to the destination host, e.g. traceroute <etcd-endpoint>. In our case, the calls from Node2 showed very high latencies.
- mtr: a better tool than the above; it combines the functionality of traceroute and ping.
I am not sure, but netperf/curl could also be used to check the latencies. There should be libraries that expose these capabilities programmatically for us to consume. But I think this would require active experimentation to see what kind of latency the joining node's etcd can tolerate towards the existing etcd cluster. We would also need to guard against false positives (probably with retries/aggregated results) and define the right threshold, since a badly chosen threshold could make the check useless.
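As a sketch of the retries/aggregated-results idea, the check below samples a plain TCP connect time several times, tolerates a few transient failures, and compares the median against a threshold; the endpoint, sample count, failure budget, and threshold are made up and would need the experimentation mentioned above:

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"time"
)

// latencyWithinThreshold samples the TCP connect time to addr several times,
// tolerates up to maxFailures transient errors, and compares the median sample
// against threshold. Aggregating over samples reduces false positives caused
// by a single slow or failed attempt.
func latencyWithinThreshold(addr string, samples, maxFailures int, threshold time.Duration) (bool, error) {
	var results []time.Duration
	failures := 0
	for i := 0; i < samples; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, threshold)
		if err != nil {
			failures++
			if failures > maxFailures {
				return false, fmt.Errorf("too many failed probes to %s: %w", addr, err)
			}
			continue
		}
		results = append(results, time.Since(start))
		conn.Close()
	}
	if len(results) == 0 {
		return false, fmt.Errorf("no successful probes to %s", addr)
	}
	sort.Slice(results, func(i, j int) bool { return results[i] < results[j] })
	median := results[len(results)/2]
	return median <= threshold, nil
}

func main() {
	// Placeholder endpoint, sample count, failure budget, and threshold.
	ok, err := latencyWithinThreshold("10.0.0.10:2379", 7, 2, 500*time.Millisecond)
	fmt.Println("latency within threshold:", ok, "error:", err)
}
```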
I am not sure, but netperf/curl could also be used to check the latencies. There should be libraries that expose these capabilities programmatically for us to consume.
if this is added i think it should leverage go libraries.
But I think this would require active experimentation to see what kind of latency the joining node's etcd can tolerate towards the existing etcd cluster. We would also need to guard against false positives (probably with retries/aggregated results) and define the right threshold, since a badly chosen threshold could make the check useless.
is etcd the only target or also LB / apiserver?
i'm going to tag this as needs-kep, given the open questions of what to use, what to perform, and what the targets are: https://github.com/kubernetes/enhancements/blob/master/keps/README.md
added this to the v1.31 milestone
Thank you for considering this feature and adding it to the milestone. I understand that there are open questions regarding what tool to use, what checks to perform, and what the targets should be.
Let me see if I can contribute to the KEP process directly, but I'm very interested in following the progress and providing user perspective/feedback. Thanks again!
no problem. note that if we don't reach agreement on a proposal doc for the next release, 1.31 (not 1.30), we should probably close this ticket and leave it to users or higher-level tools to orchestrate the checks that they need.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
kep deadline for 1.31 is approaching and we have no indication for an alpha kep. let's reopen if needed.