
Cannot replace etcd nodes over certain sizes

Fizzadar opened this issue 1 year ago

The hardcoded 30s timeout is too low for defragmentation to complete on larger (8+G) etcd shards:

https://github.com/k3s-io/k3s/blob/c0d661b334bc3cbe30d80e5aab6af4b92d3eb503/pkg/etcd/etcd.go#L58

I realize we're well beyond the "standard" performance envelope of etcd, but it works just fine for us across multiple clusters where etcd is 8G or even 10G in size (our quota limit is 16G). The only thing making this near-impossible to scale beyond is this hardcoded timeout. Would it be possible to make it configurable? I can open a PR if that would be of interest...

Note we're using rke2 (1.26.15), but this is the timeout we're hitting.

Fizzadar avatar Oct 18 '24 13:10 Fizzadar

Can you provide some information on your environment? Specifically:

  • How long DOES a defrag take on these nodes? Can you provide the output of time etcdctl defrag [...] on this node?
  • How fragmented is your datastore? Can you provide the output of etcdctl endpoint status -w json?
  • What sort of disks are you using? Can you leave iostat -x -d 2 running before starting the defrag, and capture its output through the end of the defrag process?

brandond avatar Oct 18 '24 18:10 brandond

Those are dedicated nodes, but their disk throughput wasn't enough to complete an eight-gig defrag in 30 seconds. The limit is just a tad too tight at our scale: defrag times started to exceed 30 seconds, though they stayed under 40. We have just shy of 100k pods and 100 nodes on these clusters, plus a lot of ConfigMaps and the like that balloon etcd quite a bit, so we do expect operations to take more than the average amount of time.

We also had to rescue a cluster this weekend that ended up losing all three control nodes around the same time, because the rke2-server service was stuck timing out while waiting for defrag. Our quick fix was to rebuild the rke2 binary with a higher timeout, as requested here, so that the wait for defrag completed and we could get our control plane back up.

Either making it configurable or raising the default would be helpful. If there's no specific reason for 30 seconds, I'd suggest making it 5 minutes to allow enough time in all kinds of resource-constrained situations. Some control planes run on VMs with non-guaranteed performance, so hitting this at 30 seconds is more than possible with a smaller cluster as well.

I'm not exactly sure which project is responsible for which parts of the RKE2 stack, so we may need to open another issue on the RKE2 side so that it doesn't fail to start when defrag doesn't complete in the expected time; it shouldn't be a hard requirement for startup.

hifi avatar Oct 21 '24 05:10 hifi

Thanks for the additional information about your environment. Can you provide any of the specifically requested information regarding performance of the nodes in question? While just changing or allowing configuration of the timeout is certainly an option, we'd also like to better understand what performance profile makes this necessary in the first place.

brandond avatar Oct 21 '24 16:10 brandond

What we'll probably do is move the defrag out from the etcd status check context deadline, so that the 30 second timeout does not affect the defrag and alarm clear operations. At that point we can evaluate if a timeout on the defrag is even necessary.

If anyone with a large datastore affected by this issue can provide the info requested at https://github.com/k3s-io/k3s/issues/11122#issuecomment-2423000976, that would be appreciated.

brandond avatar Oct 23 '24 20:10 brandond

Related Internal Requests: SURE-9222 SURE-9233

caroline-suse-rancher avatar Oct 23 '24 20:10 caroline-suse-rancher

Ah, sorry for not getting back faster. We replaced the nodes with even more powerful ones (though both do have NVMe drives) and got the times to around 17 seconds.

Thanks for landing a fix!

hifi avatar Oct 24 '24 12:10 hifi

Hey all, after several tries we couldn't reproduce the issue.

Followed approach:

  • used small machines
  • used dd to try to diminish the etcd size cap
  • used fio to try to add I/O overload before running defrag
  • added a lot of pods, ConfigMaps, and Secrets to the cluster
  • added a lot of data inside etcd
  • deleted a lot of data
  • repeated the two steps above
  • added more data
  • ran defrag

But the max defrag time we were able to reach was 7s.

Since the PR introduced unit tests covering these changes, we are going to close this.

PS: tested local snapshot, snapshot to S3, and cluster restore. All good!

fmoral2 avatar Nov 12 '24 11:11 fmoral2