etcd: Add option to set quota-backend-bytes via the Helm values file
Is your feature request related to a problem?
The problem started to appear on one of our tenants: the etcd members began throwing errors like etcdhttp/metrics.go:79 /health error ALARM NOSPACE status-code 503, and the etcd nodes' health checks constantly failed. Consequently, etcd failed to start and the vcluster became unusable.
Which solution do you suggest?
On the vcluster etcd StatefulSet, add an option to set --quota-backend-bytes, and/or perhaps set a default value of 4294967296 (4 GiB) that can be overridden via a Helm config value, along with the two other flags --auto-compaction-mode=periodic and --auto-compaction-retention=30m.
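As a rough sketch of how this could look in a Helm values file (the etcd.extraArgs key is an assumption here; verify the exact key against your chart version):

```yaml
# Hypothetical values.yaml fragment for the vcluster k8s distro chart.
# The etcd.extraArgs key and its placement are assumptions, not confirmed.
etcd:
  extraArgs:
    - --quota-backend-bytes=4294967296   # 4 GiB backend quota
    - --auto-compaction-mode=periodic    # compact on a time interval
    - --auto-compaction-retention=30m    # keep 30 minutes of history
```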
Also, add documentation describing how to fix the issue. Here is what I did on our side:
Pause the cluster
vcluster pause -n vcluster-test1 vc1
Then scale the vc1-etcd StatefulSet back up
kubectl scale -n vcluster-test1 sts/vc1-etcd --replicas=3
Connect to etcd-0
kubectl -n vcluster-test1 exec -ti vc1-etcd-0 -- sh
Export the following
export ETCD_SRVNAME=vc1-etcd-0
NOTE: In each pod's shell, export
ETCD_SRVNAME with that pod's name (vc1-etcd-0, vc1-etcd-1, vc1-etcd-2)
Get the current revision number
etcdctl endpoint status --write-out json \
--endpoints=https://$ETCD_SRVNAME:2379 \
--cacert=/run/config/pki/etcd-ca.crt \
--key=/run/config/pki/etcd-peer.key \
--cert=/run/config/pki/etcd-peer.crt
Compact the database
etcdctl --command-timeout=600s compact <revision number> \
--endpoints=https://$ETCD_SRVNAME:2379 \
--cacert=/run/config/pki/etcd-ca.crt \
--key=/run/config/pki/etcd-peer.key \
--cert=/run/config/pki/etcd-peer.crt
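The revision number can also be pulled out of the JSON status output in the shell rather than copied by hand. A minimal sketch against an abbreviated sample of that output (sed is used since jq is typically absent in the etcd image; in practice, pipe the real command's output into the sed below):

```shell
#!/bin/sh
# Abbreviated sample of what `etcdctl endpoint status --write-out json`
# prints; the real output has more fields, only "revision" matters here.
status_json='[{"Endpoint":"https://vc1-etcd-0:2379","Status":{"header":{"cluster_id":1,"revision":123456,"raft_term":7}}}]'

# Extract the digits that follow the "revision": key.
revision=$(printf '%s' "$status_json" | sed 's/.*"revision":\([0-9]*\).*/\1/')
echo "revision=$revision"
```

The captured value can then be passed straight to the compact command, e.g. etcdctl --command-timeout=600s compact "$revision" with the same TLS flags as above.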
Run an etcd defrag
etcdctl --command-timeout=600s defrag \
--endpoints=https://$ETCD_SRVNAME:2379 \
--cacert=/run/config/pki/etcd-ca.crt \
--key=/run/config/pki/etcd-peer.key \
--cert=/run/config/pki/etcd-peer.crt
NOTE: Repeat the defrag step on each etcd member.
Confirm the disk usage has been reduced
etcdctl endpoint status -w table \
--endpoints=https://$ETCD_SRVNAME:2379 \
--cacert=/run/config/pki/etcd-ca.crt \
--key=/run/config/pki/etcd-peer.key \
--cert=/run/config/pki/etcd-peer.crt
Then remove the NOSPACE alarm
etcdctl alarm disarm \
--endpoints=https://$ETCD_SRVNAME:2379 \
--cacert=/run/config/pki/etcd-ca.crt \
--key=/run/config/pki/etcd-peer.key \
--cert=/run/config/pki/etcd-peer.crt
Now edit the vcluster etcd StatefulSet manually and add the new command args
--auto-compaction-mode=periodic
--auto-compaction-retention=30m
--quota-backend-bytes=8589934592
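For reference, the edited container spec would end up looking roughly like this (a sketch only; the container name and the existing command entries are assumptions, so keep whatever your StatefulSet already has and only append the three flags):

```yaml
# Fragment of the vc1-etcd StatefulSet after the manual edit (sketch).
spec:
  template:
    spec:
      containers:
        - name: etcd                # container name assumed
          command:
            - etcd
            # ...existing args unchanged...
            - --auto-compaction-mode=periodic
            - --auto-compaction-retention=30m
            - --quota-backend-bytes=8589934592   # 8 GiB
```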
Finally, resume cluster
vcluster resume -n vcluster-test1 vc1
Which alternative solutions exist?
None, unless you edit the StatefulSet manually and add the new command args
--auto-compaction-mode=periodic
--auto-compaction-retention=30m
--quota-backend-bytes=8589934592
Additional context
Current vcluster version: 0.11.1
Kubernetes: 1.23.7
vcluster distro: k8s HA
Never mind, I didn't realize that the etcd settings support extraArgs in the Helm values. Still, documenting --auto-compaction-mode, --auto-compaction-retention and --quota-backend-bytes would be a great help, as would adding the fix for the error etcdhttp/metrics.go:79 /health error ALARM NOSPACE status-code 503 to the troubleshooting section.
I don't have much expertise when it comes to tweaking etcd options, but if somebody can raise a PR for this issue and back the recommendations with reputable sources, then I can review the PR and help get it over the line. On that basis I'll add the "help-wanted" label. @iMikeG6, would you be interested in contributing a PR for this? You seem to know a lot about etcd. :)
I'm not an etcd expert; I simply googled and found some posts about similar issues. My hope is that this will help other people who face the same issue I had.