Scheduled compaction is not started on one of the pods in an etcd cluster

What happened?

After an etcd upgrade, the logs of one pod in our 3-node cluster show that scheduled compaction was not started. The scheduled compaction log lines are printed by two of the pods, but one pod (pod-2) does not print them. As a result, the revision count is imbalanced across the pods.

What did you expect to happen?

The revision number should be the same on all pods; it should not be affected.

How can we reproduce it (as minimally and precisely as possible)?

It is rarely reproducible.

Anything else we need to know?

No response

Etcd version (please run commands below)

bash-4.4$ etcd --version
etcd Version: 3.3.11
Git SHA: 2cf9e51d2
Go Version: go1.10.7
Go OS/Arch: linux/amd64
bash-4.4$ etcdctl version
etcdctl version: 3.3.11
API version: 3.3

Etcd configuration (command line flags or environment variables)

VALID_PARAMETERS=valid
ETCD_INITIAL_CLUSTER_TOKEN=dced
ETCD_MAX_SNAPSHOTS=3
TZ=UTC
HOSTNAME=dced-0
COMPONENT_VERSION=v3.3.11
ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379
ETCD_HEARTBEAT_INTERVAL=100
ETCD_AUTO_COMPACTION_RETENTION=100
DISARM_ALARM_PEER_INTERVAL=6
ETCD_TRUSTED_CA_FILE=/data/combinedca/cacertbundle.pem
MONITOR_ALARM_INTERVAL=5
MS_SEC_KEY_MANAGEMENT_SERVICE_HOST=10.102.45.131
MS_SEC_KEY_MANAGEMENT_PORT_8200_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1
ETCDCTL_CERT=/run/sec/certs/client/clicert.pem
DEFRAGMENT_ENABLE=true
MS_DATA_DISTRIBUTED_COORDINATOR_ED_SERVICE_HOST=10.104.171.77
KUBERNETES_PORT=tcp://10.96.0.1:443
MS_DATA_DISTRIBUTED_COORDINATOR_ED_SERVICE_PORT=2379
PWD=/
ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380
HOME=/home/dced
MS_DATA_DISTRIBUTED_COORDINATOR_ED_SERVICE_PORT_CLIENT_PORT_TLS=2379
ETCD_AUTO_COMPACTION_MODE=revision
KUBERNETES_SERVICE_PORT_HTTPS=443
MS_DATA_DISTRIBUTED_COORDINATOR_ED_PORT_2379_TCP_ADDR=10.104.171.77
KUBERNETES_PORT_443_TCP_PORT=443
ETCD_DEBUG=false
MS_SEC_KEY_MANAGEMENT_SERVICE_PORT_HTTPS_KMS=8200
ETCD_CERT_FILE=/run/sec/certs/server/srvcert.pem
ETCD_FIFO_DIR=/fifo
ETCD_PEER_AUTO_TLS=true
MS_DATA_DISTRIBUTED_COORDINATOR_ED_PORT_2379_TCP_PORT=2379
KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443
MS_DATA_DISTRIBUTED_COORDINATOR_ED_PORT_2379_TCP=tcp://10.104.171.77:2379
DEFRAGMENT_PERIODIC_INTERVAL=60
COMPONENT=etcd
ETCD_DATA_DIR=/data
ETCD_LOG_PACKAGE_LEVELS=etcdserver=INFO,security=INFO
ETCD_CLIENT_CERT_AUTH=true
TERM=xterm
MS_SEC_KEY_MANAGEMENT_PORT=tcp://10.102.45.131:8200
ETCDCTL_ENDPOINTS=dced.zmorrah:2379
ETCD_METRICS=basic
ETCDCTL_API=3
MS_DATA_DISTRIBUTED_COORDINATOR_ED_PORT=tcp://10.104.171.77:2379
ETCD_SNAPSHOT_COUNT=5000
ETCD_MAX_WALS=3
SHLVL=1
MS_SEC_KEY_MANAGEMENT_PORT_8200_TCP_ADDR=10.102.45.131
KUBERNETES_SERVICE_PORT=443
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://dced-0.dced-peer.zmorrah.svc.cluster.local:2380
ETCD_KEY_FILE=/run/sec/certs/server/srvprivkey.pem
ETCD_ENABLE_V2=false
ETCD_ELECTION_TIMEOUT=1000
ETCDCTL_CACERT=/data/combinedca/cacertbundle.pem
ETCD_NAME=dced-0
ETCD_QUOTA_BACKEND_BYTES=268435456
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ETCD_ADVERTISE_CLIENT_URLS=https://dced-0.dced.zmorrah:2379
MS_SEC_KEY_MANAGEMENT_SERVICE_PORT=8200
KUBERNETES_SERVICE_HOST=10.96.0.1
FLAVOUR=etcd-v3.3.11-linux-amd64
MS_SEC_KEY_MANAGEMENT_PORT_8200_TCP=tcp://10.102.45.131:8200
MS_DATA_DISTRIBUTED_COORDINATOR_ED_PORT_2379_TCP_PROTO=tcp
MS_SEC_KEY_MANAGEMENT_PORT_8200_TCP_PORT=8200
ETCDCTL_KEY=/run/sec/certs/client/cliprivkey.pem
_=/usr/bin/env
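
For context on the two compaction settings in this configuration (ETCD_AUTO_COMPACTION_MODE=revision and ETCD_AUTO_COMPACTION_RETENTION=100): the etcd maintenance documentation describes revision mode as periodically compacting on "latest revision" minus the retention value. The arithmetic can be sketched as below. The function name `compaction_target` and the current revision of 30000 are assumptions made up for illustration; this is not an etcd command and it does not reproduce etcd's internal logic exactly.

```shell
#!/bin/sh
# Hypothetical illustration (not an etcd command): given the current
# revision and the retention value, print the revision that a
# revision-mode auto compaction would target. Prints nothing while the
# history is still shorter than the retention window.
compaction_target() {
    current=$1
    retention=$2
    target=$((current - retention))
    if [ "$target" -gt 0 ]; then
        printf '%s\n' "$target"
    fi
}

# Example with the retention from this configuration and an
# illustrative (assumed) current revision of 30000:
#   compaction_target 30000 100   # prints 29900
```

If a member never runs this step, its compacted revision would be expected to lag behind the other members, which is consistent with the imbalance described in this report.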

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
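
Since the report is about revisions differing between members, one option is to collect a revision per endpoint and check that they all agree. The sketch below is a hypothetical helper, `revisions_match`, which is not part of etcdctl. It reads lines of the form `<endpoint> <revision>` on stdin and succeeds only if every revision is identical. How those pairs are extracted from the etcdctl output is deliberately left out, because the available fields depend on the etcd version in use.

```shell
#!/bin/sh
# Hypothetical helper (not part of etcdctl): reads "<endpoint> <revision>"
# lines on stdin and returns 0 only if all revisions are equal.
# Empty input is treated as a failure, since there is nothing to compare.
revisions_match() {
    awk '
        NF < 2 { next }                        # skip blank or malformed lines
        !seen  { first = $2; seen = 1; next }  # remember the first revision
        $2 != first { mismatch = 1 }           # any different value is a mismatch
        END { if (seen && !mismatch) exit 0; exit 1 }
    '
}

# Usage with illustrative (made-up) values:
#   printf '%s\n' "dced-0 1200" "dced-1 1200" "dced-2 1100" | revisions_match \
#       || echo "revisions differ between members"
```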

Relevant log output

Please refer to the log lines containing "finished scheduled compaction". These lines appear only in the logs of pod 0 and pod 1. They do not appear in the logs of pod 2.
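
One way to make this comparison repeatable is to count the matching lines in each log. A minimal sketch, assuming the attached logs are saved locally as dced0.txt, dced1.txt and dced2.txt; `count_compactions` is a hypothetical helper name, and the search string is the message quoted in this report.

```shell
#!/bin/sh
# Hypothetical helper: prints "<file> <count>" for each log file given,
# where <count> is the number of lines containing the quoted message.
count_compactions() {
    for f in "$@"; do
        # grep -c prints the number of matching lines (0 if none);
        # "|| true" keeps the zero-match exit status from aborting under set -e.
        n=$(grep -cF "finished scheduled compaction" "$f" || true)
        printf '%s %s\n' "$f" "$n"
    done
}

# Usage (file names assumed from the attachments):
#   count_compactions dced0.txt dced1.txt dced2.txt
```

A count of 0 for one file alongside non-zero counts for the others would match the behaviour described here.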

dced0.txt dced1.txt dced2.txt

Apr 25 '22 16:04 rahulbapumore

Hi @ahrtr, could you please help us with this?

Thanks, Rahul

Apr 28 '22 05:04 rahulbapumore

3.3.* is end of life. Can you try this in 3.5.*?

Similar to this reply https://github.com/etcd-io/etcd/issues/13918#issuecomment-1096845232

Apr 28 '22 22:04 lavacat

Hi @lavacat, could you please confirm whether our issue and the issue you mentioned in #13918 are the same or similar?

Thanks, Rahul

Apr 29 '22 07:04 rahulbapumore

Hi @lavacat, could you please confirm the above comment?

Thanks

May 02 '22 05:05 rahulbapumore

I mentioned https://github.com/etcd-io/etcd/issues/13918#issuecomment-1096845232 only as an example of an end-of-life comment. The two issues aren't related. Maybe I should find a better reference in the docs.

May 02 '22 17:05 lavacat

Hi @lavacat, we have one more query. #11817 describes a fix made in etcd for a deadlock bug, and we suspect that our issue (compaction not running on one of the pods) is caused by the deadlock condition fixed in https://github.com/etcd-io/etcd/pull/11817. Could you confirm whether our issue is caused by the bug addressed in that PR?

Note: We have a 3-node etcd cluster running etcd 3.3.11. We have not seen this compaction issue on etcd 3.4.16.

Thanks, Rahul

May 03 '22 07:05 rahulbapumore

Hi @lavacat, any confirmation?

Thanks, Rahul

May 05 '22 08:05 rahulbapumore

Hi @lavacat, could you please confirm?

Thanks

May 06 '22 06:05 rahulbapumore

Hi @lavacat, any updates?

Thanks

May 10 '22 06:05 rahulbapumore

Hi @lavacat @ahrtr, any updates on the above query?

May 13 '22 05:05 rahulbapumore

Hi @lavacat @ahrtr, any updates on the above query?

May 24 '22 11:05 rahulbapumore

Hi @lavacat @ahrtr, any updates on the above query?

May 30 '22 08:05 rahulbapumore

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

Sep 21 '22 02:09 stale[bot]
