etcd-backup-restore
[BUG] Long validation duration for large ETCDs that get OOMKilled
Describe the bug: When the etcd data is large (>1GB) and etcd gets OOMKilled, the backup sidecar takes a very long time (>5hr) to validate the data directory and start etcd again. During this period, readiness is set to true, so the apiserver tries to send traffic to etcd.
Expected behavior: Validation should not take that long, and readiness probe should not be set to true during data validation.
How To Reproduce (as minimally and precisely as possible):
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
- Etcd version/commit ID : 3.3.17
- Etcd-backup-restore version/commit ID: 0.8.0
- Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]: All
Anything else we need to know?:
Unable to reproduce the long validation period locally. A 5GB sample of etcd data took 5 hours to validate in a canary cluster, but less than a minute when trying to reproduce the issue on a different cluster with similar or fewer memory and CPU resources. The readiness probe issue was also not reproducible in that setup. Will keep this issue open for now and close it after 2 weeks if no further occurrences are observed.
/close
Closed since the issue did not re-occur in the past 2 years.