etcd-backup-restore

[BUG] Long validation duration for large ETCDs that get OOMKilled

Open shreyas-s-rao opened this issue 5 years ago • 1 comments

Describe the bug: When the etcd data is large (>1GB) and etcd gets OOMKilled, the backup sidecar takes a very long time (>5hr) to validate the data directory and start etcd again. During this period, readiness is set to true, so the apiserver tries to send traffic to etcd.

Expected behavior: Validation should not take that long, and the readiness probe should not report true while data validation is in progress.

How To Reproduce (as minimally and precisely as possible):

Logs:

Screenshots (if applicable):

Environment (please complete the following information):

  • Etcd version/commit ID : 3.3.17
  • Etcd-backup-restore version/commit ID: 0.8.0
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]: All

Anything else we need to know?:

shreyas-s-rao avatar Apr 20 '20 06:04 shreyas-s-rao

Unable to reproduce the long validation period locally. Sample etcd data of 5GB took 5 hours to validate in a canary cluster, but took less than a minute when I tried to reproduce the issue on a different cluster with similar or lower memory and CPU resources. The readiness probe issue was also not reproducible in this setup. Will keep this issue open for now and close it after 2 weeks if no further occurrences are observed.

shreyas-s-rao avatar May 04 '20 10:05 shreyas-s-rao

/close

abdasgupta avatar Jan 05 '23 10:01 abdasgupta

Closed since the issue has not recurred in the past 2 years.

shreyas-s-rao avatar Jan 05 '23 10:01 shreyas-s-rao