feat(healthcheck): Various healthcheck improvements

Open slowli opened this issue 2 years ago • 1 comments

What ❔

  • Adds HealthStatus::ShuttingDown, which is set immediately after a component receives a termination signal. This makes the /health endpoint conform to K8s readiness probe expectations.
  • Makes the slow / hard time limits for health checks configurable and decreases their default values.
  • Adds metrics for slow, timed-out and dropped health checks.
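
As a rough illustration of the first two points, here is a minimal std-only sketch, not the code from this PR. Only `HealthStatus::ShuttingDown` is named above; the other variants, `http_status_code`, `HealthCheckLimits`, `CheckOutcome` and `classify` are hypothetical names, and the 200 / 503 mapping is an assumption based on K8s HTTP probes treating status codes from 200 up to (but not including) 400 as success.

```rust
use std::time::Duration;

/// Hypothetical status enum. Only `ShuttingDown` is named in the PR
/// description; the other variants are placeholders for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HealthStatus {
    Ready,
    ShuttingDown,
    ShutDown,
}

impl HealthStatus {
    fn is_healthy(self) -> bool {
        matches!(self, Self::Ready)
    }
}

/// Assumed mapping: K8s HTTP probes treat 200..400 as success, so a
/// component that is shutting down should answer with an error code.
fn http_status_code(status: HealthStatus) -> u16 {
    if status.is_healthy() {
        200
    } else {
        503
    }
}

/// Hypothetical configuration for the two time limits.
#[derive(Debug, Clone, Copy)]
struct HealthCheckLimits {
    /// Checks taking at least this long are reported as slow.
    slow: Duration,
    /// Checks taking at least this long are treated as timed out.
    hard: Duration,
}

#[derive(Debug, PartialEq, Eq)]
enum CheckOutcome {
    Ok,
    Slow,
    TimedOut,
}

/// Classifies a finished (or abandoned) check by its elapsed time.
fn classify(elapsed: Duration, limits: HealthCheckLimits) -> CheckOutcome {
    if elapsed >= limits.hard {
        CheckOutcome::TimedOut
    } else if elapsed >= limits.slow {
        CheckOutcome::Slow
    } else {
        CheckOutcome::Ok
    }
}
```

Each non-`Ok` outcome would be a natural place to increment a metric; the concrete limit values are left to the caller here because the PR description does not state the new defaults.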

Why ❔

Improves healthcheck observability.

Checklist

  • [x] PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • [x] Tests for the changes have been added / updated.
  • [x] Documentation comments have been added / updated.
  • [x] Code has been formatted via zk fmt and zk lint.
  • [x] Spellcheck has been run via zk spellcheck.
  • [x] Linkcheck has been run via zk linkcheck.

slowli, Feb 21 '24 10:02

Just in case: I've checked that if /health is requested with a small client timeout (e.g., using curl -m ..) so that the request doesn't complete in time, axum drops the handling future together with the pending futures it depends on (in particular, CheckHealth::check_health() implementations). So the drop guard added in this PR will actually be triggered in this case.
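
The behavior described above can be simulated without axum. The following is a minimal std-only sketch, not the guard from this PR: `DropGuard`, `guarded_check`, `NoopWake` and `poll_once_then_drop` are hypothetical names, and a plain atomic counter stands in for the real metric. It polls a check once and then drops the future, roughly what happens to the handler when the client gives up.

```rust
use std::future::Future;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical guard: counts checks whose future was dropped before
/// running to completion.
struct DropGuard {
    dropped_checks: Arc<AtomicUsize>,
    completed: bool,
}

impl DropGuard {
    fn new(dropped_checks: Arc<AtomicUsize>) -> Self {
        Self { dropped_checks, completed: false }
    }

    /// Consumes the guard; `Drop` then sees `completed == true`.
    fn mark_completed(mut self) {
        self.completed = true;
    }
}

impl Drop for DropGuard {
    fn drop(&mut self) {
        if !self.completed {
            self.dropped_checks.fetch_add(1, Ordering::Relaxed);
        }
    }
}

/// Stand-in for a health check wrapped with the guard.
async fn guarded_check<F: Future<Output = ()>>(inner: F, dropped: Arc<AtomicUsize>) {
    let guard = DropGuard::new(dropped);
    inner.await;
    guard.mark_completed();
}

/// Waker that does nothing; enough for polling a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the future once and then drops it. Returns `true` if the
/// future completed on that single poll.
fn poll_once_then_drop<F: Future<Output = ()>>(fut: F) -> bool {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    matches!(fut.as_mut().poll(&mut cx), Poll::Ready(()))
}
```

Passing `std::future::pending::<()>()` as the inner check leaves the future pending after the first poll, so dropping it increments the counter; passing `std::future::ready(())` completes the check and leaves the counter unchanged.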

slowli, Feb 21 '24 11:02