
Healthchecks and autorestarts for computes

Open olegbbtr opened this issue 1 year ago • 3 comments

Problem description / Motivation

Branched off from https://github.com/neondatabase/cloud/issues/14114

At the moment, the only signal we have for compute unavailability comes from Kubernetes, specifically its container process monitoring.

We would like to have an end-to-end healthcheck, which would allow us to detect problems, such as:

  1. Postgres does not accept connections
  2. compute_ctl crashlooping
  3. Network partitioning

Feature idea(s) / DoD

We have a healthcheck mechanism that lets us detect compute issues in under 30s and take appropriate action, such as restarting the compute.

Implementation ideas

We should have a piece of code inside the VM that responds to a healthcheck.

olegbbtr avatar Sep 20 '24 14:09 olegbbtr

Not sure we will need that or if Kubernetes is good enough. Putting in the backlog for now.

stradig avatar Sep 23 '24 15:09 stradig

I wonder if we can add some generic health check mechanism as part of neondatabase/cloud#27103? cc @hlinnaka

sharnoff avatar Apr 17 '25 10:04 sharnoff

This issue was moved to Jira: LKB-2137

zenithdb-bot-dev[bot] avatar Jul 21 '25 12:07 zenithdb-bot-dev[bot]