Healthchecks and autorestarts for computes
Problem description / Motivation
Branched off from https://github.com/neondatabase/cloud/issues/14114
At the moment, our only signal for compute unavailability comes from Kubernetes, specifically its container process monitoring.
We would like an end-to-end healthcheck that lets us detect problems such as:
- Postgres does not accept connections
- compute_ctl crashlooping
- Network partitioning
Feature idea(s) / DoD
We have a healthcheck mechanism that detects compute issues within 30s and takes appropriate action, such as restarting the compute.
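
One way to reason about the 30s target: if probes start every `interval_s` seconds, each probe times out after `timeout_s` seconds, and action is taken after `threshold` consecutive failures, then the worst-case detection time is roughly `threshold * interval_s + timeout_s`. The sketch below is only an illustration of that arithmetic; the class name, field names, and default values are assumptions and not part of any existing design.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Counts consecutive probe failures and decides when to restart.

    All names and defaults here are hypothetical.
    """

    interval_s: float = 5.0  # time between probe starts
    timeout_s: float = 2.0   # per-probe timeout
    threshold: int = 3       # consecutive failures before acting
    failures: int = 0

    def worst_case_detection_s(self) -> float:
        # Worst case: the compute breaks right after a successful probe.
        # `threshold` further probes must start, and the last one may take
        # the full timeout before it is counted as failed.
        return self.threshold * self.interval_s + self.timeout_s

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True if a restart should be triggered."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        return self.failures >= self.threshold


tracker = HealthTracker()
print(tracker.worst_case_detection_s())  # 17.0, within the 30s target
```

Requiring several consecutive failures trades detection speed for fewer spurious restarts; with these assumed defaults there is still headroom under 30s.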
Implementation ideas
We should have a piece of code inside the VM that responds to healthcheck requests.
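
A minimal sketch of such a responder, assuming an HTTP endpoint probed from outside the VM. The `/healthz` path, port 8080, and the TCP-level Postgres check on `127.0.0.1:5432` are all hypothetical choices for illustration; a real check might run a query or inspect compute_ctl state instead.

```python
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def postgres_accepts_connections(host="127.0.0.1", port=5432, timeout_s=1.0):
    """Return True if a TCP connection to Postgres can be opened.

    This only checks the TCP level; it does not authenticate or run a query.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def make_handler(check):
    """Build a request handler that reports the result of `check()`."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":  # hypothetical endpoint path
                self.send_error(404)
                return
            healthy = check()
            body = b"ok\n" if healthy else b"unhealthy\n"
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return HealthHandler


def serve(port=8080):  # hypothetical port
    """Run the responder until interrupted."""
    handler = make_handler(postgres_accepts_connections)
    server = ThreadingHTTPServer(("0.0.0.0", port), handler)
    server.serve_forever()
```

Calling `serve()` inside the VM would start the responder. Because the probe originates outside the VM, a missing or timed-out response would also surface network partitioning, not just a Postgres failure.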
Not sure we will need that or if Kubernetes is good enough. Putting in the backlog for now.
I wonder if we can add some generic health check mechanism as part of neondatabase/cloud#27103? cc @hlinnaka
This issue was moved to Jira: LKB-2137