feat(capture): add explicit _readiness probe behavior tied to shutdown signal
Problem
We see bursts of closed connections in Envoy across deploy boundaries in capture deployments. The metric surfaced in Envoy is envoy_cluster_upstream_cx_destroy_remote_with_active_rq, which means our app is closing connections while Envoy is still awaiting a response to an in-flight request.
The likely cause is that capture does not flip the _readiness endpoint before tearing down the listener and triggering shutdown. In that scenario, Envoy keeps routing requests to pods that are on their way out during a deploy cutover, right up until the pod's listener is dropped.
Changes
- Implement a simple _readiness status change when k8s signals the pod to shut down (SIGTERM)
- Sleep for a configurable window before completing shutdown, so the k8s load balancer (and Envoy) have time to observe the change before the app stops listening and finishes serving in-flight requests; see the sketch below
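A minimal sketch of the intended flow, assuming a tokio/axum-style HTTP service; the handler, port, and drain-window values below are illustrative and not taken from the actual capture code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

use axum::{extract::State, http::StatusCode, routing::get, Router};
use tokio::signal::unix::{signal, SignalKind};

#[derive(Clone)]
struct Readiness(Arc<AtomicBool>);

// _readiness returns 200 while the flag is set, 503 once shutdown has begun.
async fn readiness(State(ready): State<Readiness>) -> StatusCode {
    if ready.0.load(Ordering::Relaxed) {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    }
}

#[tokio::main]
async fn main() {
    let ready = Readiness(Arc::new(AtomicBool::new(true)));

    let app = Router::new()
        .route("/_readiness", get(readiness))
        .with_state(ready.clone());

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();

    // On SIGTERM: flip readiness, then sleep for a drain window before letting
    // graceful shutdown proceed (in-flight requests keep being served meanwhile).
    let drain = Duration::from_secs(30); // would be read from config
    let shutdown = async move {
        let mut sigterm =
            signal(SignalKind::terminate()).expect("failed to install SIGTERM handler");
        sigterm.recv().await;
        ready.0.store(false, Ordering::Relaxed);
        tokio::time::sleep(drain).await;
    };

    axum::serve(listener, app)
        .with_graceful_shutdown(shutdown)
        .await
        .unwrap();
}
```

The important ordering is: flip the readiness flag first, keep serving in-flight requests through the drain window, and only then let graceful shutdown drop the listener, so Envoy and the k8s endpoint controller stop routing new requests before the socket goes away.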
Let's see what effect this has in mirror, and iterate from there. Handling readiness probes properly is good practice anyway.
Q: Why didn't you reuse the HealthRegistry for this?
I considered it, but I didn't want unrelated global changes in app health to bleed into the _readiness probe behavior until I've seen how this behaves on deploy cutovers specifically.
How did you test this code?
Locally and in CI; will smoke test in a mirror deploy and evaluate results prior to merge.
Changelog: (features only) Is this feature complete?
N/A