feat(capture): add explicit _readiness probe behavior tied to shutdown signal
Problem
We see bursts of closed connections in Envoy across deploy boundaries in capture deployments. The metric surfaced in Envoy is envoy_cluster_upstream_cx_destroy_remote_with_active_rq, which means our app is closing connections while Envoy is still awaiting a response to an in-flight request.
The likely cause is that capture does not flip the _readiness endpoint before tearing down the listener and triggering shutdown. In that scenario, Envoy keeps routing requests to pods that are on their way out during a deploy cutover, right up until the pod's listener is dropped.
Changes
- Implement a simple _readiness status change when k8s signals the pod to shut down (SIGTERM)
- Sleep for a configurable window before completing shutdown, so the k8s load balancer (and Envoy) have time to observe the change before the app stops listening and finishes serving in-flight requests; see the sketch below
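A minimal sketch of the intended flow, assuming a tokio/axum-style HTTP service; the handler, port, and drain-window values below are illustrative and not taken from the actual capture code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

use axum::{extract::State, http::StatusCode, routing::get, Router};
use tokio::signal::unix::{signal, SignalKind};

#[derive(Clone)]
struct Readiness(Arc<AtomicBool>);

// _readiness returns 200 while the flag is set, 503 once shutdown has begun.
async fn readiness(State(ready): State<Readiness>) -> StatusCode {
    if ready.0.load(Ordering::Relaxed) {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    }
}

#[tokio::main]
async fn main() {
    let ready = Readiness(Arc::new(AtomicBool::new(true)));

    let app = Router::new()
        .route("/_readiness", get(readiness))
        .with_state(ready.clone());

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();

    // On SIGTERM: flip readiness, then sleep for a drain window before letting
    // graceful shutdown proceed (in-flight requests keep being served meanwhile).
    let drain = Duration::from_secs(30); // would be read from config
    let shutdown = async move {
        let mut sigterm =
            signal(SignalKind::terminate()).expect("failed to install SIGTERM handler");
        sigterm.recv().await;
        ready.0.store(false, Ordering::Relaxed);
        tokio::time::sleep(drain).await;
    };

    axum::serve(listener, app)
        .with_graceful_shutdown(shutdown)
        .await
        .unwrap();
}
```

The important ordering is: flip the readiness flag first, keep serving in-flight requests through the drain window, and only then let graceful shutdown drop the listener, so Envoy and the k8s endpoint controller stop routing new requests before the socket goes away.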
Let's see what effect this has in mirror, and iterate from there. Handling readiness probes properly is good practice anyway.
Q: Why didn't you reuse the HealthRegistry for this?
I considered it, but I didn't want unrelated global changes in app health to bleed into the _readiness probe behavior until I've seen how this behaves on deploy cutovers specifically.
How did you test this code?
Locally and in CI; will smoke test in a mirror deploy and evaluate results prior to merge.
Changelog: (features only) Is this feature complete?
N/A