receive: recycle capnp peers on timeout
Detect Cap’n Proto fan-out timeouts, mark the receiver peer down, and force a reconnect so distributor pods self-heal instead of spinning on dead sockets.
- [x] I added CHANGELOG entry for this change.
- [ ] Change is not relevant to the end user.
Changes
- Mark Cap’n Proto peer connections unavailable on any forward error and close them to trigger a fresh dial.
- Teach the Cap’n Proto remote write client to reconnect after context deadline and send errors, including wrapped errors.
- Add focused regression tests for handler peer recycling and Cap’n Proto client reconnects.
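
The recycling behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `conn`, `dialFunc`, `recyclingClient`, `fakeConn`, `isDeadline`, and `runDemo` are hypothetical names, and the fake connection only simulates a half-open socket by returning a wrapped `context.DeadlineExceeded`. It assumes a single cached connection per peer and holds a mutex for the duration of each send, which a real client would likely avoid.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for a cached Cap'n Proto connection.
type conn interface {
	Send(ctx context.Context, payload []byte) error
	Close() error
}

// dialFunc is a hypothetical dialer that returns a fresh connection.
type dialFunc func(ctx context.Context) (conn, error)

// recyclingClient caches one connection and drops it on any send error,
// so the next call dials again instead of reusing a dead socket.
type recyclingClient struct {
	mu        sync.Mutex
	dial      dialFunc
	cached    conn
	available bool
}

func newRecyclingClient(dial dialFunc) *recyclingClient {
	return &recyclingClient{dial: dial, available: true}
}

// Send forwards the payload. On any non-nil error it marks the peer
// unavailable and closes the cached connection to force a redial.
func (c *recyclingClient) Send(ctx context.Context, payload []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.cached == nil {
		cn, err := c.dial(ctx)
		if err != nil {
			c.available = false
			return fmt.Errorf("dial peer: %w", err)
		}
		c.cached = cn
	}
	if err := c.cached.Send(ctx, payload); err != nil {
		_ = c.cached.Close() // best effort; the socket may already be dead
		c.cached = nil
		c.available = false
		return fmt.Errorf("forward to peer: %w", err)
	}
	c.available = true
	return nil
}

// isDeadline reports whether err is, or wraps, a context deadline error.
func isDeadline(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// fakeConn simulates a connection whose sends always return err.
type fakeConn struct {
	err    error
	closed bool
}

func (f *fakeConn) Send(context.Context, []byte) error { return f.err }
func (f *fakeConn) Close() error                       { f.closed = true; return nil }

type demoResult struct {
	dials               int
	firstErr, secondErr error
	staleClosed         bool
	availableAfter      bool
}

// runDemo sends twice: the first send hits a simulated stale connection,
// the second should go out over a freshly dialed one.
func runDemo() demoResult {
	stale := &fakeConn{err: fmt.Errorf("rpc: bootstrap: send message: %w", context.DeadlineExceeded)}
	conns := []*fakeConn{stale, {}}
	var res demoResult
	c := newRecyclingClient(func(context.Context) (conn, error) {
		cn := conns[res.dials]
		res.dials++
		return cn, nil
	})
	res.firstErr = c.Send(context.Background(), []byte("a"))
	res.secondErr = c.Send(context.Background(), []byte("b"))
	res.staleClosed = stale.closed
	res.availableAfter = c.available
	return res
}

func main() {
	r := runDemo()
	fmt.Println("dials:", r.dials, "first error is deadline:", isDeadline(r.firstErr))
	fmt.Println("second error:", r.secondErr, "stale closed:", r.staleClosed)
}
```

Using `errors.Is` rather than comparing error values directly is what lets the check see through wrapping layers such as the `fmt.Errorf("...: %w", err)` calls in the sketch.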
Verification
go test ./pkg/receive -run TestSendRemoteWriteMarksPeerUnavailableOnAnyError
go test ./pkg/receive/writecapnp -run TestRemoteWriteClientReconnectsOnDeadline
https://github.com/thanos-io/thanos/pull/8491 could you help with reviewing this? Seems like 8491 has a better way of detecting issues.
@GiedriusS Thanks for taking a look! In the production fleet that motivated this change, once a receiver pod restarted, the distributor’s existing Cap’n Proto stream turned into a half-open socket. Every forward attempt then hit the 5 s fan-out deadline with rpc: bootstrap: send message, and the worker stayed stuck on that same connection until the distributor process was bounced manually. PR #8491 tightens things at dial time: if bootstrap fails immediately, it closes the socket. The failures observed here occur later: bootstrap had already succeeded, the receiver disappeared, and the next write timed out. In that case connect never runs again, so #8491 doesn’t fire and the stale connection is left in place. This patch handles that follow-on state by marking the peer down and closing the cached worker on any non-nil send error, and by teaching the client to treat those deadline errors as reconnect-worthy. The two changes are complementary: #8491 rejects broken sockets up front, while this one recovers when a previously healthy connection goes bad.
Thank you for your PR. I rewrote it a bit differently and will open up a PR soon. I also added a test.