Connection pools and remote closes

Open bsdphk opened this issue 3 months ago • 2 comments

We reuse connections in connection-pools in a Last-in-First-Out order.

When we VCP_Get() a handle and find that the remote close(s|ed) it, theory says that all other fds deeper in the pool will suffer the same fate, since they have been idle even longer.

In practice there is a delta from that ideal, depending on when the remote's write(2) returned relative to tcp-xmit-buffering, and on how long it took us to pull the data out of our tcp-recv-buffer before we recycled the handle.

When a recycled handle fails, it probably makes sense to prevent reuse of all handles deeper in the pool, and just leave them for the pool-waiter to reap.
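As a rough illustration of that idea, here is a toy model in Python rather than anything resembling the actual connection-pool code. The names `Handle`, `LifoPool`, `reusable`, `reap_only` and `report_remote_close` are all made up for this sketch; only the LIFO order and the "leave the deeper ones for the waiter" behaviour come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Handle:
    fd: int
    recycled_at: float  # time at which the handle was put back into the pool


@dataclass
class LifoPool:
    """Toy LIFO pool (hypothetical); the end of `reusable` is the top of the stack."""
    reusable: List[Handle] = field(default_factory=list)
    reap_only: List[Handle] = field(default_factory=list)  # left for the pool-waiter

    def recycle(self, h: Handle) -> None:
        self.reusable.append(h)

    def get(self) -> Optional[Handle]:
        # Hand out the most recently recycled handle first (Last-in-First-Out).
        return self.reusable.pop() if self.reusable else None

    def report_remote_close(self) -> None:
        # The handle just returned by get() turned out to be closed by the
        # remote. Everything still in `reusable` sits deeper in the pool and
        # has been idle longer, so stop handing it out and leave it to be reaped.
        self.reap_only.extend(self.reusable)
        self.reusable.clear()
```

After `report_remote_close()`, `get()` returns `None` until a new handle is recycled, so the caller would open a fresh connection instead of walking down a stack of probably dead ones.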

A more productive strategy might be to dynamically estimate, per pool, how long a handle can sit in the pool before the chance of successfully reusing it drops below an acceptable probability (95%? 99%? 99.9%?), but for that to work it is important not to "pollute" the input data to the estimator with any failure modes other than remote closed.
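One possible shape for such an estimator, purely as a sketch: bucket the observed idle times, count successful reuses and remote closes per bucket, and report the largest idle time for which every bucket up to it still meets the target. The class name, the bucket edges, the `min_samples` guard and the outcome strings are all assumptions made for this example; nothing like this exists in the code today.

```python
from bisect import bisect_left
from typing import List


class IdleAgeEstimator:
    """Hypothetical per-pool estimator of how long a handle may sit idle."""

    def __init__(self, bucket_edges: List[float], target: float = 0.99,
                 min_samples: int = 20) -> None:
        self.edges = sorted(bucket_edges)  # upper bounds of idle-time buckets (seconds)
        self.target = target               # acceptable reuse success probability
        self.min_samples = min_samples     # do not trust buckets with less data
        self.ok = [0] * len(self.edges)
        self.closed = [0] * len(self.edges)

    def _bucket(self, idle: float) -> int:
        # Idle times beyond the last edge are counted in the last bucket.
        return min(bisect_left(self.edges, idle), len(self.edges) - 1)

    def record(self, idle: float, outcome: str) -> None:
        i = self._bucket(idle)
        if outcome == "ok":
            self.ok[i] += 1
        elif outcome == "remote_closed":
            self.closed[i] += 1
        # Any other failure mode is deliberately ignored so that it does not
        # pollute the estimate.

    def max_idle(self) -> float:
        # Largest bucket edge such that this bucket and all shorter ones have
        # enough samples and a success rate at or above the target.
        limit = 0.0
        for i, edge in enumerate(self.edges):
            n = self.ok[i] + self.closed[i]
            if n < self.min_samples or self.ok[i] / n < self.target:
                break
            limit = edge
        return limit
```

A pool could then refuse to reuse any handle that has been idle longer than `max_idle()`. A real implementation would presumably also need to age out old samples, which this sketch does not attempt.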

bsdphk, Nov 14 '25 09:11

A common pitfall when this happens is that the backend is configured with a lower keep-alive setting than our backend_idle_timeout. The default for this parameter is 60 seconds. Sometimes both ends use 60 seconds and it becomes a race. A worst-case scenario is when you have a load balancer of some kind between your Varnish server and the backend. In such cases, the load balancer may use different timeouts in each direction. Once we reuse the connection and make a request to the backend, we may hit a void.

Maybe the default for this parameter is too high for most setups? Or maybe what you said makes more sense, and we should reuse connections based on an estimate of success.

It could also be possible to make the VCL probe code sometimes reuse a connection, to keep at least N fresh connections in the pool (a parameter of some kind could say what we deem a fresh connection). But this needs a strategy to bring the pool of connections down to zero, or we could simply agree to always keep N fresh connections to the backend.

asadsa92, Nov 14 '25 09:11

Also, this issue was the main motivation for me opening this PR: https://github.com/varnishcache/varnish-cache/pull/4007 It would be nice to see it merged some time soon 😄

asadsa92, Nov 14 '25 10:11