
Connection errors and 503 response with slow clients, POST and http keep-alive - backend_idle_timeout not working as expected

Open thetuxkeeper opened this issue 3 years ago • 3 comments

Expected Behavior

Successful backend request

Current Behavior

It seems Varnish selects a backend connection and cancels backend_idle_timeout before it has sent any part of the request to the backend. That makes the timeouts really hard to synchronize and hard to debug.

You'll see something like this (varnishlog -q 'FetchError eq "HTC eof (-1)"' -i 'Backend*,FetchError,Timestamp,BereqMethod'):

*   << BeReq    >> 440802610 
-   Timestamp      Start: 1624437492.934397 0.000000 0.000000
-   BereqMethod    POST
-   BackendOpen    508 api01 x.x.x.183 8000 x.x.x.40 53832 reuse
-   Timestamp      Bereq: 1624437505.087841 12.153444 12.153444
-   FetchError     HTC eof (-1)
-   BackendClose   508 api01 close
-   Timestamp      Beresp: 1624437505.087862 12.153465 0.000020
-   Timestamp      Error: 1624437505.087864 12.153466 0.000001

(Screenshot: Screenshot_20210623_121820)

As you can see, the backend tries to close the connection (its keep-alive timeout is 5s), but Varnish doesn't process the FIN and reuses the connection much later, which results in a failed request (here a 503, "backend fetch failed").

Possible Solution

A possible workaround would be adjusting the keep-alive timeouts on the backend and in Varnish with a big enough buffer between them to minimize the chance of running into this issue.
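For reference, this is roughly what that workaround looks like. The values are examples assuming a backend keep-alive timeout of 5s, not recommendations:

```shell
# Lower Varnish's idle timeout for pooled backend connections so Varnish
# gives up on a connection well before the backend's 5s keep-alive
# timeout can close it underneath us (4s leaves a 1s buffer):
varnishadm param.set backend_idle_timeout 4

# Or persistently, as a varnishd startup parameter:
#   varnishd ... -p backend_idle_timeout=4
```

As the Context section below shows, this only shrinks the race window; it does not close it.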

But nevertheless, Varnish shouldn't select a connection and then let it sit idle until the client is done sending. Varnish should select the connection only when it is ready to send at least something (the headers?), so the backend knows there is an active client and can reset its timeouts. Once anything has been sent, between_bytes_timeout applies and can be tuned to match the backend.

Steps to Reproduce (for bugs)

  1. a slow client or slow POST request — at least the body has to arrive slowly
  2. a backend whose HTTP keep-alive timeout is lower than the client's send time

Context

We are debugging 503 responses and want to fix them. Since Apache, Node.js and perhaps others use a default HTTP keep-alive timeout of 5s while Varnish uses 60s, we lowered backend_idle_timeout to 4s. With a higher timeout we get the exact opposite race condition: Node.js is blocked, its keep-alive timeout fires in the meantime and it closes the connection, but Varnish has already sent the request. It looks exactly the same in the logs (except in tcpdumps), just the other way around. So we are cornered by timeout race conditions ... :)

Your Environment

  • Version used: 6.5.1
  • Operating System and version: debian 10
  • Source of binary packages used: official varnish docker container

thetuxkeeper avatar Jun 23 '21 12:06 thetuxkeeper

Though not directly related, I would also like to point out std.late_100_continue(), which controls when we motivate the client to send the request body.
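For completeness, a sketch of how that looks in VCL (assuming VCL 4.1 and a client that sends "Expect: 100-continue"; the backend address is a placeholder):

```vcl
vcl 4.1;
import std;

backend default { .host = "127.0.0.1"; .port = "8000"; }

sub vcl_recv {
    # Delay the "100 Continue" response until the request body is
    # actually needed, so the client holds off streaming the body.
    std.late_100_continue(true);
}
```

This only helps with clients that use the Expect mechanism; a client that simply trickles its body slowly is unaffected.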

nigoroll avatar Jun 28 '21 13:06 nigoroll

I believe this is the result of too aggressive write vector caching before the first writev() system call against the backend. While piecing together the writes that make up the backend request, the buffer space for vectors isn't exhausted, so no write() call is actually made. Then we see that there is a request body to be sent, which first needs to be read from the client. And if the client takes a long time to send the request body, it is essentially the client read timeout, not the backend idle timeout, that is in effect.

An explicit V1L_Flush() before fetching a request body, when one is present, would make the writev() happen right after the connection is selected for reuse.

mbgrydeland avatar Jun 28 '21 13:06 mbgrydeland

Hi, what's the current status of this issue? I just ran into something similar. Is there any ETA on a fix?

dcepelik avatar Mar 24 '22 14:03 dcepelik