Read Response Retry Regression

Open moonchen opened this issue 5 months ago • 0 comments

We noticed that some requests were generating HTTP 502 errors. The cause is that the origin server had closed an HTTP/1 keep-alive connection before ATS reused it. It happens as follows.

  1. An origin connection is released from the session pool.
  2. ATS buffers a request.
  3. Immediately after, ATS enters state_read_server_response_header.
  4. Somewhere between steps 1 and 5, the server closes the connection before the request arrives, probably because of a keep-alive timeout.
  5. ATS reads from the socket, and gets an EOS.
  6. The EOS is handled as follows: https://github.com/apache/trafficserver/blob/2e244e56839b0eb755196dce5cafcec096c6fe19/src/proxy/http/HttpSM.cc#L1942

Note that EOS falls through, and retries are disabled from this point onwards. This behavior was introduced in PR #9366 (Http2 to origin). The issue can be mitigated by reducing the keep-alive timeout in ATS so that it is lower than the origin's keep-alive timeout; ATS then closes idle origin connections before the origin does.

I think ATS should provide an option to retry a request if an invalid response has been received.
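To make the proposal concrete, here is a rough sketch of the kind of check such an option could gate. It is not ATS code: `ServerEvent`, `AttemptState`, and `should_retry_on_new_connection` are hypothetical names, and the conditions are my assumptions about when a retry would be safe (the connection came from the session pool, no response bytes were read before the EOS, the request is idempotent, and the attempt budget is not exhausted).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the decision; none of these names exist in ATS.
enum class ServerEvent { ReadReady, ReadComplete, Eos, Error, Timeout };

struct AttemptState {
  bool        reused_pooled_connection; // connection was taken from the session pool
  std::size_t response_bytes_read;      // response bytes received before the event
  bool        request_is_idempotent;    // assumption: only retry idempotent requests
  int         attempts_made;            // retries already performed for this request
  int         max_attempts;             // assumed configurable retry budget
};

// Returns true when an EOS looks like a stale keep-alive connection, i.e. the
// origin closed the connection before it could have processed the request.
bool should_retry_on_new_connection(ServerEvent ev, const AttemptState &s)
{
  if (ev != ServerEvent::Eos) {
    return false; // only the EOS case from this issue is considered here
  }
  if (!s.reused_pooled_connection) {
    return false; // a fresh connection closing is not the keep-alive race
  }
  if (s.response_bytes_read != 0) {
    return false; // a partial response means the origin saw the request
  }
  if (!s.request_is_idempotent) {
    return false; // replaying could repeat a side effect on the origin
  }
  return s.attempts_made < s.max_attempts;
}

// Usage example: the scenario described in this issue, then a partial response.
void example_usage()
{
  AttemptState stale{true, 0, true, 0, 1};
  assert(should_retry_on_new_connection(ServerEvent::Eos, stale));

  AttemptState partial{true, 128, true, 0, 1};
  assert(!should_retry_on_new_connection(ServerEvent::Eos, partial));
}
```

The zero-bytes condition is what separates this race from a genuinely invalid or truncated response; whether the option should also cover those cases is an open question.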

moonchen, Aug 05 '25 18:08