Read Response Retry Regression
We noticed that some requests were generating HTTP 502 errors. The cause is that the origin server closed an HTTP/1 keep-alive connection just before ATS reused it. It happens as follows:
1. An origin connection is released from the session pool.
2. ATS buffers a request.
3. Immediately after, ATS enters state_read_server_response_header.
4. Somewhere between steps 1 and 5, the server decides to close the connection, probably due to a keep-alive timeout, before the request arrives.
5. ATS reads from the socket and gets an EOS.
6. The EOS is handled as follows: https://github.com/apache/trafficserver/blob/2e244e56839b0eb755196dce5cafcec096c6fe19/src/proxy/http/HttpSM.cc#L1942
Note that EOS falls through, and retries are disabled from this point onwards. This behavior was introduced in PR #9366 (Http2 to origin). The issue can be mitigated by reducing the keep-alive timeout in ATS so that it is lower than the origin's keep-alive timeout.
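The timing argument behind this mitigation can be sketched as a small model. This is an illustration only, not ATS code: `IdlePolicy`, `origin_may_close_first`, and `is_stale_at` are hypothetical names, and the model ignores network latency and clock differences. On the ATS side the relevant value is likely the outbound keep-alive inactivity timeout (`proxy.config.http.keep_alive_no_activity_timeout_out`), but check that against your version's documentation.

```cpp
#include <chrono>

using std::chrono::seconds;

// Hypothetical pair of idle (keep-alive) timeouts, for illustration only.
struct IdlePolicy {
  seconds proxy_idle_timeout;  // how long the proxy keeps an idle origin connection pooled
  seconds origin_idle_timeout; // how long the origin keeps an idle connection open
};

// True when the origin can close an idle connection while the proxy still
// considers it reusable. This is the window in which the EOS race can occur.
constexpr bool origin_may_close_first(const IdlePolicy &p) {
  return p.origin_idle_timeout <= p.proxy_idle_timeout;
}

// True when a connection that has been idle for `idle` is still pooled on the
// proxy side but has already been timed out by the origin.
constexpr bool is_stale_at(const IdlePolicy &p, seconds idle) {
  return idle < p.proxy_idle_timeout && idle >= p.origin_idle_timeout;
}
```

With a proxy timeout of 120 s and an origin timeout of 60 s, a connection idle for 90 s is stale in this model. Lowering the proxy timeout below 60 s closes that window, apart from the latency the model leaves out.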
I think ATS should provide an option to retry a request if an invalid response has been received.
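As a rough sketch of what such an option could decide: retry only when the connection was reused from the pool, the origin closed it before sending any response bytes, the request is idempotent, and the retry budget is not exhausted. Everything below is hypothetical (`ResponseOutcome`, `RetryOptions`, `retry_on_empty_response`, `should_retry`). None of it is an existing ATS setting or API, and the idempotency check is my own assumption about what a safe default would be.

```cpp
// Hypothetical summary of what happened on the origin connection.
struct ResponseOutcome {
  bool reused_connection;  // connection came from the keep-alive pool
  bool eos_received;       // origin closed the connection
  long response_bytes;     // response bytes read before the close
  bool idempotent_request; // e.g. GET or HEAD, so sending it again is safe
};

// Hypothetical option block; not an existing ATS configuration.
struct RetryOptions {
  bool retry_on_empty_response = false; // off by default, preserving current behavior
  int  max_retries             = 1;
};

// Decide whether the request may be re-sent on a fresh connection.
inline bool should_retry(const ResponseOutcome &o, const RetryOptions &opt, int attempts_so_far) {
  if (!opt.retry_on_empty_response || attempts_so_far >= opt.max_retries) {
    return false;
  }
  // The origin closed the connection without sending a single response byte.
  const bool closed_before_response = o.eos_received && o.response_bytes == 0;
  return o.reused_connection && closed_before_response && o.idempotent_request;
}
```

Restricting the retry to reused connections with zero response bytes keeps it targeted at the stale keep-alive case described above, rather than retrying every malformed response.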