
9.2.x Unexpected connection errors when proxying larger request bodies

Open GUI opened this issue 2 years ago • 1 comments

After upgrading from 9.1.4 to 9.2.x, I've observed requests failing in ways they didn't in 9.1.4. I'm not certain what's happening, so it's difficult to give a summary, but it seems that Traffic Server 9.2.x closes connections to the clients in front of it too early in certain cases. My only theory is that it's related to larger request bodies (and maybe specific to PUT requests), and perhaps only when the request body is still being streamed after the origin server has generated an (expected) error. But again, I'm not really sure.

Here's a more detailed example of how this is pretty reproducible in all versions of 9.2.0-9.2.2, and also demonstrates how this didn't happen in 9.1.4. The basic reproducible case I've narrowed this down to looks like this:

[nginx proxy] => [trafficserver] => [nginx server]
  1. The nginx proxy layer does not have a maximum request body size.
  2. The underlying nginx server component is set up with a maximum request body size. If a client sends a request body that exceeds this size, nginx returns a 413 Request Entity Too Large error.

The basic issue I'm seeing is that if a client exceeds this request body size at the nginx server origin layer then Traffic Server 9.2+ seems to behave in unexpected ways:

  1. In Traffic Server 9.1.4: The nginx proxy layer (and client making the request) reliably receives the 413 Request Entity Too Large error that the origin nginx server layer generates and is proxied via TrafficServer.
  2. In Traffic Server 9.2.2: The nginx proxy layer receives the expected 413 error (from the nginx server origin) maybe 50% of the time, but the other 50% of the time the nginx proxy reports a 502 Bad Gateway error, which nginx generates due to an apparent communication error with Traffic Server. This seems to indicate that Traffic Server closes the connection from the nginx proxy too early, before the 413 error can be proxied back successfully.
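To put a number on the 413 versus 502 split described above, a small client loop can send the same oversized request repeatedly and tally the outcomes. This is only a minimal sketch and not the script from the reproduction repo: the function names, the URL, and the body size are all hypothetical placeholders that need to be adjusted to wherever the nginx proxy layer is listening.

```python
import http.client
from collections import Counter
from urllib.parse import urlsplit


def send_body(url, size_bytes, method="PUT", timeout=30):
    """Send one request with a body of size_bytes zero bytes.

    Returns the HTTP status code, or the string "connection-error" if the
    connection fails before a response can be read.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        # http.client sets Content-Length automatically for a bytes body.
        conn.request(method, parts.path or "/", body=b"\0" * size_bytes)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return "connection-error"
    finally:
        conn.close()


def tally(send, attempts):
    """Call send() `attempts` times and count each distinct outcome."""
    return Counter(send() for _ in range(attempts))


# Example usage (hypothetical URL and size, assumed for illustration only):
#   url = "http://127.0.0.1:8080/upload"
#   print(tally(lambda: send_body(url, 5 * 1024 * 1024), attempts=20))
```

Passing `method="POST"` to `send_body` gives a like-for-like comparison between PUT and POST with the same body size.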

Here is a repo that contains a minimal reproduction of this along with more detailed steps: https://github.com/GUI/trafficserver-debugging. The issue appears with the default Traffic Server configuration, so there's no custom Traffic Server configuration other than proxying to the underlying server. See the repo's README for exact steps to reproduce and more examples of the expected output in Traffic Server 9.1.4 versus the new, more erratic behavior in Traffic Server 9.2.2.

The short version is that Traffic Server 9.1.4 will always return the expected 413 Request Entity Too Large that is proxied from the underlying origin server, but when Traffic Server 9.2.x is in the middle, then it will randomly lead to nginx's connections to Traffic Server failing and the nginx proxy layer generates 502 Bad Gateway errors.

A few notes I've observed:

  • It happens more readily if the request body size is bigger (eg, more than a couple MBs).
  • Strangely, I can reproduce it reliably for PUT requests with a body, but not POST requests.
  • In tcpdumps, I've observed TCP RSTs under Traffic Server 9.2.x in these situations, whereas there don't appear to be any RSTs in 9.1.x.
  • I've been able to reproduce this in both 9.2.0 and 9.2.2, so it seems like it's related to some change between 9.1.4 and 9.2.0.
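For comparing the RST observation between versions, the text output of tcpdump can be summarized per sender. The following is a rough sketch, assuming the capture was printed with `tcpdump -n` so that each TCP line contains `IP <src> > <dst>: Flags [...]`; the function name `count_rsts` and the file name in the usage comment are made up for illustration.

```python
import re
from collections import Counter

# Matches the "IP src > dst: Flags [..]" part of a tcpdump text line.
# Assumes numeric output (tcpdump -n); other output formats may not match.
SEGMENT_RE = re.compile(r"IP6? (\S+) > (\S+): Flags \[([^\]]*)\]")


def count_rsts(lines):
    """Count TCP segments carrying the RST flag, keyed by source endpoint."""
    counts = Counter()
    for line in lines:
        match = SEGMENT_RE.search(line)
        if match and "R" in match.group(3):
            counts[match.group(1)] += 1
    return counts


# Example usage (hypothetical file name):
#   with open("capture-9.2.2.txt") as fh:
#       print(count_rsts(fh))
```

Running this over one capture per Traffic Server version shows which endpoint, if any, is sending the resets.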

Thanks!

GUI, Sep 11 '23 00:09

Since 9.2.3 is out, I gave this another test, and I can confirm I'm still seeing the same broken connection errors in 9.2.3 as in the other 9.2.x releases. I updated https://github.com/GUI/trafficserver-debugging with a 9.2.3 reproduction as well.

GUI, Nov 09 '23 17:11