nginx + sanic problems

Open oren-nonamesecurity opened this issue 1 year ago • 10 comments

Describe the bug

Hi, I'm using nginx as a reverse proxy to handle TLS and route to multiple applications on the same server. My Sanic app is the main route and handles most of the traffic. The traffic is pretty high (~5000 requests per second, on 16 workers) and I get random errors in the nginx log every few minutes. The errors are of these three types:

2022/08/18 12:05:20 [error] 42#42: *34037 writev() failed (32: Broken pipe) while sending request to upstream, client: 10.100.13.56, server: , request: "POST /engine HTTP/1.1", upstream: "http://127.0.0.1:8080/", host: "10.100.13.154"
2022/08/18 12:05:40 [error] 42#42: *35312 upstream prematurely closed connection while reading response header from upstream, client: 10.100.13.82, server: , request: "POST /engine HTTP/1.1", upstream: "http://127.0.0.1:8080/", host: "10.100.13.154"
2022/08/18 12:10:52 [error] 40#40: *55659 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.100.13.66, server: , request: "POST /engine HTTP/1.1", upstream: "http://127.0.0.1:8080/", host: "10.100.13.154"

On the Sanic side there's nothing in the logs. access_log=True and debug=True didn't show any errors or anything suspicious, and verbosity=5 also showed nothing relevant.

Going deeper, I started tshark to see what actually happens, and I saw that the Sanic side sent a TCP RST in the middle of a POST request being sent to it.

I tried setting SANIC_NOISY_EXCEPTIONS and increasing SANIC_KEEP_ALIVE_TIMEOUT and REQUEST_BUFFER_SIZE, but nothing helps.
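(For reference, these settings can be applied either as SANIC_-prefixed environment variables or directly on app.config; a minimal sketch with illustrative values, not the exact configuration used here:)

from sanic import Sanic

app = Sanic("engine")

# Illustrative values only
app.config.NOISY_EXCEPTIONS = True       # log exceptions that Sanic normally silences
app.config.KEEP_ALIVE_TIMEOUT = 15       # seconds an idle keep-alive connection stays open
app.config.REQUEST_BUFFER_SIZE = 131072  # request buffer size in bytes (default is 65536)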

I saw https://github.com/sanic-org/sanic/issues/991, which is quite old and has similar errors, but I'm using the latest version and still see them.

Any idea what can cause it? Or how to fix it?

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Sanic Version: 22.6.2 (Running inside docker using FROM python:3.8-slim-buster)
  • Docker version: 20.10.16, running containers with host network

oren-nonamesecurity avatar Aug 18 '22 13:08 oren-nonamesecurity

I'd guess it's likely a timeout issue. What are your timeouts?

ahopkins avatar Aug 18 '22 15:08 ahopkins

@oren-nonamesecurity Can you see from the packet capture at which point of the connection this occurs? Has Sanic already handled some request(s) on that connection, or does it reset before responding to anything? If the packet timing allows, determine more precisely when this occurs (was the connection idle before, did a request go without a response, or did Sanic send the RST right after sending a response, etc.). That should narrow down the bug hunt.

I assume you are not using HTTPS on the internal connection to Sanic; that could cause TLS handshake errors that are not logged by Sanic (a known problem).

Tronic avatar Aug 18 '22 17:08 Tronic

I started with the default timeouts, and then increased the keep-alive timeout to 15 seconds.

From the PCAP it looks like the connection was active for a while (at least 5 previous request-response pairs), and the problem happens during a new POST request (a bit bigger than 1 MB). The RST is the first response from the Sanic side to this request; there is no other response before it. The client sent 7 TCP payloads of 65551 bytes before Sanic responded with the RST, which made me suspect buffer sizes or some system limit.

The app itself always responds 200 OK with "OK" as the content; nothing complicated in the response.

And you're right, the nginx <-> Sanic connection is plain HTTP.

oren-nonamesecurity avatar Aug 20 '22 15:08 oren-nonamesecurity

@Tronic @ahopkins

I sniffed another PCAP that shows a more concrete flow:

[screenshot of the packet capture]

The client (nginx) sends a long POST and gets 200 OK with "Connection: keep-alive". About 5 seconds later, the server (Sanic) sends FIN+ACK while the client (nginx) is sending a new POST request (packets #31781 and #31955). Then the client keeps sending more data and the server responds with RST multiple times.

This happens even when I strip the entire Sanic app down to a single endpoint that does nothing but return a response.
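(The stripped-down app is roughly this shape; a minimal reproduction sketch, not the exact code:)

from sanic import Sanic, response

app = Sanic("engine")

@app.post("/engine")
async def engine(request):
    # the request body is read and discarded; the handler just acknowledges
    return response.text("OK")

if __name__ == "__main__":
    # illustrative: 16 workers behind nginx, listening on 127.0.0.1:8080
    app.run(host="127.0.0.1", port=8080, workers=16)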

Any idea what happens here?

oren-nonamesecurity avatar Aug 21 '22 12:08 oren-nonamesecurity

@oren-nonamesecurity It certainly looks like the idle connection is timing out after five seconds and Sanic is closing it at packet 31781 (no logging, because this is a normal operation). The default SANIC_KEEP_ALIVE_TIMEOUT is five seconds, so double check that your setting is actually reaching the Sanic server if you tried another value.

After that, nginx is expected to disconnect, but I don't know why it instead sends another request 0.6 seconds later. Technically it can, because its side of the socket is still open, but since Sanic is no longer reading, it only leads to the resets you are seeing.

@ahopkins This is a subtle detail, but we should probably make Sanic explicitly shut down the socket in both directions when it no longer intends to read or write anything:

sock.shutdown(socket.SHUT_RDWR)  # signal EOF for both reading and writing
sock.close()

Tronic avatar Aug 23 '22 08:08 Tronic

The FIN from Sanic and the extra POST from nginx happen almost simultaneously (6 ms apart). Maybe the shutdown is what's missing here.

For now I increased the Sanic keep-alive timeout to be larger than the nginx keep-alive timeout, and that prevents these errors because nginx now closes the connection from its side first.
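(A minimal sketch of the Sanic side of that arrangement, with an illustrative value; the nginx side only needs its upstream keep-alive timeout to stay below this:)

from sanic import Sanic

app = Sanic("engine")

# Keep idle connections to nginx open longer than nginx keeps them,
# so nginx always closes first and never writes into a half-closed connection.
app.config.KEEP_ALIVE_TIMEOUT = 75  # seconds; illustrative, just needs to exceed nginx's timeout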

oren-nonamesecurity avatar Aug 23 '22 08:08 oren-nonamesecurity

> The FIN from Sanic and the extra POST from nginx happen almost simultaneously (6 ms apart). Maybe the shutdown is what's missing here.

Oops, I misread the capture, although ~7 ms on localhost is still plenty of time. It would be interesting to see whether the shutdown makes a difference, if you can easily reproduce this issue. It might be a one-liner addition to https://github.com/sanic-org/sanic/blob/main/sanic/server/protocols/base_protocol.py#L90, directly accessing the asyncio transport's socket. The actual PR would need more careful thought and review even if that works.

Tronic avatar Aug 23 '22 08:08 Tronic

I have not looked too deeply into the above (sorry), but I did a quick experiment:

            sock: Optional[socket] = self.transport.get_extra_info("socket")
            if sock is not None:
                sock.shutdown(SHUT_RDWR)
Traceback (most recent call last):
  File "/home/adam/Projects/Sanic/sanic/sanic/server/protocols/http_protocol.py", line 169, in connection_task
    self.close()
  File "/home/adam/Projects/Sanic/sanic/sanic/server/protocols/base_protocol.py", line 94, in close
    sock.shutdown(SHUT_RDWR)
  File "uvloop/pseudosock.pyx", line 155, in uvloop.loop.PseudoSocket.shutdown
  File "uvloop/pseudosock.pyx", line 19, in uvloop.loop.PseudoSocket._na
TypeError: transport sockets do not support shutdown() method

With uvloop, this is not the approach.

ahopkins avatar Aug 23 '22 10:08 ahopkins

There is an open issue for this (with a hacky workaround too): https://github.com/MagicStack/uvloop/issues/241
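One hacky route along those lines might look roughly like this (a sketch assuming uvloop's pseudo-socket still exposes a usable fileno(); not copied from that issue):

import os
import socket

def shutdown_transport(transport):
    # Best-effort SHUT_RDWR on a transport whose "socket" may be uvloop's PseudoSocket.
    raw = transport.get_extra_info("socket")
    if raw is None:
        return
    try:
        raw.shutdown(socket.SHUT_RDWR)  # works on the default asyncio event loop
    except TypeError:
        # uvloop's PseudoSocket raises TypeError for shutdown(); dup the fd and wrap it
        # in a real socket object so closing the wrapper does not close the transport's fd.
        real = socket.socket(fileno=os.dup(raw.fileno()))
        try:
            real.shutdown(socket.SHUT_RDWR)
        except OSError:
            pass  # peer already gone
        finally:
            real.close()
    except OSError:
        pass  # connection already reset or closed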

Tronic avatar Aug 23 '22 10:08 Tronic

Nice find.

ahopkins avatar Aug 23 '22 11:08 ahopkins