
A slow request whose body is never read causes ConnectionResetError on large responses if keep-alive is disabled, even if the worker timeout is disabled, in sync, async and threaded workers

Open michalc opened this issue 11 months ago • 12 comments

[!NOTE]
Earlier versions of this report limited the bug to GET requests, to requests whose body was empty, and to transfer-encoding: chunked requests, but those limitations were inaccurate. The issue happens on both GET and POST requests, when the request body is non-empty, and for both chunked and non-chunked request bodies. Earlier versions also did not make clear that the error happens even if the worker timeout is disabled, and covered just sync workers, whereas the issue happens in all workers except tornado, wherever keep-alive is not possible or is disabled

If I make a large empty file:

mkfile -n 1g temp_1GB_file
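
(mkfile is macOS-specific; on Linux an equivalent sparse file can be created with coreutils' truncate:)

truncate -s 1G temp_1GB_file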

A basic gunicorn+Flask server to serve it (without reading the http request body) in test.py:

from flask import Flask

app = Flask(__name__, static_folder='.')

@app.route('/<path:path>', methods=['GET', 'POST'])
def static_proxy(path):
    return app.send_static_file(path)

Run with the worker timeout disabled (to exclude it as the cause), on any of the sync, gevent, eventlet or gthread worker classes, as long as keep-alive is not possible or is disabled:

gunicorn --worker-class sync --workers=1 --timeout=0 --keep-alive=0 test:app

And then some code to make a GET or POST request for the large file through the webserver, in a way that forces the request to be slow: specifically, slow both to finish sending the request body and to fetch the response:

import urllib3
import time

def data():
    time.sleep(4)  # be slow to finish sending the request body
    yield b'Anything'

resp = urllib3.request("POST", "http://localhost:8000/temp_1GB_file",
    body=data(),
    preload_content=False,  # stream the response rather than buffering it
)
for chunk in resp.stream(65536):
    time.sleep(0.001)  # be slow to read the response

Then almost always I get the error on the client:

urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))

And no error on the server at all.

Interestingly:

  • if I remove the sleeps in the request code, the error doesn't occur
  • or if I use Flask's built-in webserver via flask --app test run --port 8080 instead of gunicorn, the error doesn't occur
  • or if I use uwsgi via uwsgi --http :8000 --wsgi-file test.py --callable app, the error doesn't occur
  • or if I re-enable keep-alive on the gevent, eventlet or gthread workers, the error doesn't occur
  • or if I access the bytes of the request body on the server before returning from the handler, say from print(request.data), then the error doesn't occur
  • or if I reduce the amount of data returned from the server, the error does not occur
  • or if I use the tornado worker class, the error does not occur

And:

  • if I send the request body without chunked encoding, i.e. by passing headers={'content-length': '8'} in the above case (sketched below), the issue still occurs
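
That non-chunked variant, as a sketch (the 8 matches the length of b'Anything'; an explicit content-length makes urllib3 skip chunked transfer encoding):

import urllib3
import time

def data():
    time.sleep(4)  # still slow to finish sending the body
    yield b'Anything'  # exactly 8 bytes, matching the declared content-length

resp = urllib3.request("POST", "http://localhost:8000/temp_1GB_file",
    body=data(),
    headers={'content-length': '8'},  # disables chunked transfer encoding
    preload_content=False,
)
for chunk in resp.stream(65536):
    time.sleep(0.001)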

I'm on gunicorn 23.0.0 on macOS (but I suspect this also happens on Linux in Docker, since I'm chasing down a production issue)

Running Wireshark on the connection I do see an RST from server to client just before the error, and nothing obviously unusual before (although not sure if I would recognise much)

michalc avatar Dec 30 '24 20:12 michalc

there is something specific to the gunicorn sync worker

Correct. The default --timeout=30 puts a hard limit on continuous sends in the sync worker. The worker is treated as misbehaving and aborted if unresponsive for that time. Whereas the async workers stay alive even while busy.

N.B. you tested the socket.sendfile() case, which does make surprise-free timeout enforcement more complex, but sendfile did not seem necessary here: your reproducer results in the same behaviour from the client's perspective even if --no-sendfile is used.
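
For reference, the same reproducer command with sendfile explicitly disabled:

gunicorn --worker-class sync --workers=1 --timeout=0 --keep-alive=0 --no-sendfile test:app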


Previously reported as https://github.com/benoitc/gunicorn/issues/1736

pajod avatar Dec 31 '24 14:12 pajod

Ah @pajod thanks for your reply.

My suspicion is that something other than the default 30 second worker timeout is at play however. I can force it to hit the timeout, say by making it really short:

gunicorn --workers=1 --timeout 10 test:app

and then I see errors on the server side that say [CRITICAL] WORKER TIMEOUT for example.

But then if I set the worker timeout to be really long

gunicorn --workers=1 --timeout 3600 test:app

or disabled:

gunicorn --workers=1 --timeout 0 test:app

then within around 20 to 32 seconds of starting the request, I get the ConnectionResetError in the client, and just silence in the logs on the server, and specifically no error saying [CRITICAL] WORKER TIMEOUT

michalc avatar Dec 31 '24 14:12 michalc

(I've now amended the bug description to include having the worker timeout disabled)

michalc avatar Dec 31 '24 14:12 michalc

something other than the default 30 second worker timeout is at play

I see. The worker timeout was merely happening as well. I can see the same thing even when the timeout does not kick in. The last ~50 chunks of queued writes are not seen by the slow-reading client, unless I make the server sleep(1) before closing the socket.

pajod avatar Dec 31 '24 16:12 pajod

Running Wireshark on the connection I do see an RST from server to client just before the error

So this is rolling around my head. From some googling, it looks like if a socket is closed while there is data in its read buffer, it will then send a RST to the other end. Could this be what the sync worker is doing?

If so, could probably come up with an even more minimal example, without using a large file at all and just small amounts of data and some deliberately positioned sleeps…
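
For what it's worth, here's a minimal sketch of that socket-level behaviour, independent of gunicorn (my own illustration, not gunicorn code): closing a socket while unread data sits in its receive buffer makes the kernel send an RST rather than a normal FIN, and the peer's next read then fails:

import socket
import time

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(1)
port = server.getsockname()[1]

client = socket.create_connection(('127.0.0.1', port))
conn, _ = server.accept()

client.sendall(b'unread request body')  # lands in conn's receive buffer
time.sleep(0.1)                         # let the bytes arrive
conn.close()                            # closed with unread data -> RST, not FIN

time.sleep(0.1)
try:
    client.recv(1)                      # the reset surfaces on the next read
except ConnectionResetError as e:
    print(e)                            # e.g. [Errno 54] Connection reset by peer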

michalc avatar Dec 31 '24 18:12 michalc

Have now got a slightly more minimal example. A 3-line server (no Flask and no files) in test.py:

def app(environ, start_response):
    start_response('200 OK', [])
    return [b'-' * 6553600]  # ~6.5 MB response body

that's run with

gunicorn test:app

And then code to emulate a slow client:

import urllib3
import time

def data():
    time.sleep(1)  # delay finishing the request body
    yield b'-'

resp = urllib3.request("POST", "http://localhost:8000/",
    body=data(),
    preload_content=False,
)
for chunk in resp.stream(65536):
    time.sleep(0.001)  # read the response slowly

And within around 3 seconds, consistently, the client code raises a ConnectionResetError. This is fairly similar to the issues in production I've seen - errors happening within a low number of seconds.

michalc avatar Dec 31 '24 22:12 michalc

Aha! From poking around the code, also found that the error occurs on gevent and gthread workers if keep-alive is disabled, so

gunicorn --workers=1 --worker-class=gevent --keep-alive=0 test:app

or

gunicorn --workers=1 --worker-class=gthread --keep-alive=0 test:app

I suspect what prevents the issue from occurring is the keep-alive loop, and specifically the fact that it calls the http parser in the loop, which has code to "Discard any unread body of the previous message". This means that when the socket's close function is called, the chance of there being unread data in the socket's read buffer is much smaller (although... not impossible if the client just snuck in some data "last minute"?)

If keep-alive is not enabled - either by specific configuration or by using the sync worker - then there is no keep-alive loop, which means the parser won't be called again to read and discard unread body data, and so for a slow client there is a much higher chance that data is still in the socket's read buffer by the time its close function is called.
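
Conceptually (my paraphrase, not gunicorn's actual code), that discard step amounts to something like:

def discard_unread_body(sock, remaining, chunk_size=65536):
    # Read and throw away whatever request body the handler never consumed,
    # so nothing is left in the receive buffer when close() is later called
    while remaining > 0:
        chunk = sock.recv(min(remaining, chunk_size))
        if not chunk:  # client closed early
            break
        remaining -= len(chunk)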

Am tempted to amend the issue title and description

michalc avatar Jan 01 '25 09:01 michalc

I wonder if the fix to this would be to call

client.shutdown(socket.SHUT_RDWR)

just before closing the socket, in both sync and async workers? From some googling, this might more correctly ignore any unread data that was sent from the client?
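
i.e. something like this hypothetical sketch (close_client is made up for illustration, not a patch against gunicorn):

import socket

def close_client(client):
    try:
        client.shutdown(socket.SHUT_RDWR)  # signal both directions are done first
    except OSError:
        pass                               # the peer may already be gone
    client.close()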

michalc avatar Jan 01 '25 09:01 michalc

Actually have found a variation of the request code that shows the error as well. No sleep is needed to delay sending the request body, as long as there is a fair amount of it:

import urllib3
import time

resp = urllib3.request("POST", "http://localhost:8000/",
    body=b'-' * 65536,  # a fair amount of request body, sent without delay
    preload_content=False,
)
time.sleep(1)
for chunk in resp.stream(65536):
    time.sleep(0.005)

michalc avatar Jan 02 '25 06:01 michalc

@michalc, could it be related to this: https://github.com/benoitc/gunicorn/issues/3380? However, that issue is only for gthread

michdan avatar Apr 11 '25 21:04 michdan

@michdan So it's been a while, but I suspect no, since:

  • As you say, that issue is only gthread and this one is wider
  • ... and also that issue and its fix seem to be around restarts, whereas here the issue is during normal, non-restarting operation when there is data that hasn't been read by the server

michalc avatar Apr 12 '25 06:04 michalc

+1

gsmethells avatar Oct 15 '25 05:10 gsmethells