Connection reset during max-requests auto-restart with gthread
We're observing intermittent HTTP 502s in production, which seem to be correlated with the "autorestarting worker after current request" log line, and are less frequent as we increase `max_requests`. I've reproduced this on 21.2.0 and 20.1.0, but it doesn't seem to happen in 20.0.4.
I've put together a minimal reproduction case following the gunicorn.org example as closely as possible, but please let me know if there are other changes you'd recommend:
Application
def app(environ, start_response):
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data)))
    ])
    return iter([data])
Configuration
gunicorn --max-requests 10 --worker-class gthread myapp:app
Reproduction
#!/usr/bin/env bash
set -euo pipefail
# start gunicorn
gunicorn --max-requests 10 --log-level debug --worker-class gthread myapp:app &
# wait for gunicorn to be ready
sleep 5
# send 20 sequential HTTP requests with minimal output
for i in {1..20}; do
curl -S -s -o /dev/null -w "[REPRO]: Received HTTP %{http_code} from GET /${i}\n" "127.0.0.1:8000/${i}" || true
done
Quickstart
For convenience, I've packaged this into a single command that consistently reproduces the problem on my machine. If you have Docker installed, this should Just Work™️.
cat <<DOCKER_EOF | docker build -t gunicorn-repro-max-requests-rst - && docker run --rm gunicorn-repro-max-requests-rst
FROM python:3.11
RUN pip install --disable-pip-version-check --no-cache gunicorn
RUN cat <<EOF > myapp.py
def app(environ, start_response):
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data)))
    ])
    return iter([data])
EOF
RUN cat <<EOF > repro.sh
#!/usr/bin/env bash
set -euo pipefail
# start gunicorn
gunicorn --max-requests 10 --log-level debug --worker-class gthread myapp:app &
# wait for gunicorn to be ready
sleep 5
# send 20 sequential HTTP requests with minimal output
for i in {1..20}; do
curl -S -s -o /dev/null -w "[REPRO]: Received HTTP %{http_code} from GET /\\\${i}\n" "127.0.0.1:8000/\\\${i}" || true
done
EOF
RUN chmod +x repro.sh
CMD ["./repro.sh"]
DOCKER_EOF
Example
Logs
[2023-07-21 00:20:10 +0000] [6] [DEBUG] Current configuration:
config: ./gunicorn.conf.py
wsgi_app: None
bind: ['127.0.0.1:8000']
backlog: 2048
workers: 1
worker_class: gthread
threads: 1
worker_connections: 1000
max_requests: 10
max_requests_jitter: 0
timeout: 30
graceful_timeout: 30
keepalive: 2
limit_request_line: 4094
limit_request_fields: 100
limit_request_field_size: 8190
reload: False
reload_engine: auto
reload_extra_files: []
spew: False
check_config: False
print_config: False
preload_app: False
sendfile: None
reuse_port: False
chdir: /
daemon: False
raw_env: []
pidfile: None
worker_tmp_dir: None
user: 0
group: 0
umask: 0
initgroups: False
tmp_upload_dir: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
forwarded_allow_ips: ['127.0.0.1']
accesslog: None
disable_redirect_access_to_syslog: False
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
errorlog: -
loglevel: debug
capture_output: False
logger_class: gunicorn.glogging.Logger
logconfig: None
logconfig_dict: {}
logconfig_json: None
syslog_addr: udp://localhost:514
syslog: False
syslog_prefix: None
syslog_facility: user
enable_stdio_inheritance: False
statsd_host: None
dogstatsd_tags:
statsd_prefix:
proc_name: None
default_proc_name: myapp:app
pythonpath: None
paste: None
on_starting: <function OnStarting.on_starting at 0xffff935672e0>
on_reload: <function OnReload.on_reload at 0xffff93567420>
when_ready: <function WhenReady.when_ready at 0xffff93567560>
pre_fork: <function Prefork.pre_fork at 0xffff935676a0>
post_fork: <function Postfork.post_fork at 0xffff935677e0>
post_worker_init: <function PostWorkerInit.post_worker_init at 0xffff93567920>
worker_int: <function WorkerInt.worker_int at 0xffff93567a60>
worker_abort: <function WorkerAbort.worker_abort at 0xffff93567ba0>
pre_exec: <function PreExec.pre_exec at 0xffff93567ce0>
pre_request: <function PreRequest.pre_request at 0xffff93567e20>
post_request: <function PostRequest.post_request at 0xffff93567ec0>
child_exit: <function ChildExit.child_exit at 0xffff92db8040>
worker_exit: <function WorkerExit.worker_exit at 0xffff92db8180>
nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0xffff92db82c0>
on_exit: <function OnExit.on_exit at 0xffff92db8400>
ssl_context: <function NewSSLContext.ssl_context at 0xffff92db85e0>
proxy_protocol: False
proxy_allow_ips: ['127.0.0.1']
keyfile: None
certfile: None
ssl_version: 2
cert_reqs: 0
ca_certs: None
suppress_ragged_eofs: True
do_handshake_on_connect: False
ciphers: None
raw_paste_global_conf: []
strip_header_spaces: False
[2023-07-21 00:20:10 +0000] [6] [INFO] Starting gunicorn 21.2.0
[2023-07-21 00:20:10 +0000] [6] [DEBUG] Arbiter booted
[2023-07-21 00:20:10 +0000] [6] [INFO] Listening at: http://127.0.0.1:8000 (6)
[2023-07-21 00:20:10 +0000] [6] [INFO] Using worker: gthread
[2023-07-21 00:20:10 +0000] [8] [INFO] Booting worker with pid: 8
[2023-07-21 00:20:10 +0000] [6] [DEBUG] 1 workers
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /1
[REPRO]: Received HTTP 200 from GET /1
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /2
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /2
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /3
[REPRO]: Received HTTP 200 from GET /3
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /4
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /4
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /5
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /5
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /6
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /6
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /7
[REPRO]: Received HTTP 200 from GET /7
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /8
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /8
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /9
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /9
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /10
[2023-07-21 00:20:15 +0000] [8] [INFO] Autorestarting worker after current request.
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /10
curl: (56) Recv failure: Connection reset by peer
[2023-07-21 00:20:15 +0000] [8] [INFO] Worker exiting (pid: 8)
[REPRO]: Received HTTP 000 from GET /11
[2023-07-21 00:20:15 +0000] [22] [INFO] Booting worker with pid: 22
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /12
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /12
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /13
[REPRO]: Received HTTP 200 from GET /13
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection.
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /14
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /14
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /15
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /15
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /16
[REPRO]: Received HTTP 200 from GET /16
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection.
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /17
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /17
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /18
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /18
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /19
[REPRO]: Received HTTP 200 from GET /19
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection.
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /20
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /20
Expected
I'd expect to receive an HTTP 200 for each request, regardless of the max-requests restart. We should see `[DEBUG] GET /11` when the worker handles the 11th request.
Actual
The reproduction script sends `GET /11`, but the worker never sees it, and we see a connection reset instead. The repro script reports a status code of `000`, but that's just a quirk of libcurl; I've used tcpdump and can confirm the RST.
In case it's useful, I've also seen `curl: (52) Empty reply from server`, but it happens less frequently and I'm not 100% sure that it's the same problem.
Workaround
Increasing max-requests makes this happen less frequently, but the only way to resolve it is to disable max-requests (or maybe switch to a different worker type?). Increasing the number of workers or threads doesn't seem to resolve the problem either, from what I've seen.
Maybe useful, here's a hacky fix I put together locally: edit `workers/gthread.py` so that `accept()` doesn't accept new requests from the socket after `force_close()` is called on the worker:
(Apologies for a crappy vim screenshot, I wanted to test this out before sitting down for dinner.)
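The screenshot isn't reproduced here, but the gist of the change is an early-return guard at the top of the worker's accept() method, roughly like the sketch below (approximate; the real body of accept() in gunicorn/workers/gthread.py wraps the connection and registers it with the poller, and differs between versions):

```python
# Sketch only: the guard added to the gthread worker's accept() method.
def accept(self, server, listener):
    if not self.alive:
        # The worker has already been told to stop (e.g. max-requests was
        # reached), so leave the pending connection in the listen backlog for
        # the replacement worker instead of accepting it and never reading it.
        self.log.debug("accept() called when alive=False")
        return
    sock, client = listener.accept()
    ...  # unchanged: wrap the connection and register it with the poller
```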
New output:
[2023-07-21 02:51:31 +0000] [641] [INFO] Starting gunicorn 21.2.0
[2023-07-21 02:51:31 +0000] [641] [DEBUG] Arbiter booted
[2023-07-21 02:51:31 +0000] [641] [INFO] Listening at: http://127.0.0.1:8000 (641)
[2023-07-21 02:51:31 +0000] [641] [INFO] Using worker: gthread
[2023-07-21 02:51:31 +0000] [643] [INFO] Booting worker with pid: 643
[2023-07-21 02:51:31 +0000] [641] [DEBUG] 1 workers
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /1
[REPRO]: Received HTTP 200 from GET /1
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /2
[REPRO]: Received HTTP 200 from GET /2
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /3
[REPRO]: Received HTTP 200 from GET /3
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /4
[REPRO]: Received HTTP 200 from GET /4
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /5
[REPRO]: Received HTTP 200 from GET /5
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /6
[REPRO]: Received HTTP 200 from GET /6
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /7
[REPRO]: Received HTTP 200 from GET /7
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /8
[REPRO]: Received HTTP 200 from GET /8
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /9
[REPRO]: Received HTTP 200 from GET /9
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /10
[2023-07-21 02:51:36 +0000] [643] [INFO] Autorestarting worker after current request.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /10
[2023-07-21 02:51:36 +0000] [643] [DEBUG] accept() called when alive=False
[2023-07-21 02:51:36 +0000] [643] [INFO] Worker exiting (pid: 643)
[2023-07-21 02:51:36 +0000] [656] [INFO] Booting worker with pid: 656
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /11
[REPRO]: Received HTTP 200 from GET /11
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /12
[REPRO]: Received HTTP 200 from GET /12
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /13
[REPRO]: Received HTTP 200 from GET /13
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /14
[REPRO]: Received HTTP 200 from GET /14
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /15
[REPRO]: Received HTTP 200 from GET /15
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /16
[REPRO]: Received HTTP 200 from GET /16
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /17
[REPRO]: Received HTTP 200 from GET /17
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /18
[REPRO]: Received HTTP 200 from GET /18
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /19
[REPRO]: Received HTTP 200 from GET /19
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /20
[2023-07-21 02:51:36 +0000] [656] [INFO] Autorestarting worker after current request.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /20
I've found a related issue, ~which is not resolved by #3039, although it'd be unlikely to be deployed in the wild~: when `max_requests >= threads`, we see the same connection reset error.
For example:
gunicorn --worker-class gthread --max-requests 4 --threads 4 myapp:app
With this config, we can reproduce a consistent connection reset with only five HTTP requests.
[2023-07-21 15:59:17 +0000] [7] [DEBUG] Current configuration:
config: ./gunicorn.conf.py
wsgi_app: None
bind: ['127.0.0.1:8000']
backlog: 2048
workers: 1
worker_class: gthread
threads: 4
worker_connections: 1000
max_requests: 4
max_requests_jitter: 0
timeout: 30
graceful_timeout: 30
keepalive: 2
limit_request_line: 4094
limit_request_fields: 100
limit_request_field_size: 8190
reload: False
reload_engine: auto
reload_extra_files: []
spew: False
check_config: False
print_config: False
preload_app: False
sendfile: None
reuse_port: False
chdir: /
daemon: False
raw_env: []
pidfile: None
worker_tmp_dir: None
user: 0
group: 0
umask: 0
initgroups: False
tmp_upload_dir: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
forwarded_allow_ips: ['127.0.0.1']
accesslog: None
disable_redirect_access_to_syslog: False
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
errorlog: -
loglevel: debug
capture_output: False
logger_class: gunicorn.glogging.Logger
logconfig: None
logconfig_dict: {}
logconfig_json: None
syslog_addr: udp://localhost:514
syslog: False
syslog_prefix: None
syslog_facility: user
enable_stdio_inheritance: False
statsd_host: None
dogstatsd_tags:
statsd_prefix:
proc_name: None
default_proc_name: myapp:app
pythonpath: None
paste: None
on_starting: <function OnStarting.on_starting at 0xffff9b1c72e0>
on_reload: <function OnReload.on_reload at 0xffff9b1c7420>
when_ready: <function WhenReady.when_ready at 0xffff9b1c7560>
pre_fork: <function Prefork.pre_fork at 0xffff9b1c76a0>
post_fork: <function Postfork.post_fork at 0xffff9b1c77e0>
post_worker_init: <function PostWorkerInit.post_worker_init at 0xffff9b1c7920>
worker_int: <function WorkerInt.worker_int at 0xffff9b1c7a60>
worker_abort: <function WorkerAbort.worker_abort at 0xffff9b1c7ba0>
pre_exec: <function PreExec.pre_exec at 0xffff9b1c7ce0>
pre_request: <function PreRequest.pre_request at 0xffff9b1c7e20>
post_request: <function PostRequest.post_request at 0xffff9b1c7ec0>
child_exit: <function ChildExit.child_exit at 0xffff9aa18040>
worker_exit: <function WorkerExit.worker_exit at 0xffff9aa18180>
nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0xffff9aa182c0>
on_exit: <function OnExit.on_exit at 0xffff9aa18400>
ssl_context: <function NewSSLContext.ssl_context at 0xffff9aa185e0>
proxy_protocol: False
proxy_allow_ips: ['127.0.0.1']
keyfile: None
certfile: None
ssl_version: 2
cert_reqs: 0
ca_certs: None
suppress_ragged_eofs: True
do_handshake_on_connect: False
ciphers: None
raw_paste_global_conf: []
strip_header_spaces: False
[2023-07-21 15:59:17 +0000] [7] [INFO] Starting gunicorn 21.2.0
[2023-07-21 15:59:17 +0000] [7] [DEBUG] Arbiter booted
[2023-07-21 15:59:17 +0000] [7] [INFO] Listening at: http://127.0.0.1:8000 (7)
[2023-07-21 15:59:17 +0000] [7] [INFO] Using worker: gthread
[2023-07-21 15:59:17 +0000] [9] [INFO] Booting worker with pid: 9
[2023-07-21 15:59:17 +0000] [7] [DEBUG] 1 workers
[2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /1
[2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /1
[2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /2
[2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /2
[2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /3
[2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /3
[2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /4
[2023-07-21 15:59:22 +0000] [9] [INFO] Autorestarting worker after current request.
[2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /4
curl: (56) Recv failure: Connection reset by peer
[2023-07-21 15:59:22 +0000] [9] [INFO] Worker exiting (pid: 9)
[REPRO]: Received HTTP 000 from GET /5
After a `git bisect`, it looks like the culprit is 0ebb73aa240f0ecffe3e0922d54cfece19f5bfed (https://github.com/benoitc/gunicorn/pull/2918).
We are seeing this when making API calls to NetBox as well. gunicorn version is 21.2.0. NGINX logs randomly show an upstream error "104: Connection reset by peer" which we correlate with "Autorestarting worker after current request." in gunicorn logs.
@MANT5149, I'm in the same boat as well, after NetBox upgrade to 3.5.7 (which includes gunicorn 21.2.0) we're seeing same issue when autorestart is happening:
Aug 10 11:40:43 hostname gunicorn[595183]: [2023-08-10 11:40:43 +0000] [595183] [INFO] Autorestarting worker after current request.
Aug 10 11:40:43 hostname gunicorn[595183]: [2023-08-10 11:40:43 +0000] [595183] [INFO] Worker exiting (pid: 595183)
Aug 10 11:40:44 hostname gunicorn[613565]: [2023-08-10 11:40:44 +0000] [613565] [INFO] Booting worker with pid: 613565
Aug 10 12:33:36 hostname gunicorn[622501]: [2023-08-10 12:33:36 +0000] [622501] [INFO] Autorestarting worker after current request.
Aug 10 12:33:36 hostname gunicorn[622501]: [2023-08-10 12:33:36 +0000] [622501] [INFO] Worker exiting (pid: 622501)
Aug 10 12:33:36 hostname gunicorn[639160]: [2023-08-10 12:33:36 +0000] [639160] [INFO] Booting worker with pid: 639160
Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Autorestarting worker after current request.
Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Worker exiting (pid: 579373)
Aug 10 13:00:07 hostname gunicorn[648814]: [2023-08-10 13:00:07 +0000] [648814] [INFO] Booting worker with pid: 648814
Nginx error log:
2023/08/10 11:40:43 [error] 1092#0: *744453 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.xxx.xxx, server: netbox.example.com, request: "GET /api/tenancy/tenants/19/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/tenancy/tenants/19/", host: "netbox-api.example.com"
2023/08/10 12:33:36 [error] 1092#0: *776456 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.xxx.xxx, server: netbox.example.com, request: "GET /api/virtualization/clusters/46/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/virtualization/clusters/46/", host: "netbox-api.example.com"
2023/08/10 13:00:06 [error] 1092#0: *787694 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.xxx.xxx, server: netbox.example.com, request: "GET /api/tenancy/tenants/19/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/tenancy/tenants/19/", host: "netbox-api.example.com"
Downgrading Gunicorn to 20.1.0 fixes this issue.
Have y'all tried the patch in #3039?
Thanks for the report. It's expected to have the worker restart there. If a request has already landed in the worker when it's closing, that may happen. Maybe the latest change, which accepts requests faster, triggers it. Can you give an idea of the number of concurrent requests gunicorn is receiving in the usual case?
Besides that, why do you set max requests so low? This setting should only be used as a workaround for temporary memory issues in the application.
The worker restart is expected, but after 0ebb73a (#2918) every request runs through the `while self.alive` loop twice:

- the first time through the loop it calls `accept()` on the socket to create the connection
- the second time through the loop it calls `recv()` on the connection to handle the request and provide a response

This means that during a max-requests restart, we call `accept()` on requests but never `recv()`, which means that they aren't added to `self.futures` to be awaited during the graceful timeout.
Before the change, if you sent two requests around the same time you'd see:

- `self.alive is True`, so we:
  - `accept()` request A
  - `recv()` request A, which sets `self.alive = False` in `handle_request`
  - worker restart
- `self.alive is False`, so we exit the loop and restart the worker
- `self.alive is True`, so we:
  - `accept()` request B
  - `recv()` request B
After the change, this becomes:

- `self.alive is True`, so we `accept()` request A
- `self.alive is True`, so we:
  - `recv()` request A, which sets `self.alive = False` in `handle_request`
  - `accept()` request B
- `self.alive is False`, so we exit the loop and restart the worker (‼️ without handling request B ‼️)
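To make that sequencing concrete, here's a toy model of the loop. It is deliberately not gunicorn's real code: the `Worker`/`run` names are illustrative and the deque stands in for the worker's poller. With two near-simultaneous requests and `max_requests=1`, request B is accepted in the same pass that flips `alive` to `False` and is then dropped when the loop exits:

```python
# Toy model only -- not gunicorn's real code.
from collections import deque

class Worker:
    def __init__(self, max_requests):
        self.alive = True
        self.max_requests = max_requests
        self.handled = 0
        self.accepted_not_recvd = []   # accepted connections not yet handled

    def accept(self, request):
        # First pass: the connection is accepted but not yet tracked as a future.
        self.accepted_not_recvd.append(request)

    def recv(self, request):
        # Second pass: the request is actually handled and responded to.
        self.accepted_not_recvd.remove(request)
        self.handled += 1
        if self.handled >= self.max_requests:
            self.alive = False   # "Autorestarting worker after current request."

def run(worker, incoming):
    events = deque()
    while worker.alive:
        if incoming:
            # The listening socket is readable: a new client is waiting.
            events.append(("listener", incoming.popleft()))
        # A single poll pass can hand back both a readable client connection
        # and a new connection on the listener.
        for kind, request in [events.popleft() for _ in range(len(events))]:
            if kind == "listener":
                worker.accept(request)
                events.append(("client", request))   # readable on a later pass
            else:
                worker.recv(request)
    # The loop has exited; anything accepted but never recv()'d is reset.
    return worker.accepted_not_recvd

worker = Worker(max_requests=1)
print("dropped:", run(worker, deque(["A", "B"])))   # -> dropped: ['B']
```

In the real worker, that dropped connection is what shows up as the RST on the request that arrives right around "Autorestarting worker after current request."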
Can you give an idea of the number of concurrent requests gunicorn is receiving in the usual case?
This bug only requires two concurrent requests to a worker, but I'll often have ~4 concurrent requests per worker, and this bug means that 1 will complete and the rest have their connections reset. ~To highlight the impact, this means that setting max-requests at 10k and averaging 4 concurrent requests will cause 0.03% of HTTP requests to fail.~ EDIT: I've included a benchmark in #3039 instead of speculating.
Besides that, why do you set max requests so low? This setting should only be used as a workaround for temporary memory issues in the application.
That's a minimal reproducible example, not the real application + configuration I'm using in production. I'm using a range from 1,000 to 1,000,000 depending on the exact application / deployment, and completely agree that `--max-requests 10` is inappropriately low for most scenarios.
@benoitc I just added a benchmark to #3039 showing an error rate of 0.09% when we restart every 1024 requests, in case that's helpful.
Have y'all tried the patch in https://github.com/benoitc/gunicorn/pull/3039?
@christianbundy, I've tried it and it appears to fix this issue for me
#3039 fixes this issue BUT for me the idle load goes up from 0.x to 5!
0.x to 5
Are these CPU utilization percentages? i.e. CPU load increases from <1% to 5%? This makes sense, because previously we were blocking and waiting on IO for two seconds while waiting for new connections, but now we're looping to check for either 'new connections' or 'new data on a connection'.
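For context on the 'blocking' versus 'looping' distinction, here's a small standalone illustration (not gunicorn code; the socket setup is just scaffolding) of why polling a selector with a zero timeout pegs a core while a blocking wait stays near 0% CPU:

```python
import selectors
import socket
import time

sel = selectors.DefaultSelector()
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

# Blocking wait: the thread sleeps in the kernel until an event arrives or the
# timeout expires, so an idle worker costs ~0% CPU.
start = time.monotonic()
sel.select(timeout=2.0)
print(f"blocking select returned after {time.monotonic() - start:.1f}s")

# Busy poll: a zero timeout returns immediately, so this loop spins flat out
# and pins a core even though nothing is happening.
start, calls = time.monotonic(), 0
while time.monotonic() - start < 2.0:
    sel.select(timeout=0)
    calls += 1
print(f"busy poll made {calls} select() calls in 2s")
```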
No, 5 means 5 CPU cores are 100% utilized (500%); I have 5 workers configured. The arrows point to the bumps where I ran a script that fires many requests against a gunicorn server.
Thanks for the context @r-lindner -- admittedly I don't think I understand the difference between 'fix from #3038' and 'fix from #3039', but I can see that the CPU utilization is significantly higher.
I've just pushed a commit to my PR branch that resolves the issue on my machine, can you confirm whether it works for you? It's a significantly different approach, which I haven't tested as thoroughly, but it seems to resolve the connection resets and also keeps the CPU utilization low.
Before: idles at ~10% CPU
After: idles at ~0.01% CPU
Maybe useful, here's a hacky fix I put together locally: edit `workers/gthread.py` so that `accept()` doesn't accept new requests from the socket after `force_close()` is called on the worker: (Apologies for a crappy vim screenshot, I wanted to test this out before sitting down for dinner.)
New output:
this is the 'fix from #3038', and the 'fix from #3039' was the pull request without the changes from Aug 26. I am now using the updated #3039 without CPU issues. Due to changes I made a week ago I cannot test if the original bug is fixed, but I guess you already tested this :-) So this looks good.
When fronting gunicorn 20.1.0 with nginx 1.23.3, we observe "connection reset by peer" errors in Nginx that correlate with gthread worker auto restarts.
https://github.com/benoitc/gunicorn/issues/1236 seems related, which describes an issue specifically with keepalive connections. That issue is older and I am unsure of the current state, but this comment implies an ongoing issue. Note the original reproduction steps in this issue, 3038, have keepalive enabled by default.
When we disable keepalives in gunicorn, we observe a latency regression but it does stop the connection reset errors.
Should there be documented guidance, for now, not to use `--max-requests` with keepalive + gthread workers?
As far as I can see, options for consumers are:
- Disable keepalive if using gthread workers with max requests
- Possibly use gevent workers instead
We face this issue exactly as described. Thanks for the report and the ongoing work on this. Is there an ETA for publishing a fix?
Same issue here. I swapped to gthread workers from sync, and randomly, my server just stopped taking requests.
Reverted back to sync for now.
We are also running into this issue after a NetBox upgrade. Downgrading gunicorn to 20.1.0 fixes it for the moment but a proper fix would be appreciated.
We are also running into this problem after upgrading Netbox from 3.4.8 to 3.6.9, which makes gunicorn go from 20.1.0 to 21.2.0.
One of the heavier scripts works flawlessly on Netbox 3.4.8 (gunicorn 20.1.0), but on 3.6.9 (gunicorn 21.2.0) it fails with the below message and it has not failed at the exact same place twice:
Traceback (most recent call last):
  File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 394, in <module>
    main()
  File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 309, in main
    service_entry = nb_update_vs(nb_vs,sat_gw_ip,sat_gw_name,community_name,community_type,interop_hack_ip)
  File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 124, in nb_update_vs
    nb_service_entry_ip = str(nb_service_entry.ipaddresses[0]).split('/')[0]
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 327, in __str__
    getattr(self, "name", None)
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 303, in __getattr__
    if self.full_details():
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 459, in full_details
    self._parse_values(next(req.get()))
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 291, in get
    req = self._make_call(add_params=add_params)
  File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 258, in _make_call
    raise RequestError(req)
pynetbox.core.query.RequestError: The request failed with code 502 Bad Gateway but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.
/var/log/nginx/error.log:
2024/01/08 14:27:35 [error] 919#919: *110908 recv() failed (104: Unknown error) while reading response header from upstream, client: 10.10.10.74, server: netbox-test.domain.test, request: "GET /api/ipam/ip-addresses/24202/ HTTP/1.1", upstream: "http://10.10.10.5:8001/api/ipam/ip-addresses/24202/", host: "netbox-test.domain.test"
gunicorn log:
Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Autorestarting worker after current request.
Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Worker exiting (pid: 1129991)
Jan 08 14:39:26 30001vmnb02-prod gunicorn[1139845]: [2024-01-08 14:39:26 +0100] [1139845] [INFO] Booting worker with pid: 1139845
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Autorestarting worker after current request.
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Worker exiting (pid: 1129962)
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1139926]: [2024-01-08 14:44:11 +0100] [1139926] [INFO] Booting worker with pid: 1139926
Versions:
- Netbox: 3.6.9
- Python: 3.10.12
- Redis: 6.0.16
- nginx: 1.18.0 (Ubuntu)
- PostgreSQL (psql): 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1)
- gunicorn: 21.2.0
- pynetbox: 7.3.3
Linux vmnb02-test 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Is there a release underway to fix this, or should we still refrain from upgrading? In its current state, gunicorn is not production-worthy for us. :(
I have been testing a few things and these are my findings: decreasing the max_requests setting doesn't help, it just fails sooner; increasing the max_requests setting doesn't help either, it still fails at some point. When it fails, a log entry about restarting shows up at the same time:
Example:
Jan 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Autorestarting worker after current request.
Jan 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Worker exiting (pid: 106082)
Jan 09 13:20:08 30001vmnb02-test gunicorn[106256]: [2024-01-09 13:20:08 +0100] [106256] [INFO] Booting worker with pid: 106256
If I set max_requests to 0, disabling it, my scripts work without error. But is this preferable to having the gunicorn processes restart regularly? I suppose memory usage could grow over time, if the application has memory leaks, that is.
Perhaps a scheduled restart of the Netbox and netbox-rq services (thereby restarting the gunicorn worker processes) once a day would do the trick?
I have come to the conclusion that, rather than downgrade gunicorn and maybe lose some necessary features, I will go ahead with max_requests set to 0, and if memory usage becomes an issue on the server I will set up a scheduled job that restarts the worker processes with this command:
ps -aux | grep venv/bin/gunicorn | grep Ss | awk '{ system("kill -HUP " $2 )}'
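For reference, the same workaround expressed as a gunicorn.conf.py: max_requests = 0 is the documented way to disable the automatic worker restarts, while the other values here (workers, threads) are only illustrative placeholders.

```python
# gunicorn.conf.py (illustrative values; max_requests = 0 is the point here)
workers = 5
worker_class = "gthread"
threads = 4
max_requests = 0          # 0 (the default) disables max-requests worker recycling
max_requests_jitter = 0
```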
Just don't pass the max_requests option? I never use it myself; it's only for when I have a temporary memory leak, and never in production.
For the record, there is a variety of situations where memory leaks are difficult to address:
- they're costly to track down and fix (in our own code base, assuming the leak even comes from there)
- they may be in third-party libs
We stayed on a lower release to avoid this issue. However, we now have to upgrade because of the HTTP Request Smuggling vulnerability (CVE-2024-1135). Has anyone been able to successfully work around this issue (short of turning off max-requests)?
@rajivramanathan don't use max-requests? Max requests is there for the worst case, when your application leaks.
Benoît, I think we understand your advice, but many apps may find themselves in the "application leaks and we can't do much about it" place, hence the usefulness of max-requests.
We stayed on a lower release to avoid this issue. However, we now have to upgrade because of the HTTP Request Smuggling vulnerability (CVE-2024-1135). Has anyone been able to successfully work around this issue (short of turning off max-requests)?
We have NGINX in front of Gunicorn, so we addressed it by setting up multiple Gunicorn instances upstream, listening on different ports, and using the proxy_next_upstream configuration (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream) to try the next upstream when we encounter a 502 error.