[CRITICAL] WORKER TIMEOUT when using Docker Compose
Problem
I have a Flask app that I'm running using Gunicorn. When I run it locally on my machine, it starts up within 5 seconds. But when I run it via Docker Compose in a container, startup exceeds the default timeout of 30 seconds, causing the workers to time out and exit.
Why does startup time take so long when running via Docker Compose?
Notes
- I am aware that the --timeout option allows more time for the Gunicorn workers to start successfully (see the config sketch after this list). I am not interested in starting my workers eventually; I am interested in starting them within 30 seconds in a Docker container.
- I've seen some resources suggesting the use of gevent. This is likely not an option for me, because I likely have code (or may have code in the future) that is incompatible with gevent.
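For reference only, since simply raising the timeout is not what I'm after: the same option can also be set in the gunicorn config file. A minimal sketch, where 120 is an arbitrary value:
# gunicorn.py -- sketch only: config-file equivalent of --timeout
timeout = 120  # seconds; same effect as passing --timeout=120 on the command line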
Potentially Helpful Information
What I've Tried
- I've set what I think are sufficient limits for both CPU and memory for my Flask app within compose.yml.
- I've set --worker-tmp-dir=/dev/shm, after reading https://pythonspeed.com/articles/gunicorn-in-docker/, but this did not help with startup time (see the config-file sketch after this list).
- After configuring a more relaxed timeout (via --timeout), it takes about 2 minutes for the workers to start successfully. After they start, there does not seem to be any slowness with respect to serving incoming requests.
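For completeness, the worker temp directory can also be set in the config file rather than on the command line; a minimal sketch:
# gunicorn.py -- sketch only: config-file equivalent of --worker-tmp-dir=/dev/shm
worker_tmp_dir = "/dev/shm"  # keep the worker heartbeat file on tmpfs instead of on disk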
Logs
[2022-05-16 16:59:57 +0000] [1] [INFO] Starting gunicorn 20.0.4
[2022-05-16 16:59:57 +0000] [1] [DEBUG] Arbiter booted
[2022-05-16 16:59:57 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2022-05-16 16:59:57 +0000] [1] [INFO] Using worker: threads
[2022-05-16 16:59:57 +0000] [10] [INFO] Booting worker with pid: 10
[2022-05-16 16:59:57 +0000] [12] [INFO] Booting worker with pid: 12
[2022-05-16 16:59:57 +0000] [14] [INFO] Booting worker with pid: 14
[2022-05-16 16:59:57 +0000] [16] [INFO] Booting worker with pid: 16
[2022-05-16 16:59:57 +0000] [18] [INFO] Booting worker with pid: 18
[2022-05-16 16:59:57 +0000] [1] [DEBUG] 5 workers
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:10)
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:12)
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:14)
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:16)
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:18)
[2022-05-16 17:00:27 +0000] [16] [INFO] Worker exiting (pid: 16)
[2022-05-16 17:00:27 +0000] [12] [INFO] Worker exiting (pid: 12)
[2022-05-16 17:00:27 +0000] [14] [INFO] Worker exiting (pid: 14)
[2022-05-16 17:00:27 +0000] [10] [INFO] Worker exiting (pid: 10)
[2022-05-16 17:00:27 +0000] [18] [INFO] Worker exiting (pid: 18)
gunicorn.py
import multiprocessing

bind = ":8000"  # bind to all interfaces on port 8000
hard_coded_cpu_limit = min(multiprocessing.cpu_count(), 2)
workers = hard_coded_cpu_limit * 2 + 1  # 2 * 2 + 1 = 5 workers, matching "5 workers" in the logs
threads = 4
app.py
from flask import Flask
app = Flask("my_app")  # module-level app object that gunicorn loads via "app:app"
if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8000, debug=True)
compose.yml
version: "3.9"
services:
cloud-sql-proxy:
container_name: "cloud-sql-proxy"
image: gcr.io/cloudsql-docker/gce-proxy:1.30.1
platform: linux/x86_64
deploy:
resources:
limits:
cpus: '0.5'
memory: 512M
volumes:
- ${PWD}/secrets/<REDACTED>
- ${PWD}/cloudsql:/cloudsql
ports:
- 3306:3306
command:
[
"/cloud_sql_proxy",
"-dir=/cloudsql",
"-instances=<REDACTED>=unix:cloud-sql-proxy.sock",
"-credential_file=/secrets/<REDACTED>"
]
my-app:
container_name: my-app
build: ${PWD}
platform: linux/x86_64
deploy:
resources:
limits:
cpus: '8'
memory: 8G
volumes:
- ${PWD}/secrets/<REDACTED>
- ${PWD}/cloudsql:/cloudsql
- ${PWD}:/app
ports:
# NOTE: map port 8000 on container to 8080 on host machine
- 8080:8000
command:
[
"pipenv",
"run",
"gunicorn",
"app:app",
"--name=my-app",
"--config=gunicorn.py",
"--worker-tmp-dir=/dev/shm",
"--reload"
]
restart: always
env_file:
- ${PWD}/.env
It seems like https://stackoverflow.com/questions/58942398/docker-running-a-flask-app-via-gunicorn-worker-timeouts-poor-performance is a similar post, but changing the worker temp directory doesn't fix the issue for me. 😢
I have experienced a similar problem, where I was seeing the 30s delay (the default timeout setting) on startup of a Dash app behind gunicorn in Docker. The worker was restarted, and the app started serving. I have been able to reproduce this behaviour with a much simpler Flask app:
app.py
from flask import Flask
app = Flask(__name__)
@app.route("/")
def index():
return "Hello, world!"
requirements.txt
click==8.1.3
Flask==2.1.2
gunicorn==20.1.0
itsdangerous==2.1.2
Jinja2==3.1.2
MarkupSafe==2.1.1
Werkzeug==2.1.2
Dockerfile
FROM python:3.10.4-slim-bullseye
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD gunicorn -b 0.0.0.0:80 app:app --log-level debug --timeout 5
Environment and setup
$ docker --version
Docker version 20.10.16, build aa7e414
Building image:
docker build -t test_flask_gunicorn_docker .
Running container:
$ docker run -p 8080:80 --rm test_flask_gunicorn_docker
[2022-06-19 18:25:07 +0000] [7] [DEBUG] Current configuration:
config: ./gunicorn.conf.py
wsgi_app: None
bind: ['0.0.0.0:80']
backlog: 2048
workers: 1
worker_class: sync
threads: 1
worker_connections: 1000
max_requests: 0
max_requests_jitter: 0
timeout: 5
graceful_timeout: 30
keepalive: 2
limit_request_line: 4094
limit_request_fields: 100
limit_request_field_size: 8190
reload: False
reload_engine: auto
reload_extra_files: []
spew: False
check_config: False
print_config: False
preload_app: False
sendfile: None
reuse_port: False
chdir: /
daemon: False
raw_env: []
pidfile: None
worker_tmp_dir: None
user: 0
group: 0
umask: 0
initgroups: False
tmp_upload_dir: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
forwarded_allow_ips: ['127.0.0.1']
accesslog: None
disable_redirect_access_to_syslog: False
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
errorlog: -
loglevel: debug
capture_output: False
logger_class: gunicorn.glogging.Logger
logconfig: None
logconfig_dict: {}
syslog_addr: udp://localhost:514
syslog: False
syslog_prefix: None
syslog_facility: user
enable_stdio_inheritance: False
statsd_host: None
dogstatsd_tags:
statsd_prefix:
proc_name: None
default_proc_name: app:app
pythonpath: None
paste: None
on_starting: <function OnStarting.on_starting at 0xffff83a17be0>
on_reload: <function OnReload.on_reload at 0xffff83a17d00>
when_ready: <function WhenReady.when_ready at 0xffff83a17e20>
pre_fork: <function Prefork.pre_fork at 0xffff83a17f40>
post_fork: <function Postfork.post_fork at 0xffff83a300d0>
post_worker_init: <function PostWorkerInit.post_worker_init at 0xffff83a301f0>
worker_int: <function WorkerInt.worker_int at 0xffff83a30310>
worker_abort: <function WorkerAbort.worker_abort at 0xffff83a30430>
pre_exec: <function PreExec.pre_exec at 0xffff83a30550>
pre_request: <function PreRequest.pre_request at 0xffff83a30670>
post_request: <function PostRequest.post_request at 0xffff83a30700>
child_exit: <function ChildExit.child_exit at 0xffff83a30820>
worker_exit: <function WorkerExit.worker_exit at 0xffff83a30940>
nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0xffff83a30a60>
on_exit: <function OnExit.on_exit at 0xffff83a30b80>
proxy_protocol: False
proxy_allow_ips: ['127.0.0.1']
keyfile: None
certfile: None
ssl_version: 2
cert_reqs: 0
ca_certs: None
suppress_ragged_eofs: True
do_handshake_on_connect: False
ciphers: None
raw_paste_global_conf: []
strip_header_spaces: False
[2022-06-19 18:25:07 +0000] [7] [INFO] Starting gunicorn 20.1.0
[2022-06-19 18:25:07 +0000] [7] [DEBUG] Arbiter booted
[2022-06-19 18:25:07 +0000] [7] [INFO] Listening at: http://0.0.0.0:80 (7)
[2022-06-19 18:25:07 +0000] [7] [INFO] Using worker: sync
[2022-06-19 18:25:07 +0000] [8] [INFO] Booting worker with pid: 8
[2022-06-19 18:25:07 +0000] [7] [DEBUG] 1 workers
[2022-06-19 18:26:06 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:8)
[2022-06-19 18:26:06 +0000] [8] [INFO] Worker exiting (pid: 8)
[2022-06-19 18:26:07 +0000] [9] [INFO] Booting worker with pid: 9
[2022-06-19 18:26:07 +0000] [9] [DEBUG] GET /
[2022-06-19 18:26:18 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:9)
[2022-06-19 18:26:18 +0000] [9] [INFO] Worker exiting (pid: 9)
[2022-06-19 18:26:18 +0000] [10] [INFO] Booting worker with pid: 10
[2022-06-19 18:26:18 +0000] [10] [DEBUG] GET /
As you can see from the logs, I wait a bit before requesting /. This results in the app hanging for 5 seconds. When I refresh again, the same thing happens. This is using:
- Chrome (Version 102.0.5005.115 (Official Build) (arm64))
- MacBook Pro (M1 Pro) running MacOS 12.4.
I figured out something interesting. The WORKER TIMEOUT does not appear to happen when I access the app in Safari (Version 15.5 (17613.2.7.1.8)).
Edit: add request headers for Chrome and Safari.
Request headers Chrome:
GET / HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Host: 127.0.0.1:8080
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Request headers Safari:
GET / HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Upgrade-Insecure-Requests: 1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15
Accept-Language: en-GB,en;q=0.9
Accept-Encoding: gzip, deflate
Connection: keep-alive
512M seems too small to run Python in Docker. Try increasing the memory size or the CPUs. Not sure what 0.5 means there. Why not 1?
I was able to reproduce the problem on Linux Mint 20.3 using @anonym79's example.
I confirm the worker is restarted 5s after the request when the client is Chrome. It does not happen when using Firefox or a simple curl.
Furthermore, I'm even able to reproduce the problem without docker! Simply running:
gunicorn --bind=0.0.0.0:8000 --timeout=5 app:app
and accessing localhost:8000 in Chrome triggers the worker restart.
I did some digging and this is my understanding:
you're using the sync worker class. This means:
- only one client is handled at a time
- after each request is handled, the socket is closed by the gunicorn worker (it does not honor any Keep-Alive header, as stated in the gunicorn documentation)
Apparently, browsers like Firefox and Safari are OK with that and will only attempt another connection when explicitly asked (pressing refresh, for example).
But it looks like Chrome detects that the socket is closed and, despite having the response to its request, it tries to re-connect immediately afterwards. My guess is that it assumes similar requests will come in the future, so it preemptively connects in a "smart" attempt to save connection time on the next request.
The same behavior can be reproduced by starting gunicorn:
gunicorn --bind=0.0.0.0:8000 --timeout=5 app:app
and in a second console do:
nc 0 8000
after 5s you should see the worker timeout.
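The same idle-connection behaviour can also be reproduced from Python; a minimal sketch, assuming gunicorn is listening on port 8000 as in the commands above:
# sketch: open a TCP connection to the sync worker and send nothing,
# mimicking Chrome's speculative pre-connect (or the nc command above)
import socket
import time

conn = socket.create_connection(("localhost", 8000))
time.sleep(10)  # longer than --timeout=5; the arbiter should log WORKER TIMEOUT
conn.close()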
There are multiple strategies to handle this depending on what you are trying to achieve. The simplest would be to launch gunicorn with multi-thread worker:
gunicorn --bind=0.0.0.0:8000 --timeout=5 --threads=10 app:app
(this would start blocking after 10 concurrent Chrome clients)
or, if your software stack allows it, use a gevent based worker:
gunicorn --bind=0.0.0.0:8000 --timeout=5 --worker-class=gevent app:app
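For reference, the same two options can be expressed in a gunicorn config file instead of on the command line; a small sketch (gevent additionally requires the gevent package to be installed and compatible code, as noted above):
# gunicorn.conf.py -- sketch of the config-file equivalents of the flags above
timeout = 5
# option 1: threaded (gthread) workers
threads = 10
# option 2: gevent workers (use instead of threads)
# worker_class = "gevent"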
That was an interesting find @tiagocoutinho, thank you. I will run with more workers and threads, but I was curious to find out why this happened in the first place.
I had a similar issue where CPU resources were being choked by other processes. The issue was resolved after I killed those processes.
Issue progress seems to have stalled. Closing for now. Please open another issue if you have a related problem.