
[CRITICAL] WORKER TIMEOUT when using Docker Compose

Open hughhan1 opened this issue 3 years ago • 7 comments

Problem

I have a Flask app that I'm running using Gunicorn. When I run it locally on my machine, it starts up within 5 seconds. But when I run it in a container via Docker Compose, startup exceeds the default timeout of 30 seconds, causing the workers to time out and exit.

Why does startup time take so long when running via Docker Compose?

Notes

  • I am aware that the --timeout option allows more time for the Gunicorn workers to start successfully. I am not interested in starting my workers eventually; I am interested in starting them within 30 seconds in a Docker container.
  • I've seen some resources suggesting to use gevent. This is likely not an option for me, because I likely have code (or may have code in the future) that is incompatible with gevent.

Potentially Helpful Information

What I've Tried

  • I've set what I think are sufficient limits for both CPU and memory for my Flask app within compose.yml.
  • I've set --worker-tmp-dir=/dev/shm, after reading https://pythonspeed.com/articles/gunicorn-in-docker/, but this did not help with startup time.
  • After configuring a more relaxed timeout (via --timeout), it takes about 2 minutes for the workers to start successfully. Once they are up, there does not seem to be any slowness with respect to serving incoming requests. (Both flags can also be set in the config file; see the sketch after this list.)
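
For reference, a minimal sketch of how those two flags map onto gunicorn's Python config file (the timeout value here is illustrative, not a recommendation):

# gunicorn.py -- file-based equivalents of --timeout and --worker-tmp-dir
timeout = 120                 # seconds of worker silence before the arbiter kills it
worker_tmp_dir = "/dev/shm"   # keep heartbeat files on tmpfs inside the container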

Logs

[2022-05-16 16:59:57 +0000] [1] [INFO] Starting gunicorn 20.0.4
[2022-05-16 16:59:57 +0000] [1] [DEBUG] Arbiter booted
[2022-05-16 16:59:57 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2022-05-16 16:59:57 +0000] [1] [INFO] Using worker: threads
[2022-05-16 16:59:57 +0000] [10] [INFO] Booting worker with pid: 10
[2022-05-16 16:59:57 +0000] [12] [INFO] Booting worker with pid: 12
[2022-05-16 16:59:57 +0000] [14] [INFO] Booting worker with pid: 14
[2022-05-16 16:59:57 +0000] [16] [INFO] Booting worker with pid: 16
[2022-05-16 16:59:57 +0000] [18] [INFO] Booting worker with pid: 18
[2022-05-16 16:59:57 +0000] [1] [DEBUG] 5 workers
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:10)
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:12)
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:14)
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:16)
[2022-05-16 17:00:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:18)
[2022-05-16 17:00:27 +0000] [16] [INFO] Worker exiting (pid: 16)
[2022-05-16 17:00:27 +0000] [12] [INFO] Worker exiting (pid: 12)
[2022-05-16 17:00:27 +0000] [14] [INFO] Worker exiting (pid: 14)
[2022-05-16 17:00:27 +0000] [10] [INFO] Worker exiting (pid: 10)
[2022-05-16 17:00:27 +0000] [18] [INFO] Worker exiting (pid: 18)

gunicorn.py

import multiprocessing

bind = ":8000"

# Cap the CPU count used for sizing at 2, then apply gunicorn's recommended
# (2 * cores) + 1 formula -- this yields the "5 workers" seen in the logs above.
hard_coded_cpu_limit = min(multiprocessing.cpu_count(), 2)
workers = hard_coded_cpu_limit * 2 + 1
threads = 4

app.py

from flask import Flask

app = Flask("my_app")  # module-level app that gunicorn loads as app:app

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8000, debug=True)

compose.yml

version: "3.9"

services:
  cloud-sql-proxy:
    container_name: "cloud-sql-proxy"
    image: gcr.io/cloudsql-docker/gce-proxy:1.30.1
    platform: linux/x86_64
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
    volumes:
      - ${PWD}/secrets/<REDACTED>
      - ${PWD}/cloudsql:/cloudsql
    ports:
      - 3306:3306
    command:
      [
        "/cloud_sql_proxy",
        "-dir=/cloudsql",
        "-instances=<REDACTED>=unix:cloud-sql-proxy.sock",
        "-credential_file=/secrets/<REDACTED>"
      ]

  my-app:
    container_name: my-app
    build: ${PWD}
    platform: linux/x86_64
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 8G
    volumes:
      - ${PWD}/secrets/<REDACTED>
      - ${PWD}/cloudsql:/cloudsql
      - ${PWD}:/app
    ports:
      # NOTE: map port 8000 on container to 8080 on host machine
      - 8080:8000
    command:
      [
        "pipenv",
        "run",
        "gunicorn",
        "app:app",
        "--name=my-app",
        "--config=gunicorn.py",
        "--worker-tmp-dir=/dev/shm",
        "--reload"
      ]
    restart: always
    env_file:
      - ${PWD}/.env

hughhan1 avatar May 16 '22 17:05 hughhan1

It seems like https://stackoverflow.com/questions/58942398/docker-running-a-flask-app-via-gunicorn-worker-timeouts-poor-performance is a similar post, but changing the worker temp directory doesn't fix the issue for me. 😢

hughhan1 avatar May 27 '22 20:05 hughhan1

I have experienced a similar problem, where I was seeing the 30s delay (the default timeout setting) on startup of a Dash app behind gunicorn in Docker. The worker was restarted, and the app then started serving. I have been able to reproduce this behaviour with a much simpler Flask app:

app.py

from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    return "Hello, world!"

requirements.txt

click==8.1.3
Flask==2.1.2
gunicorn==20.1.0
itsdangerous==2.1.2
Jinja2==3.1.2
MarkupSafe==2.1.1
Werkzeug==2.1.2

Dockerfile

FROM python:3.10.4-slim-bullseye
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD gunicorn -b 0.0.0.0:80 app:app --log-level debug --timeout 5

Environment and setup

$ docker --version
Docker version 20.10.16, build aa7e414

Building image:

docker build -t test_flask_gunicorn_docker .

Running container:

$ docker run -p 8080:80 --rm test_flask_gunicorn_docker
[2022-06-19 18:25:07 +0000] [7] [DEBUG] Current configuration:
  config: ./gunicorn.conf.py
  wsgi_app: None
  bind: ['0.0.0.0:80']
  backlog: 2048
  workers: 1
  worker_class: sync
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 5
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  print_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 0
  group: 0
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags: 
  statsd_prefix: 
  proc_name: None
  default_proc_name: app:app
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0xffff83a17be0>
  on_reload: <function OnReload.on_reload at 0xffff83a17d00>
  when_ready: <function WhenReady.when_ready at 0xffff83a17e20>
  pre_fork: <function Prefork.pre_fork at 0xffff83a17f40>
  post_fork: <function Postfork.post_fork at 0xffff83a300d0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0xffff83a301f0>
  worker_int: <function WorkerInt.worker_int at 0xffff83a30310>
  worker_abort: <function WorkerAbort.worker_abort at 0xffff83a30430>
  pre_exec: <function PreExec.pre_exec at 0xffff83a30550>
  pre_request: <function PreRequest.pre_request at 0xffff83a30670>
  post_request: <function PostRequest.post_request at 0xffff83a30700>
  child_exit: <function ChildExit.child_exit at 0xffff83a30820>
  worker_exit: <function WorkerExit.worker_exit at 0xffff83a30940>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0xffff83a30a60>
  on_exit: <function OnExit.on_exit at 0xffff83a30b80>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2022-06-19 18:25:07 +0000] [7] [INFO] Starting gunicorn 20.1.0
[2022-06-19 18:25:07 +0000] [7] [DEBUG] Arbiter booted
[2022-06-19 18:25:07 +0000] [7] [INFO] Listening at: http://0.0.0.0:80 (7)
[2022-06-19 18:25:07 +0000] [7] [INFO] Using worker: sync
[2022-06-19 18:25:07 +0000] [8] [INFO] Booting worker with pid: 8
[2022-06-19 18:25:07 +0000] [7] [DEBUG] 1 workers
[2022-06-19 18:26:06 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:8)
[2022-06-19 18:26:06 +0000] [8] [INFO] Worker exiting (pid: 8)
[2022-06-19 18:26:07 +0000] [9] [INFO] Booting worker with pid: 9
[2022-06-19 18:26:07 +0000] [9] [DEBUG] GET /
[2022-06-19 18:26:18 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:9)
[2022-06-19 18:26:18 +0000] [9] [INFO] Worker exiting (pid: 9)
[2022-06-19 18:26:18 +0000] [10] [INFO] Booting worker with pid: 10
[2022-06-19 18:26:18 +0000] [10] [DEBUG] GET /

As you can see from the logs, I wait a bit before requesting /. This results in the app hanging for 5 seconds. When I refresh again, the same thing happens. This is using:

  • Chrome (Version 102.0.5005.115 (Official Build) (arm64))
  • MacBook Pro (M1 Pro) running MacOS 12.4.

anonym79 avatar Jun 19 '22 19:06 anonym79

I figured out something interesting. The WORKER TIMEOUT does not appear to happen when I access the app in Safari (Version 15.5 (17613.2.7.1.8)).

Edit: added request headers for Chrome and Safari.

Request headers Chrome:

GET / HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Host: 127.0.0.1:8080
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"

Request headers Safari:

GET / HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Upgrade-Insecure-Requests: 1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15
Accept-Language: en-GB,en;q=0.9
Accept-Encoding: gzip, deflate
Connection: keep-alive

anonym79 avatar Jun 19 '22 19:06 anonym79

512M seems too small to run Python in Docker. Try increasing the memory size or the CPUs. Not sure what 0.5 means there; why not 1?

benoitc avatar Jun 20 '22 22:06 benoitc

I was able to reproduce the problem on my Linux Mint 20.3 machine using @anonym79's example.

I can confirm the worker is restarted 5s after the request when the client is Chrome. It does not happen when using Firefox or plain curl.

Furthermore, I'm even able to reproduce the problem without Docker! Simply running:

gunicorn --bind=0.0.0.0:8000 --timeout=5 app:app

and accessing localhost:8000 in Chrome triggers the worker restart.

tiagocoutinho avatar Jun 25 '22 08:06 tiagocoutinho

I did some digging and this is my understanding:

You're using the sync worker class. This means:

  • only one client is handled at a time
  • after each request is handled, the socket is closed by the gunicorn worker (it does not honor any Keep-Alive header, as stated in the gunicorn documentation)

Apparently, browsers like Firefox and Safari are fine with that and will only attempt another connection when explicitly asked (by pressing refresh, for example).

But it looks like Chrome detects that the socket is closed and, despite already having the response to its request, tries to reconnect immediately. My guess is that it assumes similar requests will come in the future, so it preemptively connects in a "smart" attempt to save connection time on the next request.

The same behavior can be reproduced by starting gunicorn:

gunicorn --bind=0.0.0.0:8000 --timeout=5 app:app

and in a second console do:

nc 0 8000

After 5s you should see the worker timeout.
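
For anyone without nc at hand, here is a minimal Python sketch of the same idle-connection reproduction (assuming gunicorn is listening on 127.0.0.1:8000 with --timeout=5):

import socket
import time

# Open a TCP connection but never send a request. The sync worker accepts
# the connection and blocks waiting for data, so it stops updating its
# heartbeat file and the arbiter kills it once the 5s timeout elapses.
conn = socket.create_connection(("127.0.0.1", 8000))
time.sleep(10)  # longer than --timeout; watch the gunicorn log meanwhile
conn.close()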

There are multiple strategies to handle this, depending on what you are trying to achieve. The simplest would be to launch gunicorn with the multi-threaded worker:

gunicorn --bind=0.0.0.0:8000 --timeout=5 --threads=10 app:app

(this would start blocking after 10 concurrent Chrome clients)

or, if your software stack allows it, use a gevent based worker:

gunicorn --bind=0.0.0.0:8000 --timeout=5 --worker-class=gevent app:app

tiagocoutinho avatar Jun 25 '22 14:06 tiagocoutinho

That was an interesting find, @tiagocoutinho, thank you. I will run with more workers and threads, but I was curious to find out why this happened in the first place.

anonym79 avatar Jul 01 '22 13:07 anonym79

I had a similar issue where CPU resources were being choked by other processes. The issue was resolved after I killed those processes.

mhrznamn068 avatar May 24 '23 08:05 mhrznamn068

Issue progress seems to have stalled. Closing for now. Please open another issue if you have a related problem.

tilgovi avatar Dec 29 '23 03:12 tilgovi