
Intermittent CRITICAL WORKER TIMEOUT errors in a non-containerized environment

Open lucknerjb opened this issue 3 years ago • 5 comments

Currently, our app servers are consistently showing "CRITICAL WORKER TIMEOUT" errors in the logs, with some gunicorn workers being killed with signal 9 and others not. The issue is that the workers are failing silently and requests are being dropped, with no pattern in terms of which requests are affected.

CPU, memory, and disk space all check out across our EC2 instances and our RDS cluster. It's worth noting that our test environment, which sees next to no traffic, isn't showing this issue. Could it be a VPC-related issue then?

We've been at this for 3 full days now; if anyone has any ideas, I'm all ears!

Things we've done / tried

  • Upgrading mysqlclient and gunicorn to their latest versions
  • Playing around with the timeout values for gunicorn. Anything above 30 seconds brings the system to a halt due to the request queue filling up (see the config sketch after this list)
  • Bypassing supervisor and running gunicorn directly from the command-line (being reverse-proxied as usual by Nginx)
  • Connecting to the DB via its private IP vs using its DNS name
  • Running gunicorn independently of ddtrace-run and even stopping the datadog-agent service
  • Turning off Sendgrid webhook notifications to reduce the load on our servers
  • Rebooting the servers, rebooting the RDS cluster, enabling more logging on RDS and at the application level (DEBUG vs INFO) and, finally, increasing max_allowed_packet for RDS to 256M
  • Checking dmesg, journalctl, and our file descriptor stats as well; everything checks out so far
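
For reference, a minimal gunicorn.conf.py sketch of the kind of settings involved in the tuning above; the bind address, worker count, and exact values are illustrative assumptions, not the production configuration:

# gunicorn.conf.py -- illustrative sketch only; values are assumptions,
# not the exact production configuration described in this issue.
import multiprocessing

# Bound behind Nginx, which reverse-proxies to this address (assumed).
bind = "127.0.0.1:8000"

# Default sync worker; the one-worker-per-core count here is illustrative.
worker_class = "sync"
workers = multiprocessing.cpu_count()

# The setting being tuned: the arbiter kills (and logs CRITICAL WORKER
# TIMEOUT for) any worker that has not notified it within this many seconds.
timeout = 30
graceful_timeout = 30

# Extra logging while debugging.
loglevel = "debug"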

lucknerjb avatar Jan 23 '22 21:01 lucknerjb

Are you able to share the relevant logs on failure?

javabrett avatar Jan 23 '22 21:01 javabrett

Playing around with the timeout values for gunicorn. Anything above 30 seconds brings the system to a halt due to the request queue filling up

Possible red flag? What worker are you using? Do you have slow clients?

javabrett avatar Jan 23 '22 21:01 javabrett

@javabrett Here's a snippet from our supervisor logs. We're using the default sync worker, with 16 workers running on a 16-core box; we typically have 5 workers running on the same box size. There shouldn't be any slow clients, though we think some requests are waiting on something and then timing out (making them in essence "slow"); one way to confirm this from the worker side is sketched below, after the log snippet. Gunicorn is failing silently, though.

DEBUG 2022-01-23 16:32:36,159 authorization 10733 139806107792960 API being used.
[2022-01-23 16:32:36 -0500] [3378] [CRITICAL] WORKER TIMEOUT (pid:9867)
DEBUG 2022-01-23 16:32:36,359 authorization 10516 139806107792960 API being used.
[2022-01-23 16:32:36 -0500] [10826] [INFO] Booting worker with pid: 10826
DEBUG 2022-01-23 16:32:41,324 receivers 10826 139806107792960 ApplicantConsentChangeEvent created: XX - True for applicant: XX
[2022-01-23 16:32:41 -0500] [3378] [CRITICAL] WORKER TIMEOUT (pid:9923)
DEBUG 2022-01-23 16:32:42,021 factory 10608 139806107792960 Loading s3:s3
DEBUG 2022-01-23 16:32:42,024 factory 10608 139806107792960 Loading s3:Bucket
DEBUG 2022-01-23 16:32:42,024 model 10608 139806107792960 Renaming Bucket attribute name
INFO 2022-01-23 16:32:42,157 decorators 10733 139806107792960 Received API call
INFO 2022-01-23 16:32:42,184 cache 10733 139806107792960 Scorecalc cache miss
INFO 2022-01-23 16:32:42,187 cache 10733 139806107792960 Scorecalc cache miss
INFO 2022-01-23 16:32:42,230 decorators 10733 139806107792960 Returning API response

Let me know if you'd like to see a different set of logs
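
A minimal sketch of one way to confirm which requests go quiet, using gunicorn's standard pre_request/post_request server hooks; the attribute used to stash the start time is an arbitrary choice. Requests that log a start line but never a finish line are the ones a worker was stuck in when the arbiter killed it:

# gunicorn.conf.py (sketch) -- time each request from inside the worker.
import time

def pre_request(worker, req):
    # The sync worker handles one request at a time, so stashing the start
    # time on the worker object is safe here.
    worker._request_started = time.monotonic()
    worker.log.debug("start %s %s", req.method, req.path)

def post_request(worker, req, environ, resp):
    started = getattr(worker, "_request_started", None)
    if started is not None:
        worker.log.debug("finish %s %s in %.3fs",
                         req.method, req.path, time.monotonic() - started)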

lucknerjb avatar Jan 23 '22 21:01 lucknerjb

With a sync worker, all I can suggest is adding some instrumentation to try to determine where requests are slowing down, blocking, or getting stuck.

Something like pystuck might help (a minimal sketch follows the links below):

  • https://github.com/alonho/pystuck
  • https://medium.com/@masayuki/getting-thread-dump-of-each-uwsgi-process-9b4c1eab8e15
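
A minimal sketch of wiring pystuck into each worker via gunicorn's post_fork server hook, so a hung worker's thread stacks can be inspected from outside; the port numbering is arbitrary and the run_server() signature is assumed from the pystuck README:

# gunicorn.conf.py (sketch) -- start a pystuck server in every worker.
import pystuck  # assumes pystuck is installed (pip install pystuck)

def post_fork(server, worker):
    # Each worker gets its own port so it can be inspected independently
    # (6666 is pystuck's default; offsetting by worker age is arbitrary).
    port = 6666 + worker.age
    pystuck.run_server(port=port)
    worker.log.info("pystuck listening on port %s", port)

Attaching would then be something like running the pystuck client against the stuck worker's port, per the README linked above.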

javabrett avatar Jan 25 '22 00:01 javabrett

I have the exact same issue, but the workers are timing out even though the app is not receiving any requests.

[2022-08-20 14:44:43 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:83127)
[2022-08-20 14:44:43 +0100] [83127] [INFO] Worker exiting (pid: 83127)
[2022-08-20 14:44:43 +0100] [84393] [INFO] Booting worker with pid: 84393
[2022-08-20 15:01:16 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84393)
[2022-08-20 15:01:16 +0100] [84393] [INFO] Worker exiting (pid: 84393)
[2022-08-20 15:01:16 +0100] [84397] [INFO] Booting worker with pid: 84397
[2022-08-20 15:06:41 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84397)
[2022-08-20 15:06:41 +0100] [84397] [INFO] Worker exiting (pid: 84397)
[2022-08-20 15:06:41 +0100] [84403] [INFO] Booting worker with pid: 84403
[2022-08-20 15:24:19 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84403)
[2022-08-20 15:24:19 +0100] [84403] [INFO] Worker exiting (pid: 84403)
[2022-08-20 15:24:19 +0100] [84409] [INFO] Booting worker with pid: 84409
[2022-08-20 15:40:57 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84409)
[2022-08-20 15:40:57 +0100] [84409] [INFO] Worker exiting (pid: 84409)
[2022-08-20 15:40:57 +0100] [84417] [INFO] Booting worker with pid: 84417
[2022-08-20 15:41:42 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84417)
[2022-08-20 15:41:42 +0100] [84417] [INFO] Worker exiting (pid: 84417)
[2022-08-20 15:41:42 +0100] [84420] [INFO] Booting worker with pid: 84420
[2022-08-20 15:42:17 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84420)
[2022-08-20 15:42:17 +0100] [84420] [INFO] Worker exiting (pid: 84420)
[2022-08-20 15:42:17 +0100] [84427] [INFO] Booting worker with pid: 84427
[2022-08-20 15:58:59 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84427)
[2022-08-20 15:58:59 +0100] [84427] [INFO] Worker exiting (pid: 84427)
[2022-08-20 15:58:59 +0100] [84433] [INFO] Booting worker with pid: 84433
[2022-08-20 16:16:42 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84433)
[2022-08-20 16:16:42 +0100] [84433] [INFO] Worker exiting (pid: 84433)
[2022-08-20 16:16:42 +0100] [84439] [INFO] Booting worker with pid: 84439
[2022-08-20 16:33:49 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84439)
[2022-08-20 16:33:49 +0100] [84439] [INFO] Worker exiting (pid: 84439)
[2022-08-20 16:33:49 +0100] [84447] [INFO] Booting worker with pid: 84447
[2022-08-20 16:51:27 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84447)
[2022-08-20 16:51:27 +0100] [84447] [INFO] Worker exiting (pid: 84447)
[2022-08-20 16:51:27 +0100] [84452] [INFO] Booting worker with pid: 84452
[2022-08-20 17:06:56 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84452)
[2022-08-20 17:06:56 +0100] [84452] [INFO] Worker exiting (pid: 84452)
[2022-08-20 17:06:56 +0100] [84460] [INFO] Booting worker with pid: 84460
[2022-08-20 17:08:15 +0100] [83120] [CRITICAL] WORKER TIMEOUT (pid:84460)
[2022-08-20 17:08:16 +0100] [84460] [INFO] Worker exiting (pid: 84460)
[2022-08-20 17:08:16 +0100] [84464] [INFO] Booting worker with pid: 84464

I am running this locally on my computer, and if my computer is idle the workers start timing out.

ghost avatar Aug 20 '22 16:08 ghost

This looks like an application issue. You can always run gunicorn with the --check-config option first to ensure nothing is wrong with the configuration. A worker timeout means the worker didn't notify the arbiter that it's alive, which means something is blocking in your app, unless you see another issue in the logs.
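
To see where a worker was stuck at the moment the arbiter gave up on it, one option is the worker_abort server hook, which gunicorn calls when a worker receives SIGABRT on timeout. A minimal sketch using the stdlib faulthandler module:

# gunicorn.conf.py (sketch) -- dump the worker's stack when it is aborted
# for missing its heartbeat, so the blocking call shows up in the logs.
import faulthandler
import sys

def worker_abort(worker):
    # Called on SIGABRT, which generally happens on timeout; dump every
    # thread's traceback before the worker dies.
    worker.log.critical("worker %s timed out, dumping stacks", worker.pid)
    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)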

benoitc avatar May 07 '23 16:05 benoitc