coronavirus-tracker-api
Investigate cause of API outage on April 7, 2020
Opening this issue to continue the discussion of the outage on April 7th. #264
Metrics from the time of the outage and afterwards
I will update this comment with new information as needed.
Notes and resources
- https://help.heroku.com/PFSOIDTR/why-am-i-seeing-h12-request-timeouts-high-response-times-in-my-app
- https://devcenter.heroku.com/articles/python-gunicorn
Possible causes
- memory leak
- Async routes contain a blocking call that stalls the event loop (see the sketch after this list)
- Gunicorn killing slow workers
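To illustrate the blocking-call theory, here's a minimal sketch of the failure mode and two common ways around it. The endpoint names below are made up for the example, not the project's actual routes.

```python
import asyncio
import time

from fastapi import FastAPI

app = FastAPI()

# Anti-pattern: a synchronous, blocking call inside an async route.
# While time.sleep() (or requests.get(), blocking file I/O, etc.) runs,
# this worker's event loop can serve no other requests.
@app.get("/blocking-example")
async def blocking_example():
    time.sleep(5)  # stands in for any slow synchronous call
    return {"ok": True}

# Fix 1: declare the route with plain `def` so FastAPI runs it in a threadpool.
@app.get("/threadpool-example")
def threadpool_example():
    time.sleep(5)  # still blocking, but only this thread, not the event loop
    return {"ok": True}

# Fix 2: keep `async def` but push the blocking work off the loop explicitly.
@app.get("/executor-example")
async def executor_example():
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, time.sleep, 5)
    return {"ok": True}
```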
Can you install monitoring software like Munin? It might help to spot issues. Both of my nodes have Munin installed; should a node become unavailable, I'll check if I can spot something useful.
@Kilo59 I was checking the FastAPI repo to see if there are any known issues that might match and I did find one that could be related: https://github.com/tiangolo/fastapi/issues/596 and this might be useful for debugging: https://stackoverflow.com/questions/15455048/releasing-memory-in-python
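If we want to measure the memory angle directly, Python's built-in tracemalloc module can compare allocation snapshots over time. A rough sketch, assuming we'd call the helper periodically or from a temporary debug endpoint; the names here are just for illustration:

```python
import tracemalloc

# Start tracking allocations as early as possible (e.g. at app import time)
# and record a baseline snapshot to compare against later.
tracemalloc.start()
baseline = tracemalloc.take_snapshot()

def report_top_allocations(limit: int = 10) -> None:
    """Print the call sites whose allocations grew the most since the baseline."""
    current = tracemalloc.take_snapshot()
    for stat in current.compare_to(baseline, "lineno")[:limit]:
        print(stat)
```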
I've also done a load test on my mirror to see if I can provoke a failure or something else that might shed some light on the issue. On three servers I started a total of 72 scripts curling the /all endpoint every second. Eventually one node did stop responding, but the load wasn't enough to take either one out entirely.
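For reference, each of those scripts boiled down to something like the following (a Python stand-in for the curl loop; the mirror URL is a placeholder):

```python
import time

import requests

TARGET = "https://mirror.example.invalid/all"  # placeholder, not the real mirror URL

# Hit the /all endpoint roughly once per second, like each of the 72 curl scripts.
while True:
    start = time.monotonic()
    try:
        resp = requests.get(TARGET, timeout=30)
        print(resp.status_code, f"{time.monotonic() - start:.2f}s")
    except requests.RequestException as exc:
        print("request failed:", exc)
    time.sleep(1)
```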
When starting the load test the average request duration dropped a little (300ms to 220ms), then it started shooting up to 2270ms. After stopping the test it dropped down to 1360ms, stayed there for another 25 minutes, then gradually dropped to 1310ms and seems to be staying there.
IMHO that can have two possible causes:
a) The way DO measures the average request duration is flawed and does not reflect the real value.
b) Something happens on the nodes (within the Docker containers, not the nodes themselves) that causes them to respond more slowly, e.g. open processes / file handles / event loops that are not closed properly. I assumed this already because I see the average request duration go up every 1-1.5 hours; obviously that assumption only works if a) is not the case.
I've also checked the container logs but I didn't find anything suspicious.
Now both nodes seem to have cooled down, and in my Munin stats I do see an interesting difference between the nodes. On both nodes the active memory consumption went up during the load test and seemed to stabilize after stopping the test. The second node experienced a sharper increase in active memory and now, more than an hour after stopping the test, I see it rising again. Let's see if that continues; if it does, we might have a culprit.
Looks like it has stabilized; here are the graphs:
Munin graphs (screenshots omitted); ignore the drop in the middle, that was another test.
DigitalOcean graphs (screenshots omitted)
Not sure what to make of this. I've started another test, simply curling every minute for about a day. From the moment I started the test the Average Request Duration steadily dropped; after 12 hours it was down to about 100ms and stayed there. But from 200ms down I saw an increase in 4xx responses. I just stopped the test; let's see if the Average Request Duration starts to increase again.
Memory-wise it hadn't stabilized as I thought; it kept slowly rising, but eventually dropped back down.
- https://help.heroku.com/PFSOIDTR/why-am-i-seeing-h12-request-timeouts-high-response-times-in-my-app
Sometimes this can turn into a bigger problem during restarts. Process forking done by Unicorn can be long/expensive to start up. If these workers take too long to start up or take too long on their first request, Unicorn will kill the processes and start up a new worker. This can result in a seemingly unexplained death spiral of workers being killed on startup, and when there aren't enough workers to handle requests, requests timeout, and workers are once again killed. Since the workers are sent a SIGKILL (no SIGTERM) they aren't able to log any errors during this time, so most of the time you'll just see a SIGKILL in your logs when this happens and nothing else.
That sounds like a potential reason. My Munin does have a couple of stats that should stick out if that happens.
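For context, the knobs involved in that spiral are Gunicorn's worker timeout settings. A sketch of what they look like in a gunicorn.conf.py; the values are placeholders for discussion, not the deployed configuration:

```python
# gunicorn.conf.py -- illustrative values only, not the app's current settings
import multiprocessing

# FastAPI apps under Gunicorn typically run with the Uvicorn worker class.
worker_class = "uvicorn.workers.UvicornWorker"
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds a worker may be silent before the master SIGKILLs it.
# If a fresh worker's first request has to build the data cache,
# this needs enough headroom to cover that.
timeout = 120

# Grace period for workers to finish in-flight requests during a restart.
graceful_timeout = 30
```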
Duplicate Log Messages
@Toxyl @Bost Here's another possible explanation for the duplicate log message. Looks like it was fixed upstream yesterday. Not sure if it's been cut into a new release yet. https://github.com/encode/uvicorn/issues/614#event-3222809178
Memory Leak
https://devcenter.heroku.com/articles/python-gunicorn#max-request-recycling
If your application suffers from memory leaks, you can configure Gunicorn to gracefully restart a worker after it has processed a given number of requests. This can be a convenient way to help limit the effects of the memory leak.
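In Gunicorn that recycling is just two settings; a sketch with placeholder numbers (they'd need tuning against real traffic):

```python
# gunicorn.conf.py -- worker recycling to contain a suspected memory leak.
# The numbers below are placeholders, not tuned values.

# Restart a worker after it has handled this many requests...
max_requests = 1000
# ...with some jitter so all workers don't restart at the same moment.
max_requests_jitter = 100
```

The same options exist on the command line as --max-requests and --max-requests-jitter.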
@Toxyl However, it seems like this could exacerbate the process-killing issue (if that is what was happening).
Sometimes this can turn into a bigger problem during restarts. Process forking done by Unicorn can be long/expensive to start up. If these workers take too long to start up or take too long on their first request, Unicorn will kill the processes and start up a new worker.
When a worker restarts, its first request has to build the data cache. I don't know if that takes long enough for the worker to be "killed", but it could. It might be a good idea to use a startup/background task to build this cache before a new worker processes its first request. https://fastapi.tiangolo.com/advanced/events/ https://fastapi.tiangolo.com/tutorial/background-tasks/
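Something like this is what I have in mind, assuming the cache-building can be factored into a function we call at startup (build_data_cache is a hypothetical name, not an existing helper):

```python
from fastapi import FastAPI

app = FastAPI()

async def build_data_cache():
    """Hypothetical stand-in for whatever populates the per-worker data cache."""
    ...

@app.on_event("startup")
async def warm_cache():
    # Runs once per worker before it accepts traffic, so the first real
    # request doesn't have to pay the cache-build cost (and risk a timeout).
    await build_data_cache()
```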
Just switched over to LogDNA and so far I like it. Looks like we are having worker timeouts pretty frequently.
Apr 13 19:20:28 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:20:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:191)
Apr 13 19:25:15 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:15 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:281)
Apr 13 19:25:29 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:286)
Apr 13 19:25:48 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:47 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:291)
Apr 13 19:26:00 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:00 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:296)
Apr 13 19:26:19 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:19 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:301)
Apr 13 19:26:32 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:32 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:306)
Apr 13 19:26:50 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:50 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:311)
Apr 13 19:27:03 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:03 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:316)
Apr 13 19:27:21 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:21 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:321)
Apr 13 19:27:34 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:34 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:326)
Apr 13 19:27:53 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:52 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:331)
Apr 13 19:28:06 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:05 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:336)
Apr 13 19:28:24 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:23 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:341)
Apr 13 19:28:37 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:37 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:346)
Apr 13 19:28:55 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:55 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:351)
Apr 13 19:29:08 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:08 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:356)
Apr 13 19:29:26 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:26 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:361)
Apr 13 19:29:39 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:39 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:366)
Apr 13 19:29:57 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:57 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:371)
Apr 13 19:30:10 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:10 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:376)
Apr 13 19:30:29 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:381)
Apr 13 19:30:42 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:41 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:386)
Apr 13 19:31:01 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:01 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:391)
Apr 13 19:31:13 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:13 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:396)
Apr 13 19:31:32 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:32 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:401)
Apr 13 19:31:44 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:44 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:406)
Apr 13 19:32:04 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:04 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:411)
Apr 13 19:32:15 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:15 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:416)
Apr 13 19:32:47 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:46 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:426)
Apr 13 19:33:19 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:33:18 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:431)
Apr 13 19:39:02 coronavirus-tracker-api app[web] ERROR [2020-04-13 23:39:01 +0000] [436] [ERROR] Exception in ASGI application
Apr 13 20:09:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 00:09:01 +0000] [436] [ERROR] Exception in ASGI application
Apr 13 20:33:52 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 00:33:52 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:436)
Apr 13 20:39:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 00:39:01 +0000] [421] [ERROR] Exception in ASGI application
Apr 13 21:00:18 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 01:00:17 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:441)
Apr 13 21:00:49 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 01:00:48 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:446)
Apr 13 21:09:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 01:09:02 +0000] [421] [ERROR] Exception in ASGI application