
Investigate cause of API outage on April 7 2020

Open Kilo59 opened this issue 4 years ago • 6 comments

Opening this issue to continue the discussion of the April 7th outage. #264

Metrics from the time of the outage and afterwards

I will update this comment with new information as needed.

Notes and resources

  1. https://help.heroku.com/PFSOIDTR/why-am-i-seeing-h12-request-timeouts-high-response-times-in-my-app
  2. https://devcenter.heroku.com/articles/python-gunicorn

Possible causes

  1. Memory leak
  2. An async route contains a blocking call that blocks the event loop (see the sketch after this list).
  3. Gunicorn killing slow workers
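
As a concrete illustration of cause 2, here is a minimal sketch (hypothetical handler names and URL, not the actual app code): a synchronous HTTP call inside an `async def` route stalls the event loop for every other request, while pushing it onto a thread keeps the loop free.

```python
import asyncio

import requests  # synchronous HTTP client
from fastapi import FastAPI

app = FastAPI()

UPSTREAM = "https://example.com/upstream.csv"  # placeholder URL


@app.get("/all-blocking")
async def all_blocking():
    # requests.get() blocks: while it waits on the network, the event loop
    # cannot serve any other request, so everything queues behind this call.
    data = requests.get(UPSTREAM).text
    return {"size": len(data)}


@app.get("/all-nonblocking")
async def all_nonblocking():
    # Running the blocking call in a thread pool keeps the event loop free.
    loop = asyncio.get_running_loop()
    data = await loop.run_in_executor(None, lambda: requests.get(UPSTREAM).text)
    return {"size": len(data)}
```

Declaring the route with a plain `def` (no `async`) has a similar effect, since FastAPI then runs it in its internal threadpool.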

Kilo59 avatar Apr 09 '20 10:04 Kilo59

Can you install monitoring software like Munin? It might help to spot issues. Both of my nodes have Munin installed; should a node become unavailable, I'll check if I can spot something useful.

toxyl avatar Apr 09 '20 11:04 toxyl

@Kilo59 I was checking the FastAPI repo to see if there are any known issues that might match, and I did find one that could be related (https://github.com/tiangolo/fastapi/issues/596), and this might be useful for debugging: https://stackoverflow.com/questions/15455048/releasing-memory-in-python

I've also done a load test on my mirror to see if I can provoke a failure or something else that might shed some light on the issue. Across three servers I started a total of 72 scripts curling the /all endpoint every second. Eventually one node did stop responding, but the load wasn't enough to take either node out entirely.

When the load test started, the average request duration dropped a little (300ms to 220ms), then shot up to 2270ms. After stopping the test it dropped to 1360ms, stayed there for another 25 minutes, then gradually declined to 1310ms and seems to be staying there.

IMHO that can have two possible causes: a) the way DO measures the average request duration is flawed and does not reflect the real value, or b) something happens on the nodes (within the Docker containers, not the nodes themselves) that causes them to respond more slowly, e.g. open processes / file handles / event loops that are not closed properly. I assumed the latter already because I see the average request duration go up every 1-1.5 hours - obviously that assumption only holds if a) is not the case.
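
For what it's worth, here is a minimal sketch of the pattern that avoids leaked handles. It is purely illustrative: `httpx` and the endpoint below are assumptions, not code from this repo. The idea is one shared client per worker, closed on shutdown, instead of a new client (and new sockets) per request.

```python
import httpx
from fastapi import FastAPI

app = FastAPI()
client = httpx.AsyncClient()  # one shared connection pool per worker process


@app.on_event("shutdown")
async def close_client():
    # Release the pooled sockets when the worker exits so file handles
    # are not left dangling.
    await client.aclose()


@app.get("/upstream")
async def upstream():
    # Reuses the pooled connections; no new client (and no stray handles) per call.
    resp = await client.get("https://example.com/upstream.csv")
    return {"size": len(resp.text)}
```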

I've also checked the container logs but I didn't find anything suspicious.

Now both nodes seem to have cooled down, and in my Munin stats I see an interesting difference between them. On both nodes the active memory consumption went up during the load test and seemed to stabilize after the test stopped. The second node experienced a sharper increase in active memory, and now, more than an hour after stopping the test, I see it rising again. Let's see if that continues; if so, we might have a culprit.

toxyl avatar Apr 09 '20 18:04 toxyl

Looks like it has stabilized; here are the graphs:

Munin

cvtapi-munin (ignore the drop in the middle; that was another test)

DigitalOcean

cvtapi-loadbalancer

toxyl avatar Apr 09 '20 20:04 toxyl

Not sure what to make of this. I've started another test, simply curling every minute for about a day. From the moment I started the test the Average Request Duration steadily dropped; after 12 hours it was down to about 100ms and stayed there. But from 200ms down I saw an increase in 4xx responses. I've just stopped the test; let's see if the Average Request Duration starts to increase again.

Memory-wise it hadn't stabilized as I thought; it slowly kept rising but eventually dropped back down.

  1. https://help.heroku.com/PFSOIDTR/why-am-i-seeing-h12-request-timeouts-high-response-times-in-my-app

Sometimes this can turn into a bigger problem during restarts. Process forking done by Unicorn can be long/expensive to start up. If these workers take too long to start up or take too long on their first request, Unicorn will kill the processes and start up a new worker. This can result in a seemingly unexplained death spiral of workers being killed on startup, and when there aren't enough workers to handle requests, requests timeout, and workers are once again killed. Since the workers are sent a SIGKILL (no SIGTERM) they aren't able to log any errors during this time, so most of the time you'll just see a SIGKILL in your logs when this happens and nothing else.

That sounds like a potential reason. My Munin does have a couple of stats that should stick out if that happens.
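
One way to give cold workers more headroom before Gunicorn kills them would be raising the worker timeout. A hypothetical `gunicorn_conf.py` sketch, not this project's actual config:

```python
# gunicorn_conf.py -- hypothetical settings, not the project's actual config.
worker_class = "uvicorn.workers.UvicornWorker"
workers = 2            # keep modest on a small dyno to limit memory pressure
timeout = 120          # default is 30s; a cold worker building its first response may exceed it
graceful_timeout = 30  # let workers finish in-flight requests before being restarted
```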

toxyl avatar Apr 11 '20 09:04 toxyl

Duplicate log messages

@Toxyl @Bost Here's another possible explanation for the duplicate log message. Looks like it was fixed upstream yesterday. Not sure if it's been cut into a new release yet. https://github.com/encode/uvicorn/issues/614#event-3222809178

Memory leak

https://devcenter.heroku.com/articles/python-gunicorn#max-request-recycling

If your application suffers from memory leaks, you can configure Gunicorn to gracefully restart a worker after it has processed a given number of requests. This can be a convenient way to help limit the effects of the memory leak.
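
As a hedged sketch, that recycling could look like this in a hypothetical `gunicorn_conf.py` (values are illustrative only):

```python
# gunicorn_conf.py -- hypothetical max-request recycling settings.
max_requests = 1000        # restart a worker after ~1000 requests so a slow leak cannot grow unbounded
max_requests_jitter = 100  # stagger restarts so workers do not all recycle at the same moment
```

The jitter is what keeps every worker from restarting (and rebuilding its cache) at the same time.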

@Toxyl However, it seems like this could exacerbate the process-killing issue (if that is what was happening).

Sometimes this can turn into a bigger problem during restarts. Process forking done by Unicorn can be long/expensive to start up. If these workers take too long to start up or take too long on their first request, Unicorn will kill the processes and start up a new worker.

The first time a worker restarts and a request comes in, it has to build its data cache. I don't know if that takes long enough for the worker to be "killed", but it could. It might be a good idea to use a startup/background task to build this cache before a new worker processes its first request (see the sketch below). https://fastapi.tiangolo.com/advanced/events/ https://fastapi.tiangolo.com/tutorial/background-tasks/
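
A minimal sketch of that startup warm-up idea, with hypothetical names (the repo's real data layer isn't reproduced here):

```python
from fastapi import FastAPI

app = FastAPI()


async def build_data_cache():
    """Placeholder for whatever expensive fetch/parse the first request currently triggers."""
    ...


@app.on_event("startup")
async def warm_cache():
    # Runs once per worker before it starts serving traffic, so the first
    # real request does not pay the cache-build cost under Gunicorn's timeout.
    await build_data_cache()
```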

Kilo59 avatar Apr 11 '20 12:04 Kilo59

Just switched over to LogDNA and so far I like it. Looks like we are having worker timeouts pretty frequently.

Apr 13 19:20:28 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:20:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:191)
Apr 13 19:25:15 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:15 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:281)
Apr 13 19:25:29 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:286)
Apr 13 19:25:48 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:25:47 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:291)
Apr 13 19:26:00 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:00 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:296)
Apr 13 19:26:19 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:19 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:301)
Apr 13 19:26:32 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:32 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:306)
Apr 13 19:26:50 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:26:50 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:311)
Apr 13 19:27:03 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:03 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:316)
Apr 13 19:27:21 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:21 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:321)
Apr 13 19:27:34 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:34 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:326)
Apr 13 19:27:53 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:27:52 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:331)
Apr 13 19:28:06 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:05 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:336)
Apr 13 19:28:24 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:23 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:341)
Apr 13 19:28:37 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:37 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:346)
Apr 13 19:28:55 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:28:55 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:351)
Apr 13 19:29:08 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:08 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:356)
Apr 13 19:29:26 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:26 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:361)
Apr 13 19:29:39 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:39 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:366)
Apr 13 19:29:57 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:29:57 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:371)
Apr 13 19:30:10 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:10 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:376)
Apr 13 19:30:29 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:28 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:381)
Apr 13 19:30:42 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:30:41 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:386)
Apr 13 19:31:01 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:01 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:391)
Apr 13 19:31:13 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:13 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:396)
Apr 13 19:31:32 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:32 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:401)
Apr 13 19:31:44 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:31:44 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:406)
Apr 13 19:32:04 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:04 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:411)
Apr 13 19:32:15 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:15 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:416)
Apr 13 19:32:47 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:32:46 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:426)
Apr 13 19:33:19 coronavirus-tracker-api app[web] CRITICAL [2020-04-13 23:33:18 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:431)
Apr 13 19:39:02 coronavirus-tracker-api app[web] ERROR [2020-04-13 23:39:01 +0000] [436] [ERROR] Exception in ASGI application
Apr 13 20:09:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 00:09:01 +0000] [436] [ERROR] Exception in ASGI application
Apr 13 20:33:52 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 00:33:52 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:436)
Apr 13 20:39:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 00:39:01 +0000] [421] [ERROR] Exception in ASGI application
Apr 13 21:00:18 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 01:00:17 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:441)
Apr 13 21:00:49 coronavirus-tracker-api app[web] CRITICAL [2020-04-14 01:00:48 +0000] [4] [CRITICAL] WORKER TIMEOUT (pid:446)
Apr 13 21:09:02 coronavirus-tracker-api app[web] ERROR [2020-04-14 01:09:02 +0000] [421] [ERROR] Exception in ASGI application

Kilo59 avatar Apr 14 '20 01:04 Kilo59