misc-server
blog and wiki keep going down
According to StatusCake these are constantly going down. Blog seems a bit worse than wiki.
According to the DigitalOcean logs everything is fine. They're getting a bit more traffic than normal, maybe 10-20 requests per minute, but all responses are 200s supposedly. Blog is at about 40% RAM usage and 20% CPU usage; wiki is at about 30% RAM and 60% CPU usage; and the shared database server is at about 68% RAM and 12% CPU usage.
There was a major spike in incoming connections and CPU/RAM usage last night around 19:11 Eastern Time, but the outages started getting bad around 17:26 Eastern Time so I'm not sure if it's related.
My best hypothesis is that either DigitalOcean sucks, or something about our setup sucks, and can't handle this much traffic.
Potential ideas:
- Bump up the server resources even more. Seems unlikely to help given that our RAM/CPU usage is not that high. Although maybe upgrading from the "basic" tier to "pro" tier gives us access to some less-flaky type of server. If we pay enough money we could even run two containers per service, load-balanced by DigitalOcean. This might be worth trying as a first attempt just to see if it makes a difference.
- Bump up the database server resources.
- Investigate more complicated in-container caching architectures to reduce the number of times we hit the database. My understanding was that since DigitalOcean puts a CDN in front of us, sending the right caching expiry headers would cause the CDN to cache the appropriate resources and not hit our source server as much. It seems like this should be enough for relatively-low-traffic sites like ours. But maybe we need to go beyond that somehow and do WordPress/MediaWiki-specific caching stuff.
- Try AWS instead of DigitalOcean.
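On the caching-headers idea: a quick way to sanity-check what we are actually sending is to parse the `Cache-Control` header and see how long a shared cache would be allowed to keep a response. This is only a rough sketch. The function name `shared_cache_ttl` is made up for illustration, it follows the standard `s-maxage` / `max-age` directives, and I have not verified that DigitalOcean's CDN honors them exactly this way.

```python
from typing import Mapping, Optional


def shared_cache_ttl(headers: Mapping[str, str]) -> Optional[int]:
    """Seconds a shared cache (e.g. a CDN) may keep the response, or None.

    Hypothetical helper. Simplification: "no-cache" is treated as
    "not usefully cacheable", although strictly it allows storing the
    response as long as it is revalidated before reuse.
    """
    # Header names are case-insensitive, so look the header up manually.
    raw = ""
    for name, value in headers.items():
        if name.lower() == "cache-control":
            raw = value
            break

    # Split "public, max-age=60" into {"public": "", "max-age": "60"}.
    directives = {}
    for part in raw.split(","):
        part = part.strip().lower()
        if not part:
            continue
        key, _, val = part.partition("=")
        directives[key.strip()] = val.strip().strip('"')

    if any(d in directives for d in ("no-store", "private", "no-cache")):
        return None

    # For shared caches, s-maxage takes precedence over max-age.
    for key in ("s-maxage", "max-age"):
        if key in directives:
            try:
                return max(int(directives[key]), 0)
            except ValueError:
                return None
    return None
```

Feeding it the headers of a real response from each site (for example `dict(response.headers)` from `urllib.request.urlopen`) would show whether pages come back with no usable lifetime, in which case every request would reach the container.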
Could it be a StatusCake issue? Is it easy to catch them being down?
I just got an email about wiki being down and tested it immediately, but it loaded for me. Maybe the response time is too long or something?
I've caught them down a few times, but it's possible the problem is less serious than StatusCake makes it appear, hmm.
When you've seen them down, has trying to reload fixed the problem, or has it been down for minutes at a time?
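To tell one-off blips from multi-minute outages (and to count slow responses as failures the way a monitor might), we could log our own probes and summarize them. A minimal sketch follows. The `Sample` type, the `longest_outage` name, and the 10-second slowness threshold are all assumptions for illustration, not anything StatusCake documents.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Sample:
    timestamp: float  # seconds since the epoch when the probe was sent
    ok: bool          # True if the probe got a successful response
    latency: float    # seconds until the response (or the timeout)


def longest_outage(samples: Iterable[Sample], slow_after: float = 10.0) -> float:
    """Longest stretch, in seconds, of consecutive failed or slow probes.

    A single bad probe surrounded by good ones counts as 0.0 (a blip).
    """
    longest = 0.0
    start: Optional[float] = None  # timestamp of the first bad probe in the current run
    for s in sorted(samples, key=lambda s: s.timestamp):
        bad = (not s.ok) or s.latency > slow_after
        if bad:
            if start is None:
                start = s.timestamp
            longest = max(longest, s.timestamp - start)
        else:
            start = None
    return longest
```

If this mostly reports 0.0 over a day of probes, reloading probably fixes it and the monitor may be overstating things. If it reports minutes, the outages are real.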
I'm thinking that maybe this is a warmup problem. Maybe instances are killed or somehow frozen when there hasn't been traffic for a while. This is how AppEngine behaves at least, although it's not the same kind of architecture so it's not exactly the same for sure.
Hmm, it was reported down in https://github.com/whatwg/blog.whatwg.org/issues/12, but works for me now.
@domenic when we last met you said the blog and wiki have stopped going down, but I wonder if really it's just the monitoring behavior that has changed...?
https://blog.whatwg.org/ has been down since (at least) yesterday, FWIW.
I've kicked the control panel again :(
I don't think this is a warmup or monitoring problem. I think this is either:
- DigitalOcean is bad at keeping uptime for its app platform; or
- The very-simple Docker image we have for the blog (basically just the WordPress base Docker image + its themes) is not production-quality in some way, and falls down very easily.
I think the next step here would be to try setting up the same simple Docker image on another service provider (e.g. AWS), and pointing blog.whatwg.org to that deployment for a few months, and seeing if it's better. That would narrow down whether the problem is DigitalOcean's app platform or our Docker image.
It’s down again.
FYI, it's down again. (Let me know if this is not the best place to post alerts.)
So I narrowed down this problem to something about the CDN fronting the blog and wiki. Right now the blog at https://blog.whatwg.org/ is down. However the deployment URL https://blog-6tqz3.ondigitalocean.app/ is up. And all the logs show 200 requests to the internal URL.
I am going to try contacting DigitalOcean support since this seems like it is not our problem. (I.e., it is not the server being overloaded because we don't have enough caching, or something like that.)
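For anyone who wants to repeat this check during the next outage, here is a rough sketch of the comparison: probe the public URL and the deployment URL, then classify the result. The names `is_up` and `diagnose` are made up for illustration, and the labels are only my reading of what each combination suggests.

```python
import http.client
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 10.0) -> bool:
    """True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections,
        # timeouts, and malformed responses.
        return False


def diagnose(public_up: bool, origin_up: bool) -> str:
    """Turn the two probe results into a guess at where the fault lies."""
    if public_up and origin_up:
        return "both up"
    if origin_up:
        return "public URL down, origin up: suspect CDN, DNS, or routing"
    if public_up:
        return "origin down, public URL up: probably served from cache"
    return "both down: suspect the container or the platform"
```

Calling `diagnose(is_up("https://blog.whatwg.org/"), is_up("https://blog-6tqz3.ondigitalocean.app/"))` while the blog is down should reproduce the second case described above.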
DigitalOcean support has been unhelpful both times I tried pointing them at live outages.
Today I was pointed to https://fly.io/ which seems really promising?? Maybe we should try switching to that. We can switch just the sites at first, having them connect to the existing DigitalOcean database, and if that works then switch the database too.