Full-httpd restarts taking ~60 seconds and leave server inaccessible
I've been "bpftracing" a slow httpd restart problem that cropped up when we upgraded our AWS VM to their latest Amazon Linux 2023 a while back (and also from Python 3.8.5 to 3.11). Full httpd restarts (sudo systemctl restart httpd) almost always take 60+ seconds, during which time all sites are inaccessible and return "connection refused" (edit: not 5xx as I originally wrote). Wasn't sure at first, but from tracing it looks to be possibly mod_wsgi related.
What I've seen is that, on a requested restart, the root httpd PID repeatedly attempts kill() syscalls on the very last httpd child PID, every 1 second, until that 60-second timeout elapses. That last child PID is not a site-specific WSGI daemon process; it comes right after them - sudo systemctl status httpd shows something like:
├─676367 /usr/sbin/httpd -DFOREGROUND
├─676369 /usr/bin/python3.11 /etc/httpd/error_logger.py
├─676370 /usr/sbin/httpd -DFOREGROUND
├─676371 "(wsgi:lukerissa" -DFOREGROUND
├─676372 "(wsgi:ra) " -DFOREGROUND
├─676373 "(wsgi:analytics" -DFOREGROUND
....
├─676392 "(wsgi:finance) " -DFOREGROUND
├─676393 /usr/sbin/httpd -DFOREGROUND <--- root httpd repeatedly tries to kill this
When I trace that last child (e.g. 676393 above) during normal activity, I see it's doing reads and writes over Unix domain sockets like \x01\x00/etc/httpd/run/wsgi.676367.0.5.sock\x00, presumably proxying WSGI responses, and also doing some opens/reads of static content files from multiple different sites.
This seems like it might be related to WSGIDaemonProcess's socket-timeout or response-socket-timeout, which we don't explicitly set, so they resolve to the Timeout directive, which is also at its default (60).
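For reference, here's how those could be pinned explicitly instead of inheriting from Timeout - a sketch with made-up values and a made-up process-group name, not our actual config:

```apache
# Illustrative values only. When unset, both options fall back to the
# Timeout directive (default 60 seconds).
# socket-timeout bounds individual reads/writes on the proxy socket
# between the Apache worker and the daemon process;
# response-socket-timeout bounds writes of the response back to Apache.
WSGIDaemonProcess example-app threads=4 \
    socket-timeout=15 response-socket-timeout=15
```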
I can keep digging but thought I'd check here if anything stands out (and have it here in case others can benefit). If it's related to HTTP keepalives or browser clients whose connections have stalled, I'd prefer to aggressively disconnect those to keep the restart quick, rather than take quite a few sites offline for 60+ seconds.
Can't reply in detail right now, but that looks like it is the Apache child worker process which proxies to the mod_wsgi daemon processes. When a graceful restart is done, Apache will give them extra time to complete outstanding requests if it can. If however the maximum for Apache child processes is low and none of the child worker processes are shutting down quickly, it may not be able to spawn new worker processes in parallel because of the limit, and so hangs until the timeout on the socket connection.
Let me think for a bit about what I can suggest or what else to check. You might post your Apache MPM settings so I can look at them.
Also what is keep alive timeout set to?
Since you are using daemon mode, make sure you have the following set at global config scope.
WSGIRestrictEmbedded On
This ensures that Python is not being initialised in those Apache worker processes. It saves memory and makes startup/shutdown quicker. If no requests are being handled by mod_wsgi in those Apache worker processes, then Python should not block orderly shutdown. If you inadvertently had a misconfiguration and they were being used, it could be a similar issue to the interpreter-destruction one.
Note that setting this will result in requests getting an error now if they were being handled by mod_wsgi in Apache worker processes rather than daemon mode processes.
Sure, no hurry - here's the basic MPM / WSGI configuration:
# Using mpm_worker
# Top-level settings:
StartServers 1
ServerLimit 1
ThreadLimit 256
ThreadsPerChild 256
MaxRequestWorkers 256
ThreadStackSize 1048576
WSGIDestroyInterpreter Off
# Typical WSGI app config:
<VirtualHost *:443>
# ...
WSGIDaemonProcess lukerissacher display-name=%{GROUP} lang=en_US.UTF-8 locale=en_US.UTF-8 threads=4 stack-size=1048576
WSGIScriptAlias / "/var/www/wsgi/LukeRissacher/application.py" process-group=lukerissacher application-group=%{GLOBAL}
</VirtualHost>
Just added WSGIRestrictEmbedded On to the config. No harm to applications (it's daemon procs only that run python) but the 60 second restart was still there on the 2nd restart.
Set:
ServerLimit 2
This is a hard limit on the number of Apache worker processes that can run in parallel.
Even if MaxRequestWorkers and ThreadsPerChild are set such that notionally at most one Apache worker process runs, setting ServerLimit to at least 2 should allow a second worker process to be created during the transition on a graceful restart. New requests will go to this process while Apache waits for the old one to shut down.
This may help with perceived 60 second wait, but doesn't solve the root cause as yet.
At least from memory that is how it works.
Now I'm questioning what I set ServerLimit to for mod_wsgi-express. I need to check.
Just updated to ServerLimit 2; on first restart switching to that new config (while ServerLimit 1 was still in place), it actually completed in ~5 sec. Occasionally the delay doesn't happen though, seems to depend on traffic characteristics.
I got a 60 second delay after the 2nd restart though (with ServerLimit 2 in place) and the server was unresponsive during that time - immediate "connection refused" message in the browser, I think that's TCP-level.
Forgot to answer your KeepAliveTimeout question - those settings are on the defaults (keepalives on, 5 sec timeouts).
If the second restart was within 60 seconds of a prior one, it could well stall on that one because you would already have been at the limit. Anyway, if you have the memory, play with ServerLimit and perhaps even try it at 5. That technically should allow 4 restarts within a worst-case 60 seconds. :-)
Interesting, yes 2nd restart might have been within 60 sec - no prob for me to wait > 60 sec between restart tests. I'll do some more tracing tonight on the process spawn/kill behavior, see how long / which old processes are sticking around, and maybe try higher ServerLimit. Don't want to interfere with the live server in the daytime here (US Pacific), a lot of traffic including customers'. And unfortunately I haven't been able to reproduce the problem on a snapshot copy of the VM, seems to need live traffic.
One possible culprit: many of the py apps spawn a (non-daemon) thread on 1st request to buffer & transmit logging data. Every 5 seconds, if there are pending events to log, it will issue an https request (usually to a logging app on the same server). If the logging app is unreachable the thread will retry with an exponential backoff, which might mark things "alive" from Apache or mod_wsgi's perspective. I tried disabling those threads for the app-restart problem earlier, with no effect, but I don't think I've tried it for the whole-server restart here. I'll investigate tonight a bit.
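For what it's worth, a minimal sketch of that kind of background logger, with the thread marked `daemon=True` so it can't hold up interpreter shutdown, and a capped exponential backoff (all names here are hypothetical, this isn't the actual app code):

```python
import threading
import time

def backoff_seconds(attempt, base=5, cap=300):
    """Capped exponential backoff: 5, 10, 20, ... up to `cap` seconds."""
    return min(base * (2 ** attempt), cap)

class BufferedLogger:
    def __init__(self, flush_fn, interval=5):
        self.flush_fn = flush_fn      # e.g. an HTTPS POST to the logging app
        self.interval = interval
        self.events = []
        self.lock = threading.Lock()
        # daemon=True: a stuck flush can't block interpreter shutdown
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def log(self, event):
        with self.lock:
            self.events.append(event)

    def _run(self):
        failures = 0
        while True:
            time.sleep(self.interval if failures == 0
                       else backoff_seconds(failures))
            with self.lock:
                pending, self.events = self.events, []
            if not pending:
                continue
            try:
                self.flush_fn(pending)
                failures = 0
            except Exception:
                failures += 1
                with self.lock:       # requeue and retry later
                    self.events = pending + self.events
```

Whether a daemon thread is acceptable depends on how much buffered log data you can afford to drop on shutdown.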
How long are your longest requests?
With default settings in mod_wsgi for daemon mode there is no proper request timeout. If the issue is that you have requests which are getting stuck and never completing, then this will hold up shutdown of the Apache worker processes on a graceful restart. Those stuck requests will only be aborted on the worker process side when the socket read timeout is reached (that 60 seconds). Even when this occurs, it doesn't unblock the stuck request handler in the mod_wsgi daemon process.
If you don't believe you would have very long running requests, you can set an explicit request timeout for handling of requests by the mod_wsgi daemon mode processes.
So for example, if you know requests should never take more than 5 seconds, you could add request-timeout=15 option to WSGIDaemonProcess directive to force a 15 second request timeout.
There is a caveat with that though. When multiple threads are defined for the mod_wsgi daemon mode process, to balance against the fact that the process still has spare capacity in the thread pool, the timeout is applied across the whole set of request handler threads. Thus if you had two threads and one stuck request with no other requests, the effective timeout ends up being 30 seconds, since only after 30 seconds does the cumulative time across all current request handlers, divided by the number of threads, reach 15.
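That averaging can be captured in a back-of-envelope model (this is only a model of the behaviour as described above, not mod_wsgi's actual implementation):

```python
# Model: the restart triggers once the cumulative running time of
# in-flight requests, divided by the total number of threads, reaches
# the configured request-timeout.

def effective_trigger_seconds(request_timeout, total_threads, stuck_threads):
    """Seconds until a restart triggers when `stuck_threads` requests
    hang forever and the remaining threads sit idle."""
    if stuck_threads <= 0:
        return float("inf")   # nothing stuck, the timeout never accumulates
    # stuck_threads * t / total_threads >= request_timeout  =>  solve for t
    return request_timeout * total_threads / stuck_threads
```

So with request-timeout=15 and the two-thread example above, one stuck request trips at 30 seconds; with threads=4 a single stuck request would take 60 seconds.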
Anyway, if you don't think you have long running requests, you could set this because in the event that you do get a stuck request, when the request timeout does eventually trigger and mod_wsgi daemon process forced to restart as a result, it will log Python stack traces for all current requests, so you can see where a request is getting stuck.
That all said, it still doesn't make sense why the Apache worker process would get stuck waiting on the request. On a graceful restart the mod_wsgi daemon process should still get shut down after 5 seconds at most, as Apache doesn't have a concept of graceful shutdown for managed processes such as the mod_wsgi daemon processes. Because they are forced to shut down, the Apache worker process should see the socket as closed immediately and should not have to wait 60 seconds.
Do the Apache logs show an error message about a socket timeout when the Apache worker process does finally shut down? Not sure, but you will likely need logging at a LogLevel of at least info to see one.
Thanks - normal requests are in the 5-30ms range. Occasionally one can block for 2-3 seconds waiting on external systems - API calls, long database queries, etc. But generally rare.
There might be some long / stalled request situation I'm not aware of though, might try some TCP tracing there.
I do sometimes like to use longer-running requests to e.g. do some big data migration via a special admin URL that takes a few minutes. So I'd be wary about setting a hard 15s limit, would be nice to keep longer ones as an option.
But yes, during full-httpd restarts I'd love to insta-kill any requests that haven't completed within 5 seconds or so (should be statistically rare).
Probably stupid of me to not have enabled LogLevel info yet, I'll give that a try tonight!
Related question: would mod_wsgi log a message, under the default LogLevel, when the daemon process's threads=4 were exhausted, e.g. if 4 requests were blocking in Python waiting on an external API call? And would the 5th request just wait in the queue, or throw a 5xx? I think I stress tested that type of situation with time.sleep() request handlers before and it was fine, but can't remember for sure.
On the special URLs which take longer, this post may be interesting.
- http://blog.dscpl.com.au/2014/02/vertically-partitioning-python-web.html
It shows how you could split an application across two daemon process groups and tune each separately based on the requirements for its respective subset of URLs.
As to detecting when the daemon process runs out of capacity, the only way to log something would be to register a callback on events/metrics from mod_wsgi, check when it says it's at capacity, and then log something. I will remind myself of how that stuff works and give you an example.
As to what happens when the daemon is at capacity: requests start queueing on the UNIX socket listener queue. I will have to look at how long that queue is. By default I think it may inherit the default for the Apache main listener socket, but I recollect I allow it to be customised. You can customise it, and also customise connect-timeout at the same time, to more quickly fail requests if you are hitting the capacity limit for an extended time.
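For reference, a sketch of setting those on the directive itself (option names as in the mod_wsgi docs; the values and the group name are purely illustrative):

```apache
# listen-backlog: depth of the UNIX socket listener queue that the
# Apache workers connect through.
# connect-timeout: how long a worker waits for the daemon to accept a
# connection before failing the request.
# queue-timeout: how long a queued request may wait before being failed.
WSGIDaemonProcess example-app threads=4 \
    listen-backlog=100 connect-timeout=10 queue-timeout=10
```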
Interesting article, cool that mod_wsgi supports <Location> based daemon proc targeting. Might use some of those techniques down the line. I've found inactivity-timeout very handy for seldom-used sites/apps like admin tools, staging sites, etc. Reclaims a lot of memory when not in use.
Sounds like WSGI requests over the thread limit just pile up in the proxy / domain socket which is perfectly fine from my side. Should be pretty uncommon to have 4+ blocking.
Traffic's high tonight so going to avoid restarts, but I've got it on my list to:
- See what `LogLevel info` reveals
- Try disabling my app-logging thread on the apps that have it, and see if that cures it
- Maybe set up a traffic-simulation script so I can reproduce the restart delays on a non-production VM
- Try a higher `ServerLimit`
- Try some TCP connect/disconnect tracing during the restart
If you want to try and log each time the mod_wsgi daemon process is consuming all request handler threads, use:
import mod_wsgi

def event_handler(name, **kwargs):
    if name == 'request_started':
        metrics = mod_wsgi.process_metrics()
        if metrics['active_requests'] == mod_wsgi.threads_per_process:
            print("AT CAPACITY")

mod_wsgi.subscribe_events(event_handler)
Play around with the callback to see what name and kwargs are passed, and what mod_wsgi.process_metrics() gives you.
You may be surprised the amount of information one can get. 😁
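Since the `mod_wsgi` module only exists inside Apache, one way to experiment with a handler like this outside the server is to stub the module first - a rough sketch (the stubbed numbers are made up, not what a real process reports):

```python
import sys
import types

# Fake mod_wsgi module so the handler can be exercised outside Apache.
fake = types.SimpleNamespace(threads_per_process=4)
fake.process_metrics = lambda: {"active_requests": 4}  # pretend we're saturated
subscribers = []
fake.subscribe_events = subscribers.append
sys.modules["mod_wsgi"] = fake

import mod_wsgi  # resolves to the stub above

alerts = []

def event_handler(name, **kwargs):
    if name == "request_started":
        metrics = mod_wsgi.process_metrics()
        if metrics["active_requests"] == mod_wsgi.threads_per_process:
            alerts.append("AT CAPACITY")

mod_wsgi.subscribe_events(event_handler)

# Simulate Apache firing the event:
for handler in subscribers:
    handler("request_started", request_id="dummy")
```

This only checks your own handler logic; the real event names and metrics keys should be confirmed by printing them inside Apache as suggested above.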
Some of what is available is mentioned in:
- http://blog.dscpl.com.au/2015/06/implementing-request-monitoring-within.html
but not all. The mod_wsgi.process_metrics() call isn't mentioned, and I'm sure I have added other stuff not mentioned in the post.
I have an interactive workshop about some of the metrics as well. I haven't run it for a while so will need to check it still works and get it hosted.
There is also a mod_wsgi.request_metrics() function which I forgot about, but some of what it gives may require a bit of explanation.
Anyway, one can get all sorts of metrics straight out of mod_wsgi without need of any special application performance metrics (APM) service such as New Relic or DataDog.
If you are able to run a copy of InfluxDB and Grafana it is not very hard to get some quite cute metrics dashboards happening. I have code for capturing the key stuff and reporting it to InfluxDB that you could use. Example shown below.
The metrics stuff has never been properly explained, and it could well be that no one is even using it, which is a bit of a shame.
Cool! Just took a tour through the mod_wsgi python module. Some really useful stuff in there. Especially process_metrics(), plus seeing the actual runtime values of process_group, application_group, utilized threads, etc. Looking forward to adding some limits-monitoring to my standard app configs, just seeing if threads get exhausted like in your example, maybe request latency outlier flagging, etc.
I should probably restrain myself a little, the VMs / apps I maintain are generally super fast (bottle.py helps) and VM CPU utilization's almost always sub-2% even when busy. Still, if something can be 100x faster it should be made so on principle right? 🙂 Regardless thanks for putting so much attention into mod_wsgi perf, it's probably saved the world many gigawatt-hours.
So update on the main restart issue - a few discoveries and a fix:
- Adding `GracefulShutdownTimeout 10` reduces a `sudo systemctl restart httpd` from 60 seconds to about 25; not fantastic, but better
- I was absolutely unable to reproduce the restart delay on an exact duplicate snapshotted VM. I wrote a traffic-simulation script with 15 user threads randomly requesting a realistic sample of wsgi, non-wsgi, and static requests, at random delay intervals, with my logging-thread setup running, etc. (and `/etc/hosts` hacked to route everything to the clone VM), and restarts always completed in under 10 seconds. So I'm stumped - I suspect it's some misbehavior in live-traffic HTTP/TCP clients, not properly closing connections, etc.
- That said, `systemctl reload` actually does precisely what I want for a config reload (e.g. `sudo systemctl reload httpd`). Instant restarts for the kinds of config changes I commonly do: adding & removing VirtualHosts, ServerAliases, etc. The root `httpd` pid stays alive, but all mod_wsgi daemon pids exit & relaunch immediately, with near-zero downtime - I think in part thanks to your app-restart 5-sec delay fixes from the other issue. I think I knew about this config-reload option in earlier SysV-init distros, but Amazon Linux switched over to systemd recently, which has its own way of doing it.
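For anyone wanting to try the same, here's a stripped-down sketch of that kind of traffic simulator (the URLs, weights, and timings are placeholders, not my actual script):

```python
import random
import threading
import time
import urllib.request

# Placeholder endpoint mix - substitute real site URLs and weights.
REQUEST_MIX = [
    ("https://example.test/wsgi-app/", 6),     # dynamic WSGI pages
    ("https://example.test/static/a.css", 3),  # static content
    ("https://example.test/plain.html", 1),    # non-WSGI pages
]

def pick_request(rng):
    """Pick a URL from the mix, weighted by the second tuple element."""
    urls = [u for u, w in REQUEST_MIX for _ in range(w)]
    return rng.choice(urls)

def user_thread(rng, duration=60):
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            urllib.request.urlopen(pick_request(rng), timeout=10).read()
        except Exception:
            pass  # connection refused during a restart is expected
        time.sleep(rng.uniform(0.05, 2.0))  # random think time

if __name__ == "__main__":
    threads = [threading.Thread(target=user_thread, args=(random.Random(i),))
               for i in range(15)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Combined with an `/etc/hosts` override pointing the hostnames at the clone VM, this runs against the test box while you restart httpd on it.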
A little bummed to not have gotten to the root of the issue - e.g. I noticed some connections sitting in CLOSE-WAIT state for a while under ss -tanp | grep httpd - but systemctl reload should totally work for 99% of my needs (short of core mpm reconfigs etc.).
Fine to close this issue if you like, still not 100% clear it's even mod_wsgi specific! Some useful tips in here for future users & AI scrapings.
> maybe request latency outlier flagging
In case you hadn't worked it out somehow, request_metrics() provides time buckets, keeping a count, since the last call, of the number of requests falling in each time bucket.
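As a sketch only - the exact bucket layout request_metrics() returns isn't documented here, so this assumes a simple `{upper_bound_seconds: count}` dict and would need adjusting to whatever the real call produces:

```python
# Hypothetical bucket shape: {upper_bound_seconds: request_count}.
# Summarise total traffic and how many requests landed in slow buckets.

def summarize_buckets(buckets, threshold=1.0):
    """Return (total requests, requests in buckets slower than `threshold`)."""
    total = sum(buckets.values())
    slow = sum(count for bound, count in buckets.items() if bound > threshold)
    return total, slow
```

Something like this could feed the "latency outlier flagging" mentioned above, e.g. logging whenever `slow` is non-zero between calls.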
Best I can provide which may help to explain it right now is:
- https://github.com/GrahamDumpleton/lab-mod-wsgi-metrics/blob/master/exercises/hello-world-v7/metrics.py
That is part of the workshop I spoke of.
I have it up and hosted, but since the hardware I'm running it on is so different to what I used in the past, some of the stuff in the workshop doesn't seem right. If you can drop me a direct email at Gmail to graham.dumpleton I can provide you a URL and password for access if you want to have a play.
I got an empty dict from request_metrics(), although that was a fresh restart (just exploring & printing out what the mod_wsgi module provided). Sounds like it's a histogram which is handy.
I've been having fun with bpftrace for introspecting into these things too, e.g. here's a one-liner that runs for 60 seconds and prints out a log2 histogram of mod_wsgi live request latency server-wide (if I've got your C function semantics right 🙂):
sudo bpftrace -e 'u:/etc/httpd/modules/mod_wsgi.so:wsgi_start_request { @start[tid] = nsecs; } u:/etc/httpd/modules/mod_wsgi.so:wsgi_end_request { @latencyMS = hist((nsecs - @start[tid]) / 1000000); delete(@start[tid]); }' -c '/usr/bin/sleep 60'
Sample run:
@latencyMS:
[1] 21 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4) 6 |@@@@@@@@@@@@@@ |
[4, 8) 1 |@@ |
[8, 16) 18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16, 32) 1 |@@ |
[32, 64) 1 |@@ |
[64, 128) 0 | |
[128, 256) 2 |@@@@ |
Thanks for the pointers on monitoring too, that grafana & InfluxDB combination looks really slick.
The request_metrics() call will return empty on the first call, since data is accumulated from the last time it was called, so the first time there is no already-open time window.