Caddy hangs when php-fpm restarts
Caddy v2.4.6 and likely most/all earlier versions of v2.
I'm using php-fpm as the back-end for processing web requests. php-fpm may occasionally restart its worker processes. If Caddy is under high load when that happens, it won't talk to the new worker processes until after the load stops.
My Caddyfile is pretty straightforward.
{
	##debug
	https_port 10443
	admin off

	# TLS options
	# self-signed requires: apt install libnss3-tools
	local_certs
	auto_https disable_redirects
	ocsp_stapling off
	default_sni localhost

	# Available 2021-06-07: https://github.com/caddyserver/caddy/pull/4153
	skip_install_trust
	order cgi last
}
:10443 {
	## Uncomment for assigned certs
	tls ../users/certs/cert.pem ../users/certs/cert.key

	## Specify the web directory
	root ../www

	## Use PHP (must be rw by caddy user)
	php_fastcgi unix//var/spool/ff/php7.4-fpm.sock

	## Enable security interface
	@sec not path /server/*
	handle @sec {
		rewrite * /log.php
	}

	file_server
}
I run Caddy using: ./caddy run
The stress test (stresser.sh) just spawns 20 GET requests at a time.
#!/bin/bash
count=0
while true ; do
	# Upload with analysis
	curl -k -A 'stresser' \
		'https://localhost:10443/hello.php' > /dev/null 2>&1 &
	((count=$count+1))
	if [ $count -gt 20 ] ; then
		echo "WAIT! $(date)"
		wait
		count=0
	fi
done
The hello.php is nothing more than a hello-world reply.
PHP-FPM's www.conf is configured for 8 static PHP workers. However, workers can be terminated if a request takes too long (i.e., if they hang).
[www]
listen = /var/spool/ff/php7.4-fpm.sock
listen.owner = www-data
listen.group = www-data
pm = static
pm.max_children = 8
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 3
request_terminate_timeout = 60
catch_workers_output = yes
While stresser.sh is running, simulate a PHP worker failure:
killall php-fpm7.4
rm -f /var/spool/ff/php7.4-fpm.pid
/usr/sbin/php-fpm7.4
What happens: Caddy just hangs. It stops calling php.
Stop stresser.sh and wait 2-3 seconds. Then restart stresser.sh. What happens: Caddy works fine.
Caddy appears to not pick up new php worker connections when it's handling incoming HTTP traffic. If the traffic stops -- just for a second or two -- then everything resets and it works fine.
Thanks for the report.
Caddy just hangs. It stops calling php.
Can you be more specific?
Also, I notice you have debug logs commented out. While I'm not sure we have much by way of helpful debug logging in fastcgi specifically, can you please post the logs (with debug enabled) that exhibit the behavior you're describing?
I might also be interested in the goroutine dump from http://localhost:2019/debug/pprof/goroutine?debug=1 and possibly even the full stack dump at http://localhost:2019/debug/pprof/goroutine?debug=2. Maybe attach these as files?
Maybe someone who has PHP installed could try to reproduce and troubleshoot this.
Can you be more specific?
I added to my Caddyfile a handler for some non-PHP code.
respond /server/caddy "Caddy" 200
handle /server/help {
	cgi /* /home/caddy/test/server-help.sh
}
respond /server/* "" 200
When the hang happens (server is under a load):
- Accessing /server/caddy works and returns the string "Caddy". This shows that the server is running.
- Accessing /server/help runs the bash shell script, so non-PHP calls continue to work.
- Accessing anything *.php hangs until the client times out.
I'll work on getting logs from the failure.
Looks like you're using the cgi directive. Is that from this plugin? https://github.com/aksdb/caddy-cgi
Are you sure the problem isn't with that plugin?
Hi @francislavoie,
I added in cgi (aksdb/caddy-cgi/v2) after noticing the PHP hanging problem. I wanted to see if the problem was specific to PHP or whether it impacts anything else.
My testing steps:
- Started with basic Caddy (no add-ons) and PHP-FPM. That's when I saw the hang.
- Added in 'respond /server/caddy "Caddy" 200' to test the server. When PHP hangs, I could still reach the Caddy server. At the same time, from the command-line, I could use cgi-fcgi to directly access PHP-FPM, so PHP-FPM wasn't hanging. This identified the problem as the communication between Caddy and PHP-FPM.
- I downloaded Caddy with the cgi add-on. I wanted to see if Caddy could still make CGI calls when PHP was hanging. The answer is "yes".
Now for the weird thing: I just enabled debugging as @mholt suggested. When debugging is enabled, I cannot get PHP-FPM to hang. But if I disable debugging, then PHP-FPM usually hangs when the server is under stress and PHP-FPM restarts workers. (As a programmer, I hate it when enabling debugging code makes things work.)
At a glance, it sounds like there's a limited amount of connection slots to the backend, and I think they get filled up with connections that will eventually timeout, causing a deadlock. I didn't read too closely in the code to see where that might be happening, but that's my hunch from reading the issue.
@francislavoie It acts like a race condition, so I definitely think you're onto something.
- I cannot get it to hang if debug is enabled.
- I added some timeouts to the php socket connection. So far, it isn't hanging. (I'm hoping that these are breaking the race condition.)
php_fastcgi unix//var/spool/ff/php7.4-fpm.sock {
	dial_timeout 10s
	read_timeout 60s
	write_timeout 60s
}
A followup to my previous comment. When the link from Caddy to php-fpm fails, Caddy returns a 502 error code. We ran a high volume test that usually triggers the hang. And when it hangs, it hangs forever and every subsequent request returns a 502. With the php_fastcgi timeout settings, we've only seen one 502 error (so far) and it recovered.
As a quick sanity check, could you try with a build from the master branch? Or you can grab a build artifact from https://github.com/caddyserver/caddy/actions/runs/1953367431.
The dial_timeout has been lowered to 3s for the upcoming release. Is it any better without setting any timeouts?
@francislavoie According to the docs, the default timeout is 0 (no timeout). I think that might be part of the cause for the hang. When I set it to 10s, I only saw one 502 rather than a sudden 502 flood where every request failed.
However, I don't know if the timeout should be dial_timeout, read_timeout, or write_timeout. (I'm suspecting dial_timeout since the connection from Caddy to PHP-FPM seems to be the cause of the hang.)
I just reduced dial_timeout from 10s to 3s. I'm not sure if this is related, but I'm seeing something odd...
tcp 1 0 127.0.0.1:53920 127.0.0.1:37177 CLOSE_WAIT 42/../caddy/caddy
tcp 0 0 127.0.0.1:37177 127.0.0.1:53920 FIN_WAIT2 -
tcp6 0 0 :::10080 :::* LISTEN 42/../caddy/caddy
tcp6 0 0 :::10443 :::* LISTEN 42/../caddy/caddy
tcp6 0 0 172.17.0.2:10443 172.17.0.1:51202 ESTABLISHED 42/../caddy/caddy [1]
tcp6 0 0 172.17.0.2:10443 172.17.0.1:50150 ESTABLISHED 42/../caddy/caddy [2]
tcp6 0 0 172.17.0.2:10443 172.17.0.1:51236 ESTABLISHED 42/../caddy/caddy [3]
tcp6 0 0 172.17.0.2:10443 172.17.0.1:51184 ESTABLISHED 42/../caddy/caddy [4]
tcp6 0 0 172.17.0.2:10443 172.17.0.1:51226 ESTABLISHED 42/../caddy/caddy [5]
tcp6 0 0 172.17.0.2:10443 172.17.0.1:51206 ESTABLISHED 42/../caddy/caddy [6]
tcp6 0 0 172.17.0.2:10443 172.17.0.1:51242 ESTABLISHED 42/../caddy/caddy [7]
That FIN_WAIT2 has been sitting around for over a minute. Caddy isn't cleaning it up. There, it's gone after 2 minutes. I'm wondering if Caddy sometimes doesn't have time to do housekeeping when it's under a high load?
@francislavoie A few things:
- With debug, it never hangs.
- Without debug and default timeouts (0=none), it hangs under a high load when PHP-FPM cycles workers.
- Without debug and 3s dial_timeout and 60s read/write timeouts: If I force php-fpm to restart (simulating cycling workers) it sometimes hangs for up to 60 seconds.
- Using the master caddy build (caddy_Linux_go1.17_d0b608a), I had to disable my cgi calls (cgi not included). When I kill/restart php-fpm, it now throws a few ERROR messages!
2022/03/09 21:51:56.700 ERROR http.log.error.log1 EOF {"request": {"remote_ip": "172.17.0.1", "remote_port": "39542", "proto": "HTTP/2.0", "method": "POST", "host": "localhost:10443", "uri": "/api.php", "headers": {"User-Agent": ["admin"], "Accept": ["*/*"], "Content-Length": ["53565"], "Content-Type": ["multipart/form-data; boundary=------------------------eb4111fc29100e46"]}, "tls": {"resumed": false, "version": 772, "cipher_suite": 4865, "proto": "h2", "server_name": "localhost"}}, "duration": 1.714576445, "status": 502, "err_id": "peytqmtkt", "err_trace": "reverseproxy.statusError (reverseproxy.go:1115)"}
This is much better since it isn't hanging!
(Thanks for the thorough debugging/troubleshooting! Very interesting. I have been watching this conversation but have been too busy to reply to it. Just wanted to let you know I'm keeping an eye on it. Carry on. 🙂 )
v2.5.0-beta.1 has just been released today, which is nearly the same as the build from master you tried.
So you're saying that the problem is essentially gone with this build now?
I'm not exactly sure which of the 100 commits between 2.4.6 and 2.5.0-beta.1 would have changed this behaviour. https://github.com/caddyserver/caddy/compare/v2.4.6...v2.5.0-beta.1 Maybe you could try doing a bisect to track down exactly where it started behaving better, if you feel like it?
I've been playing with the various timeouts. I think the problem is with both dial_timeout and read_timeout. With caddy v2.4.6, they default to "0" (no timeout).
I set dial_timeout to 10s, write to 30s, and read to 60s. I've seen hangs that resolve after 10s and 60s, but never 30s.
I haven't looked at the php-fpm code, but I suspect that there is a race condition where caddy can be opening a connection to php-fpm while php-fpm is shutting down an old worker. Either that, or there is a race condition where caddy can connect before the worker is fully spun up.
PHP-FPM can also be configured to kill a worker that takes too long. This doesn't seem to terminate caddy's read. With read_timeout 0, it hangs indefinitely (or until caddy kills the connection due to taking too long to run). With a timeout of 60s, it permits caddy to resolve the issue.
I haven't looked at the php-fpm code: Do you know if php-fpm sends an oob signal when a worker is terminated? And if so, does caddy catch this signal?
One other test: I tried this same experiment with nginx + php-fpm and I cannot cause nginx to hang. I think this deadlock is specific to caddy and not limited to php-fpm.
With caddy v2.4.6, they default to "0" (no timeout).
I don't think this was true, I'm pretty sure the dial timeout was 10s, but it wasn't properly documented as such.
But it's true that read timeouts are not enabled by default.
I haven't looked at the php-fpm code: Do you know if php-fpm sends an oob signal when a worker is terminated? And if so, does caddy catch this signal?
I haven't either. I would hope that php-fpm would close the connection when it resets. But I dunno how it behaves.
Btw, we're aware our fastcgi code isn't the best. We've had https://github.com/caddyserver/caddy/issues/3803 open for a while, wanting to do a refactor/rewrite. But it's really tricky to get right; we're not experts on this protocol. There are very few actual fastcgi client implementations in Go that we can use as inspiration.
I'm having a similar issue: when php-fpm reaches its max number of connections (pm.max_children), Caddy shows a 502. It would be helpful if the fastcgi directive had a setting similar to the HTTP transport's max_conns_per_host, so that Caddy would wait for fpm to become available.
@mcfedr in that case you're looking for lb_try_duration. I don't think that's the same problem as this issue as far as I can tell.
@hackerfactor Hey, is this still an issue for you? I am wondering if you could please try the latest commits on master (even newer than v2.6.0-beta.3 if possible) as we've improved both the FastCGI transport and the reverse proxy quite a bit.
I'm gonna close this as inactive, since we didn't get an answer from the reporter. If more information is given and we have more evidence of a problem, we can reopen it.
Chiming in here, I've seen a similar behavior in production, with Caddy latest docker binary + PHP-FPM.
When setting pm.max_requests in FPM to a non-zero value, FPM will respawn a child process after x number of requests have occurred.
While I have this value set to a non-zero value, Caddy starts returning 502 during high load events.
@francislavoie I thought about using the lb_try_duration setting that you mentioned, but the documentation says the lb settings are only for configurations with more than one upstream. Does this setting work for a single upstream as well?
We're running hundreds of servers, but due to infrastructure restriction, each server is 1:1 Caddy+PHP-FPM, single upstream config only.
To reduce the 502 issue in production, I've disabled pm.max_requests for now, and I'm looking into another way to recycle the instances more often via kubernetes configuration to avoid php memory leaks.
Retries do work with single upstreams.
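For anyone landing here later, a minimal sketch of what retries with a single upstream might look like (the socket path and durations are illustrative, not a recommendation):

```
php_fastcgi unix//run/php/php-fpm.sock {
	lb_try_duration 5s
	lb_try_interval 250ms
}
```

With this, when a dial fails (e.g. FPM is restarting or at pm.max_children), Caddy keeps retrying the same upstream every lb_try_interval for up to lb_try_duration before giving up with a 502.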