Exporter unavailable during socket timeouts
Hi,
We have been using this exporter for a while, and I have noticed a pattern where the exporter starts logging errors (log lines below) and then hangs as well. Is there a debug option I could use to track this down further?
I would expect the exporter to stay responsive all the time and report php_fpm_up{socket_path="..."} 0 for the affected socket. Right now we lose all data when a single socket is down.
Using exporter: phpfpm_exporter, version 0.5.0 (branch: HEAD, revision: 9cb855b0e40f98db3bfae60d34d7b87834329310)
Started under supervisor:
command=/opt/prometheus/bin/php-fpm_exporter --web.listen-address=":XXXX"
--phpfpm.status-path=/fpm-status
--phpfpm.socket-directories=/var/lib/php/7.X/fpm/
# we use directories because on some machines we have 200+ sockets for different web sites
Exporter error log (Prague so UTC+2):
2020/10/20 10:14:24 Failed to scrape socket: dial unix /var/lib/php/7.2/fpm/sock.sock: connect: resource temporarily unavailable
...
2020/10/20 10:16:05 Failed to scrape socket: dial unix /var/lib/php/7.2/fpm/sock.sock: connect: resource temporarily unavailable
Screenshots from the Grafana dashboard (Prague, so UTC+2):

Prometheus graph of the up metric for the job (UTC):

Thanks for submitting an issue, @frenkye.
Not aware of any flag you could use to further debug, but I'm not actively developing Go or prometheus exporters.
Since you mentioned you're monitoring a large number of hosts, perhaps you'd need tighter timeouts when opening connections?
https://github.com/tomasen/fcgi_client
func DialTimeout(network, address string, timeout time.Duration) (fcgi *FCGIClient, err error)
If you replace the Dial invocation with DialTimeout and specify a shorter timeout (say 2 or 5 seconds), would that help?
@Lusitaniae Thank you for the tip. I'll give it a look and try this change. 👍
@Lusitaniae I have set up a test environment with a dummy page that calls PHP sleep(10), served via php-fpm with max_children = 1 so I can trace requests.
I changed both Dial calls to DialTimeout with a timeout of 2*time.Second, but it has no effect on the exporter's behavior. When I access the page with the sleep, the exporter waits the full ~10s for my web request to finish before its own request is served.
That seemed weird.
I did some checking via strace:
REQUEST_METHOD="GET" SCRIPT_NAME="/fpm-status" SCRIPT_FILENAME="/fpm-status" QUERY_STRING="full" strace cgi-fcgi -bind -connect /path/to/sock
On my test setup, during the sleep request:
....
socket(AF_UNIX, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/lib/php/7.2/fpm/sock.sock"}, 35) = 0
write(3, "\1\1\0\1\0\10\0\0\0\1\0\0\0\0\0\0", 16) = 16
write(3, "\1\4\0\1\t\\\4\0\f\4QUERY_STRINGfull\v\vSCRI"..., 2416) = 2416
fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
select(4, [3], [3], NULL, NULL) = 1 (out [3])
write(3, "\1\5\0\1\0\0\0\0", 8) = 8
select(4, [3], [], NULL, NULL^Cstrace: Process 8120 detached
<detached ...>
On a server where the target was down and fpm was flooded with requests, with a full backlog:
...
socket(AF_UNIX, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/lib/php/7.2/fpm/sock.sock"}, 47^Cstrace: Process 844 detached
<detached ...>
It seems the problem is not in connecting to the socket but in waiting for the query on the socket to finish, because the backlog accepts the connection until it overflows and starts rejecting connections.
Any idea how to limit the query on the socket to something like 5s, rather than the connection?
Good progress so far.
I had another look at the fcgi client and the net interface https://golang.org/pkg/net/#UnixConn.SetDeadline
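It offers SetDeadline methods (plus SetReadDeadline and SetWriteDeadline) that put a time limit on reads and writes over an already established connection, whether TCP, UDP, or Unix socket.
Perhaps the fcgi_client needs to implement those, which it doesn't at the moment.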
I had a conversation with our dev team and they will have a look at this in a few days. If we find a solution, they will open a PR.