munin
munin copied to clipboard
Please get rid of hard-coded timeouts
tl;dr: if a node takes a long time, then hard-coded timeouts take precedence over the --timeout command line option.
Details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786997
It looks as if IO::Socket::INET6's Timeout setting is worthless, because if Munin can't connect to a dead node, /usr/share/perl5/Munin/Master/Node.pm will not actually emit the message
if (! $self->{reader} ) {
ERROR "Failed to connect to node $self->{address}:$self->{port}/tcp : $!";
return 0;
}
It just doesn't seem to happen.
In any event, a defensive approach would definitely be to clamp individual worker timeout with the global timeout. What would be the use case of allowing any worker to continue past the user-defined timeout?
It should also be noted that plugins/node.d/munin_stats and plugins/node.d/munin_update don't autodetect or allow update_rate changes to be configured, so their warning and critical thresholds are wrong after it's changed, as well as the graph_info text.
With munin-async, this gets better, because the master munin-update job seems to spend some time on the live nodes, and then about ~45 seconds into it, mass discards various dead nodes, which then doesn't interfere with a 60s update_rate
However, I'm still worried about the timings and will keep testing with more load
OK so after some time, I can say it's better, but not actually working. Clients that stall out ssh connections more than 60s are not actually timed out after the timeout set in the munin-update argument.
I tried adding this:
ssh_command "timeout --foreground --preserve-status --verbose --signal=TERM --kill-after=59s 55s ssh"
Unfortunately that just breaks everything immediately.
Can something be done about this? It's making the log full of FATAL messages because of overruns, when even a miniscule percentage of one's nodes are unreliable.
I suppose it needs to be said explicitly - Munin/Master/ProcessManager.pm containing hardcoded numbers for the 3 *timeout variables is clearly wrong given the configurability of update_rate, I don't see any reason why not to fix these to at least be derived from update_rate, if not properly configurable themselves.
This code seems to have been replaced completely in master in commit 5584ba3bea646b3c2d40a5469cad023691563e42 in 2016 but that's still not released
It was made configurable in commit 4624c9e71c64d6a70835a764d0d88d5a9c5dac6d in 2020, probably need the Debian 11 version to test this :)
ssh -oConnectTimeout=55 ?
ssh -oConnectTimeout=55 ?
That's the initial connect timeout, not the session duration timeout
So I've been testing timeout_fetch_all_nodes 50 and timeout_fetch_one_node 40, but it's just not working well. Granted it's not hardcoded, but there's other circumstances in these code paths that obstruct it. Some observations:
The cron job is set to start on the dot, but the initial time to start connecting to clients is 10 seconds later. Guessing this is the bootstrap time, affected by this:
% wc -c /var/lib/munin/*.storable | grep total 129838443 total
Still, 10 seconds to parse 130 MB seems slow.
Then, the max_processes setting naturally stalls out the processing. I doubled it from 16 to 32 to try to improve this, but it's still using something like 2 vCPUs total and 5000 IOPS, and not actually substantially improving throughput much.
Then, the final time between the end of talking to the last worker and the master update process unlocking is also several seconds, so situations like these happen regularly:
2021/11/29 09:43:44 [INFO]: Munin-update finished for node adblockplus.org;filter-in-151.adblockplus.org (27.95 sec) 2021/11/29 09:43:54 [INFO] Remaining workers: adblockplus.org;filter-in-8.adblockplus.org, uplink.eyeo.it;filterlist-crumbs-org-1.uplink.eyeo.it 2021/11/29 09:43:54 [INFO] Reaping Munin::Master::UpdateWorker<adblockplus.org;filter-in-151.adblockplus.org>. Exit value/signal: 0/0 2021/11/29 09:44:01 [FATAL ERROR] Lock already exists: /var/run/munin/munin-update.lock. Dying. 2021/11/29 09:44:01 at /usr/share/perl5/Munin/Master/Update.pm line 127. 2021/11/29 09:44:02 [INFO]: Munin-update finished (60.68 sec) 2021/11/29 09:45:01 [INFO]: Starting munin-update
Munin 2.0 does indeed hit a time wall at around 100+ nodes.
You might be able to push it to 150, but painfully.
The 2.999 branch should scale much better, provided a pgsql backend.
But that means that the nodes are async, to make the queries as quick as possible, and not waiting for every plugin to execute.