
Please get rid of hard-coded timeouts

Open madduck opened this issue 9 years ago • 10 comments

tl;dr: if a node takes a long time, then hard-coded timeouts take precedence over the --timeout command line option.

Details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786997

madduck avatar Jan 11 '16 08:01 madduck

It looks as if IO::Socket::INET6's Timeout setting is worthless: when Munin can't connect to a dead node, /usr/share/perl5/Munin/Master/Node.pm does not actually emit the message

    if (! $self->{reader}) {
        ERROR "Failed to connect to node $self->{address}:$self->{port}/tcp : $!";
        return 0;
    }

It just doesn't seem to happen.

shallot avatar Feb 06 '16 10:02 shallot

In any event, a defensive approach would definitely be to clamp individual worker timeouts to the global timeout. What would be the use case for allowing any worker to continue past the user-defined timeout?
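This is not the project's code; just a minimal shell sketch of the clamping idea, with hypothetical values:

```shell
#!/usr/bin/env bash
# Clamp a worker's timeout to the user-supplied global --timeout value,
# so no worker can outlive the overall run. Values are hypothetical.
global_timeout=60     # e.g. from the --timeout command line option
worker_timeout=300    # e.g. a previously hard-coded per-worker value
effective=$(( worker_timeout < global_timeout ? worker_timeout : global_timeout ))
echo "effective timeout: ${effective}s"   # prints "effective timeout: 60s"
```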

shallot avatar Feb 06 '16 10:02 shallot

It should also be noted that plugins/node.d/munin_stats and plugins/node.d/munin_update neither autodetect update_rate changes nor allow them to be configured, so after update_rate is changed their warning and critical thresholds are wrong, as is the graph_info text.
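A hypothetical sketch of how such thresholds could be scaled from update_rate instead of assuming the default 300 s interval; the percentages are illustrative, not the plugins' actual values:

```shell
#!/usr/bin/env bash
# Scale runtime warning/critical thresholds from the configured
# update_rate. Percentages are illustrative, not the plugins' real ones.
update_rate=60
warning=$(( update_rate * 80 / 100 ))    # warn at 80% of the interval
critical=$(( update_rate * 95 / 100 ))   # go critical at 95%
echo "warning=${warning} critical=${critical}"   # prints "warning=48 critical=57"
```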

shallot avatar Sep 08 '17 09:09 shallot

With munin-async this gets better: the master munin-update job seems to spend some time on the live nodes and then, about 45 seconds in, discards the various dead nodes en masse, which no longer interferes with a 60 s update_rate.

However, I'm still worried about the timings and will keep testing with more load.

shallot avatar Jul 13 '21 09:07 shallot

OK, so after some time I can say it's better, but still not actually working. Clients whose ssh connections stall for more than 60 s are not timed out after the timeout set in the munin-update argument.

I tried adding this:

    ssh_command "timeout --foreground --preserve-status --verbose --signal=TERM --kill-after=59s 55s ssh"

Unfortunately that just breaks everything immediately.
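For reference, the effect of `--preserve-status` can be demonstrated with a stand-in command; this is plain GNU coreutils timeout(1) behavior, unrelated to Munin itself:

```shell
#!/usr/bin/env bash
# Without --preserve-status, timeout(1) exits 124 when the limit expires;
# with it, timeout propagates the command's own termination status
# (killed by SIGTERM => 128 + 15 = 143).
timeout 1s sleep 5; echo "plain=$?"                     # prints "plain=124"
timeout --preserve-status 1s sleep 5; echo "preserved=$?"  # prints "preserved=143"
```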

Can something be done about this? It fills the log with FATAL messages from overruns whenever even a minuscule percentage of one's nodes is unreliable.

shallot avatar Nov 17 '21 11:11 shallot

I suppose it needs to be said explicitly: Munin/Master/ProcessManager.pm containing hard-coded numbers for the three *timeout variables is clearly wrong given the configurability of update_rate. I don't see any reason not to at least derive these from update_rate, if not make them properly configurable themselves.
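A sketch of the derive-from-update_rate idea; the headroom fractions are hypothetical, only the timeout names come from the later configurable settings:

```shell
#!/usr/bin/env bash
# Derive per-run timeouts from update_rate instead of hard-coding them.
# The 90%/70% fractions are illustrative, not Munin's actual logic.
update_rate=300
timeout_fetch_all_nodes=$(( update_rate * 9 / 10 ))   # leave 10% headroom
timeout_fetch_one_node=$(( update_rate * 7 / 10 ))    # single node must fit well inside
echo "all=${timeout_fetch_all_nodes}s one=${timeout_fetch_one_node}s"   # prints "all=270s one=210s"
```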

This code seems to have been replaced completely in master in commit 5584ba3bea646b3c2d40a5469cad023691563e42 in 2016, but that's still not released.

It was made configurable in commit 4624c9e71c64d6a70835a764d0d88d5a9c5dac6d in 2020; I probably need the Debian 11 version to test this :)

shallot avatar Nov 17 '21 11:11 shallot

ssh -oConnectTimeout=55 ?

niclan avatar Nov 17 '21 12:11 niclan

ssh -oConnectTimeout=55 ?

That's the initial connect timeout, not a session-duration timeout.
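The distinction can be shown with a stand-in for a node that connects fine but then stalls: a ConnectTimeout would never fire here, while a whole-session bound via timeout(1) does.

```shell
#!/usr/bin/env bash
# The "connection" succeeds instantly, then the session hangs; only a
# bound on total runtime catches it (timeout's expiry status is 124).
timeout 2s sh -c 'echo connected; sleep 60'; echo "exit=$?"
# prints: connected
#         exit=124
```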

shallot avatar Nov 17 '21 13:11 shallot

So I've been testing timeout_fetch_all_nodes 50 and timeout_fetch_one_node 40, but it's just not working well. Granted, nothing is hardcoded anymore, but other circumstances in these code paths obstruct it. Some observations:

The cron job is set to start on the dot, but the initial time to start connecting to clients is 10 seconds later. I'm guessing this is the bootstrap time, affected by this:

    % wc -c /var/lib/munin/*.storable | grep total
    129838443 total

Still, 10 seconds to parse 130 MB seems slow.

Then, the max_processes setting naturally stalls the processing. I doubled it from 16 to 32 to try to improve this, but it's still using only something like 2 vCPUs total and 5000 IOPS, and not substantially improving throughput.
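A back-of-envelope lower bound on the wall-clock cost of a fixed worker pool (all numbers hypothetical): nodes are served in ceil(nodes / max_processes) rounds, so raising the pool size only helps until another resource becomes the bottleneck.

```shell
#!/usr/bin/env bash
# Lower bound on update wall time for a fixed-size worker pool.
# All numbers are hypothetical.
nodes=150; max_processes=32; avg_fetch=10   # avg_fetch: seconds per node
rounds=$(( (nodes + max_processes - 1) / max_processes ))   # ceiling division
echo "at least $(( rounds * avg_fetch ))s per update run"   # prints "at least 50s per update run"
```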

Then, the final gap between the end of talking to the last worker and the master update process unlocking is also several seconds, so situations like this happen regularly:

    2021/11/29 09:43:44 [INFO]: Munin-update finished for node adblockplus.org;filter-in-151.adblockplus.org (27.95 sec)
    2021/11/29 09:43:54 [INFO] Remaining workers: adblockplus.org;filter-in-8.adblockplus.org, uplink.eyeo.it;filterlist-crumbs-org-1.uplink.eyeo.it
    2021/11/29 09:43:54 [INFO] Reaping Munin::Master::UpdateWorker<adblockplus.org;filter-in-151.adblockplus.org>. Exit value/signal: 0/0
    2021/11/29 09:44:01 [FATAL ERROR] Lock already exists: /var/run/munin/munin-update.lock. Dying.
    2021/11/29 09:44:01  at /usr/share/perl5/Munin/Master/Update.pm line 127.
    2021/11/29 09:44:02 [INFO]: Munin-update finished (60.68 sec)
    2021/11/29 09:45:01 [INFO]: Starting munin-update

shallot avatar Nov 29 '21 09:11 shallot

Munin 2.0 does indeed hit a time wall at around 100+ nodes.

You might be able to push it to 150, but painfully.

The 2.999 branch should scale much better, provided a pgsql backend is used.

But that means the nodes are async, to make the queries as quick as possible rather than waiting for every plugin to execute.

steveschnepp avatar Dec 07 '21 21:12 steveschnepp