munin-async spoolfetch should send data earlier
It seems that munin-async --spoolfetch on the client first reads all the data and only then sends it over the wire. This is not only memory-intensive, it also becomes a problem when the reading takes longer than 180 seconds and the socket times out.
I just noticed this with a particularly slow host that had been cut off for three days while munin-asyncd kept collecting. In this case, munin-async cannot finish reading all the data before the master, having seen no data, kills the connection.
I'm hitting that issue on the version shipped by Debian (between 2.0.25 and 2.0.49).
I first tried something quick and dirty: replacing the string concatenation in the SpoolReader with a simple print here
https://github.com/munin-monitoring/munin/blob/5889ee366e6de473d52d862cf24008e8c651996c/lib/Munin/Node/SpoolReader.pm#L96-L98
hoping that the data would be spat out in batches (I also added a select()->flush() to avoid buffering too much). It works when I connect manually, and I get data sooner, but it doesn't seem to affect the timeout on the master side.
I then dug into the master and found this timeout
https://github.com/munin-monitoring/munin/blob/stable-2.0/master/lib/Munin/Master/Node.pm#L647
Interestingly, the io_timeout is 120, but the munin-update runs always time out at 180s, even when setting io_timeout to 1200 or hardcoding it there.
I added some debugging and realised that some of the plugins on the remote host output a lot of data (up to 11MB), and that some of that data is received in a timely manner. The whole fetch, though, takes longer than some 180s timeout.
```
[ERROR] Error in node communication with node/ssh://[email protected]:22 -6:4949: [FATAL] Socket read timed out to node. Terminating process. at /usr/share/perl5/Munin/Master/UpdateWorker.pm line 381.
```
coming from here https://github.com/munin-monitoring/munin/blob/stable-2.0/master/lib/Munin/Master/UpdateWorker.pm#L381
So the timeout seems to be a Master issue, rather than some buffering issue on the Node/Async side.
It looks like _node_read_single has been reimplemented in master, and got rid of the do_with_timeout call.
https://github.com/munin-monitoring/munin/blob/master/lib/Munin/Master/Node.pm#L422-L458
I tried simply replacing the function altogether and... it works fine. Until it gets killed by the 180s timeout again. I suspect some old do_with_timeout doesn't cancel its own timeout properly. Or maybe it's just some other timeout somewhere in the config...
Ok, this seems to be capped by those two timeouts in the ProcessManager
https://github.com/munin-monitoring/munin/blob/stable-2.0/master/lib/Munin/Master/ProcessManager.pm#L41-L42
where the worker_timeout needs to be lower than the timeout. Unfortunately, they are not configurable.
Looks like master has gotten rid of this logic altogether, in favour of Parallel::ForkManager. https://github.com/munin-monitoring/munin/commit/5584ba3bea646b3c2d40a5469cad023691563e42
Looks like I'm gonna need to monkey-patch my local install to have higher timeouts.
That said, I'm confused about the versioning of Munin altogether. Debian and other distros still seem to be on stable-2.0, while master has been moving forward for a while, but it's not available as packages. Are the versions and expectations documented somewhere?
> I first tried something quick and dirty, replacing the string concatenation in the SpoolReader to a simple
Ok, don't do that, it's a terrible idea (:
The data gets all jumbled in confusing ways when running plugins in parallel :laughing:
Interestingly, it should only send at most 5 samples per plugin, due to d6bfe63c6d60922ae363db3c5a87e304916b9bde.
While I still haven't got to the bottom of this issue, I found a workaround for when munin gets stuck and can't get more data. The date of the last spoolfetch is stored in a Perl storable, for each monitored node.
Using storableedit, it is possible to edit those, and set the date to a more recent one. Munin then just gracefully picks up from there.
This is done (on Debianites) as follows, having dropped some version of storableedit in my $PATH:
```
[23:53:54] @munin-master ~munin$ date +%s
1586181240
[23:54:00] @munin-master ~munin$ sudo -u munin `which storableedit-1.5.pl` /var/lib/munin/state-munin-node1.storable
[sudo] password for shtrom:
Opening file: /var/lib/munin/state-munin-node1.storable
spoolfetch => 1586180925
value => HASH(0x5634139ea6f8)
/ (HASH)> $cur->{spoolfetch}=1586181240
/ (HASH)> x
File saved: /var/lib/munin/state-munin-node1.storable
```
Thanks for sharing your experiences and experiments!
I would summarize the situation as follows:
- `munin-async` fails to transfer data if the amount of data exceeds the currently configured timeouts (here 180s)
- in this case `munin-async` is stuck forever (trying again and again without success)
- `storableedit` can be used as a manual workaround for such a situation
I can imagine the following approaches for fixing this issue:
- A) somehow switch to a chunk-based incremental transfer
- B) allow configurable timeouts or maybe adapt the timeout to the timestamp of the most recent transfer (older -> tolerate longer connections; but this would be just non-generic playing with numbers)
- C) follow the approach used in `master` (in case it is better)
What do you think?
That's about it, but I'm not sure about the "fails to transfer data" part. I haven't been able to confirm or disprove what was happening there. It just gets stuck, but I don't really know why. Moreover, my experiments above trying to remove the buffering did not actually seem to help, so I don't really know where the blocker is...
It seems that going with what master has should be the first port of call. Is this the approach you mentioned in https://github.com/munin-monitoring/munin/issues/619#issuecomment-554944924?
Workaround with storableedit didn't work for me, but `rm /var/lib/munin-async/*` did the trick.