munin icon indicating copy to clipboard operation
munin copied to clipboard

munin-async spoolfetch should send data earlier

Open madduck opened this issue 9 years ago • 9 comments
trafficstars

It seems that munin-async --spoolfetch on the client first reads all data and only then sends it over the wire. This is not only memory-intensive, it's also a problem when the reading takes longer than 180 seconds and the socket times out.

I just noticed this in the case of a particularly slow host which was cut off for three days, while munin-asyncd kept collecting. munin-async in this case is unable to read in all the data before the master kills the connection without seeing any data.

madduck avatar Jan 12 '16 09:01 madduck

I'm hitting that issue on the version shipped by Debian (between 2.0.25 and 2.0.49).

I first tried something quick and dirty, replacing the string concatenation in the SpoolReader to a simple print here https://github.com/munin-monitoring/munin/blob/5889ee366e6de473d52d862cf24008e8c651996c/lib/Munin/Node/SpoolReader.pm#L96-L98 , hoping that the data would be spat out in batches (I also added a select()->flush() to avoid buffering too much. It works when I connect manually, and I get data sooner, but it doesn't seem to impact the timeout on the master side.

I then dug into the master and found this timeout https://github.com/munin-monitoring/munin/blob/stable-2.0/master/lib/Munin/Master/Node.pm#L647 Interestingly, the io_timeout is 120, but the munin-update runs always timeout at 180s. Even when setting the io_timeout to 1200, or hardcoding it there.

I added some debug, and realised that some of the plugins on the remote output a lot of data (up to 11MB), and that some of that data is received in a timely manner. The whole fetch, though, takes longer than some 180s timeout.

[ERROR] Error in node communication with node/ssh://[email protected]:22 -6:4949: [FATAL] Socket read timed out to node.  Terminating process. at /usr/share/perl5/Munin/Master/UpdateWorker.pm line 381.

coming from here https://github.com/munin-monitoring/munin/blob/stable-2.0/master/lib/Munin/Master/UpdateWorker.pm#L381

So the timeout seems to be a Master issue, rather than some buffering issue on the Node/Async side.

shtrom avatar Oct 07 '19 09:10 shtrom

It looks like _node_read_single has been reimplemented in master, and got rid of the do_with_timeout call. https://github.com/munin-monitoring/munin/blob/master/lib/Munin/Master/Node.pm#L422-L458

I tried simply replacing the function altogether and... it works fine. Until it gets killed by the 180s timeout again. I suspect some old do_with_timeout doesn't cancel its own timeout properly. Or maybe it's just some other timeout somewhere in the config...

shtrom avatar Oct 07 '19 09:10 shtrom

Ok, this seems to be capped by those two timeouts in the ProcessManager https://github.com/munin-monitoring/munin/blob/stable-2.0/master/lib/Munin/Master/ProcessManager.pm#L41-L42 where the worker_timeout needs to be lower than the timeout. Unfortunately, they are not configurable.

Looks like master has gotten rid of this logic altogether, in favour of Parallel::ForkManager. https://github.com/munin-monitoring/munin/commit/5584ba3bea646b3c2d40a5469cad023691563e42

Looks like I'm gonna need to monkey-patch my local install to have higher timeouts.

That said, I'm confused about the versioning of Munin altogether. Debian and other distros still seem to be on stable-2.0, while master has been moving forwards for a while, but it's not available as packages. Are the versions and expectation documented somewhere?

shtrom avatar Oct 07 '19 10:10 shtrom

I first tried something quick and dirty, replacing the string concatenation in the SpoolReader to a simple print here https://github.com/munin-monitoring/munin/blob/5889ee366e6de473d52d862cf24008e8c651996c/lib/Munin/Node/SpoolReader.pm#L96-L98

Ok, don't do that, it's a terrible idea (:

The data gets all jumbled in confusing ways when running plugins in parallel :laughing:

shtrom avatar Oct 07 '19 11:10 shtrom

Interestingly it should only send 5 samples per plugin as maximum due to d6bfe63c6d60922ae363db3c5a87e304916b9bde.

steveschnepp avatar Nov 18 '19 10:11 steveschnepp

While I still haven't got to the bottom of this issue, I found a workaround for when munin gets stuck and can't get more data. The date of the last spoolfetch is stored in a Perl storable, for each monitored node.

Using storableedit, it is possible to edit those, and set the date to a more recent one. Munin then just gracefully picks up from there.

This is done (on Debianites) as follows (having dropped some version of storableedit in my $PATH.

[23:53:54] @munin-master ~munin$ date +%s                                                                                                                                                                                                 130 ↵
1586181240
[23:54:00] @munin-master ~munin$ sudo -u munin `which storableedit-1.5.pl` /var/lib/munin/state-munin-node1.storable                                                                                                               0s
[sudo] password for shtrom:
Opening file: /var/lib/munin/state-munin-node1.storable
  spoolfetch => 1586180925
  value => HASH(0x5634139ea6f8)

/ (HASH)> $cur->{spoolfetch}=1586181240

/ (HASH)> x
File saved: /var/lib/munin/state-munin-node1.storable

shtrom avatar Apr 06 '20 13:04 shtrom

Thanks for sharing your experiences and experiments!

I would summarize the situation as follows:

  • munin-async fails to transfer data, if the amount of data exceeds the currently configured timeouts (here 180s)
  • in this case munin-async is stuck forever (trying again and again without success)
  • storableedit can be used as a manual workaround for such a situation

I can imagine the following approaches for fixing this issue:

  • A) somehow switch to a chunk-based incremental transfer
  • B) allow configurable timeouts or maybe adapt the timeout to the timestamp of the most recent transfer (older -> tolerate longer connections; but this would be just non-generic playing with numbers)
  • C) follow the approach used in master (in case it is better)

What do you think?

sumpfralle avatar Apr 06 '20 15:04 sumpfralle

That's about it, but I'm not sure about the fails to transfer data. I haven't been able to confirm or disprove what was happening there. It just gets stuck, but I don't really know why. Moreover, my experiments above trying to remove the buffering did not actually seem to help, so I don't really know where the blocker is...

It seems that going with what master has should be the first port of call. Is this the approach you mentioned in https://github.com/munin-monitoring/munin/issues/619#issuecomment-554944924?

shtrom avatar Apr 07 '20 07:04 shtrom

Workaround with storableedit didn't work for me, but rm /var/lib/munin-async/* did the trick.

dimaslv avatar Jan 25 '22 20:01 dimaslv