
check_yum accumulating processes

Open calestyo opened this issue 9 years ago • 5 comments

A user reports via mail: hi! we had an issue with check_yum spawning yum processes without killing them when yum is stuck waiting for a lock. to see this happen, run

yum install isdn4k-utils

(or some other package not installed already) and do NOT answer yes, but leave it waiting.

the timeout code will kill the python script itself after 55 seconds, but the child process will be left behind. we had a server dying due to the lack of memory after a while, since Icinga runs the check every 5 minutes when it is in non-OK state...

the included patch will simply disable check_yum's own signal handler for SIGALRM and then proceed to send SIGALRM to all processes in its process group. this will include the forked nrpe parent, but not nrpe itself. when run interactively in a shell without job control, it may also terminate that interactive shell. I don't think it is worthwhile to complicate the code to avoid that behaviour.

also, have you heard that Google Code is shutting down? would be good to migrate your project to Github or similar. if you have already done so, please update

http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_yum/details

thanks!

calestyo avatar Apr 10 '15 03:04 calestyo

And he sent:

Index: modules/nagios/files/plugins/check_yum_updates
===================================================================
--- modules/nagios/files/plugins/check_yum_updates  (revision 108477)
+++ modules/nagios/files/plugins/check_yum_updates  (revision 108478)
@@ -198,7 +198,11 @@

    def timeout_signal_handler(self, signum, frame):
        """Function to be called by signal.alarm to kill the plugin."""
-       
+
+       # Send SIGALRM to all other processes in process group.
+       signal.signal(signal.SIGALRM, signal.SIG_IGN)
+       os.kill(0, signal.SIGALRM)
+
        end(UNKNOWN, "YUM nagios plugin has self terminated after exceeding the timeout (%s seconds)" % self.timeout)

calestyo avatar Apr 10 '15 04:04 calestyo

Now first to the issue itself: I've tried that but cannot reproduce it. For me (SL 6.6), even when yum install is waiting for Y/N, check_yum just runs through normally.

How exactly do you invoke check_yum? And as which user?

If you look at issue #7, I mention something that upstream implemented (--setopt=exit_on_lock=true) and which could help us with locking issues... but I haven't gotten around to revisiting this yet. It also has the problem that if we simply exit, we cannot report "OK"... and OTOH we don't want non-OK statuses all the time just because yum exited because of a lock.
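To make the trade-off concrete, here is a minimal sketch of how the lock case could be surfaced as a separate, configurable status. It is not what check_yum does today. The helper names (`build_yum_command`, `status_for_returncode`) are hypothetical, and the lock exit code of 200 is an assumption that should be checked against the yum(8) man page for the installed version.

```python
# Nagios plugin exit statuses (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# ASSUMPTION: the exit code yum uses when it cannot take the lock.
# Verify this against yum(8) before relying on it.
ASSUMED_YUM_LOCK_EXIT = 200


def build_yum_command(yum_path="/usr/bin/yum"):
    """Return a check-update command that exits instead of waiting on a lock."""
    return [yum_path, "--setopt=exit_on_lock=true", "check-update"]


def status_for_returncode(returncode, on_lock=UNKNOWN):
    """Map yum's return code to a plugin status for the lock case only.

    Returns the caller-chosen `on_lock` status when yum gave up because of
    the lock, and None otherwise so the normal output parsing can decide.
    """
    if returncode == ASSUMED_YUM_LOCK_EXIT:
        return on_lock
    return None
```

Leaving `on_lock` to the caller keeps the policy question (OK versus non-OK while yum is locked) open instead of hard-coding either answer.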

calestyo avatar Apr 10 '15 04:04 calestyo

(Oh, I could just reproduce your issue... but it only happens when I run check_yum as root.)

calestyo avatar Apr 10 '15 04:04 calestyo

Then to your patch:

  1. I'd probably prefer to simply remove the timeout code from check_yum altogether. I mean, we have plenty of other ways to set a timeout: Icinga/Nagios already have their own timeouts, and there is the timeout(1) program. Users should IMHO simply use the standard GNU tools for that.

  2. I'm a bit reluctant to do this... as far as I understand, the process group could also include further programs that in turn invoke check_yum, and we shouldn't kill those... if anything, we should only kill our own children!?
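One way to kill only our own children is to start yum in a process group of its own and signal that group on timeout. The following is a rough sketch under assumptions, not the plugin's actual code: it assumes Python 3 (`start_new_session` and `communicate(timeout=...)` are not available in Python 2), and `run_with_timeout` is a hypothetical helper name.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout):
    """Run cmd in its own process group and kill that group on timeout.

    Returns a tuple (returncode, output, timed_out).
    """
    # start_new_session=True makes the child a session and process-group
    # leader, so its PID is also the group ID. The group then contains
    # only the child and its descendants, never whatever invoked us.
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
        return proc.returncode, output, False
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited on its own
        output, _ = proc.communicate()  # reap the child, collect output
        return proc.returncode, output, True
```

A call such as `run_with_timeout(["yum", "check-update"], 55)` would then leave nrpe, the shell, and any wrapper scripts untouched. SIGKILL is used here for brevity; sending SIGTERM first and escalating after a grace period would be gentler.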

calestyo avatar Apr 10 '15 04:04 calestyo

Last but not least: yes, I read that Google Code is shutting down when it became public. I had also started the migration, but it always failed, so I opened a ticket at Google. Apparently they have since run it again, and the issues were migrated as well. So I've now moved all references to this site and marked the Google Code site as closed.

calestyo avatar Apr 10 '15 04:04 calestyo