Thruk crashes Nagios running ndo3 when passing commands

Open rickbrowne opened this issue 2 years ago • 5 comments

Describe the bug: Whenever an acknowledgement is sent to a backend running NDO3 instead of ndo2db, it kills the Nagios service, with the following error in the log:

[1627565739] NDO-3: ndo_return = 1 (Commands out of sync; you can't run this command now)
[1627565739] NDO-3: ndo_get_object_id_name2(ndo.c:1312): Unable to store results
[1627565739] Caught SIGSEGV, shutting down...
[1627565744] Caught SIGTERM, shutting down...

Thruk Version: 2.38-2 (NagiosXI 5.8.5)

To Reproduce: Steps to reproduce the behavior:
1. Connect Thruk to a Nagios server running NDO3 via livestatus.
2. Acknowledge any alert.

How does Thruk actually send acknowledgements to the backend Nagios servers? Does it write directly to the nagios.cmd file, or is there another method?

rickbrowne avatar Aug 19 '21 14:08 rickbrowne

Thruk sends commands via livestatus, the same way it fetches all its hosts and services. You should probably file a bug for Nagios; I guess ndo3 is part of their product?

sni avatar Aug 19 '21 19:08 sni
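
For readers unfamiliar with how a command travels over livestatus, here is a minimal sketch. It assumes the standard livestatus `COMMAND [timestamp] ...` line format and the Nagios `ACKNOWLEDGE_SVC_PROBLEM` external command; the function names, the host and service names, and the socket path are hypothetical and not taken from Thruk's code. Thruk itself may build and send the message differently.

```python
import socket
import time


def build_ack_command(host, service, author, comment,
                      sticky=2, notify=1, persistent=1, now=None):
    """Build one livestatus COMMAND line for a service acknowledgement.

    Field order follows the Nagios external command
    ACKNOWLEDGE_SVC_PROBLEM;host;service;sticky;notify;persistent;author;comment
    """
    ts = int(time.time()) if now is None else int(now)
    ext = (f"ACKNOWLEDGE_SVC_PROBLEM;{host};{service};"
           f"{sticky};{notify};{persistent};{author};{comment}")
    return f"COMMAND [{ts}] {ext}\n"


def send_livestatus(message, socket_path):
    """Send a raw message to a livestatus UNIX socket.

    socket_path is an assumption; it depends on the local installation.
    Livestatus sends no response to COMMAND lines.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(message.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical host/service; printing only, nothing is sent here.
    print(build_ack_command("web01", "HTTP", "admin", "looking into it"), end="")
```

Calling `send_livestatus(msg, "/path/to/live")` with the socket path of your own installation would submit the command; the resulting line should also match what appears in thruk.log.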

Thanks for the swift reply

To quote NagiosXI team directly-

ndo2db is our older technology that basically listens on a UNIX socket for database inserts, then handles the actual insertion into the database. It has limits, being that it runs into issues when it tries to insert more than the database can handle. In newer versions (Nagios XI 5.7.0 and later), this was replaced by just writing directly to the database from the Nagios worker threads. In addition to being able to handle more database inserts, this resulted in an overall performance boost, too.

I can definitely see the advantage of that - since currently if we issue more than a few hundred downtimes in Thruk it can kill the Nagios service on some of our boxes

I've filed a bug with NagiosXI and they are asking for the command used so they can look at it further from their side

rickbrowne avatar Aug 20 '21 08:08 rickbrowne

The submitted commands should be logged in your thruk.log logfile.

sni avatar Aug 20 '21 08:08 sni

ok cool, will send through to them

As a side note: have you any plans to find a replacement for livestatus? The last version was released a very long time ago and I'm pretty sure there is no support or interest in maintaining it since the creator released checkmk

rickbrowne avatar Aug 20 '21 08:08 rickbrowne

Right now there are no plans to add support for other connections besides livestatus. NDO is not bidirectional: it does not allow you to send commands, and it's also super slow compared to livestatus. In fact, livestatus was invented because NDO performs so poorly.

sni avatar Aug 24 '21 08:08 sni

Done some more work on this recently trying to get around the bug with NDO3

Running NagiosXI 5.9.3 and Thruk 3.04

On this server if I submit 20 acknowledgments it works fine, but if I go as high as 25 it crashes the backend with the same NDO3 error noted above

One thing I tried this time was taking the commands out of thruk.log and manually adding them all to the nagios.cmd file. This works: it is able to acknowledge 27 alerts at once

So the command structure and the time stamps etc are all fine, but for some reason when Thruk submits the same commands it gets "NDO-3: ndo_return = 1 (Commands out of sync; you can't run this command now)"

(Is this how livestatus does it? Just by writing to the nagios.cmd file?)

rickbrowne avatar Apr 17 '23 13:04 rickbrowne
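
The manual replay described above can be sketched roughly as follows. This is only an illustration of the Nagios external command file format (`[timestamp] COMMAND;arg;arg...`, one command per line); the helper names and the example host are hypothetical, and the location of nagios.cmd depends on the installation, so it is passed in rather than hard-coded.

```python
import time


def format_external_command(name, args, now=None):
    """Format one line for the Nagios external command file."""
    ts = int(time.time()) if now is None else int(now)
    return f"[{ts}] " + ";".join([name, *map(str, args)]) + "\n"


def submit_commands(lines, cmd_file):
    """Write pre-formatted command lines to the command file.

    cmd_file is normally a named pipe created by Nagios, so opening it
    blocks until Nagios is running and reading from the other end.
    """
    with open(cmd_file, "w") as pipe:
        for line in lines:
            pipe.write(line)


if __name__ == "__main__":
    # Hypothetical host; prints the line instead of submitting it.
    print(format_external_command(
        "ACKNOWLEDGE_HOST_PROBLEM",
        ["web01", 2, 1, 1, "admin", "looking into it"]), end="")
```

Replaying a batch from thruk.log would then amount to extracting the logged command lines and passing them to `submit_commands` together with the path of the local nagios.cmd file.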

livestatus writes the commands into the query handler to avoid such side effects.

sni avatar Apr 17 '23 14:04 sni