OMD 5.11.20230318-labs-edition seems to freeze/block the livestatus socket
This is unfortunately a little vague at the moment, but it seems that when we put the host PG001 into downtime on 5.11.20230318, naemon ends up locking up or breaking in some way.
Under normal circumstances, watching lsof on the livestatus socket returns this:
Every 2.0s: lsof /omd/sites/default/tmp/run/live OMD002: Wed Mar 29 10:43:51 2023
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
naemon 444013 default 12u unix 0x000000003aced7f1 0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
naemon 444027 default 12u unix 0x000000003aced7f1 0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
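For reference, the header above suggests this was captured with watch at its default 2-second interval, i.e. something like:
OMD[default@OMD002]:~$ watch lsof /omd/sites/default/tmp/run/live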
However, when it is broken (i.e. Thruk is timing out communicating with the socket), lsof shows:
OMD[default@OMD002]:~$ lsof /omd/sites/default/tmp/run/live
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
naemon 348464 default 12u unix 0x00000000106e0b91 0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
naemon 348464 default 19u unix 0x000000003ea4cdc5 0t0 1772423 /omd/sites/default/tmp/run/live type=STREAM
naemon 348477 default 12u unix 0x00000000106e0b91 0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
It looks much the same, except naemon has opened an additional file handle (fd 19) on the socket.
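As a side note, ss can show a bit more about the state of the unix socket than lsof (a sketch, assuming ss from iproute2 is installed; the listening socket's queue counters may reveal whether client connections are piling up unaccepted):
# All unix sockets bound or connected to the live socket, with owning processes
OMD[default@OMD002]:~$ ss -x -p | grep /omd/sites/default/tmp/run/live
# Listening unix sockets only, with Recv-Q/Send-Q counters
OMD[default@OMD002]:~$ ss -x -l | grep /omd/sites/default/tmp/run/live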
Thruk logs the following errors:
[2023/03/29 10:31:06][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.
[2023/03/29 10:31:07][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.
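To rule Thruk out, the socket can be queried directly (a sketch, assuming the unixcat helper shipped with livestatus is available on the site); if this also hangs or returns "Resource temporarily unavailable", the problem is on the naemon/livestatus side:
# Minimal livestatus query; the trailing blank line terminates the request
OMD[default@OMD002]:~$ printf 'GET status\nColumns: program_version program_start\n\n' | unixcat /omd/sites/default/tmp/run/live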
There is nothing significant, and nothing that looks like an error, in naemon.log or livestatus.log.
Our resolution for the problem is:
killall -9 naemon; omd restart naemon
We are not convinced that the downtime action is actually what causes it; it may simply have correlated with the event multiple times.
Hmm,
We've just had this happen again, sporadically, with only the two sockets showing in lsof:
naemon 444013 default 12u unix 0x0000000086c5cff7 0t0 9310587 /omd/sites/default/tmp/run/live type=STREAM
naemon 444027 default 12u unix 0x000000003aced7f1 0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
strace of the two PIDs; one is very busy, the other is not:
@OMD002:~$ sudo strace --attach=444027
strace: Process 444027 attached
restart_syscall(<... resuming interrupted read ...>) = 0
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500^Cstrace: Process 444027 detached
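To see what the quiet process is actually waiting on, two things might be worth trying (a sketch; PID 444027 and fd 13 are taken from the output above):
# Resolve what fd 13 of the polling process points at
@OMD002:~$ sudo ls -l /proc/444027/fd/13
# Re-attach, follow all threads, and limit to the interesting syscalls
@OMD002:~$ sudo strace -f -p 444027 -e trace=poll,accept,read,write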
This might be linked to the recent changes in naemon comment/downtime handling, but it needs more investigation.
Yeah, the other thread is absolutely hammering away with something like this, and it looks like it's the same data over and over:

Do you have any suggestions of how we can gather more information?
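One idea we could try ourselves next time it hangs (assuming gdb is installed and naemon debug symbols are available) is to grab per-thread backtraces from the busy process before killing it:
# Dump backtraces of all threads in the stuck naemon process, then detach
@OMD002:~$ sudo gdb -batch -p 444013 -ex 'thread apply all bt'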
Also, this really doesn't look right:

It looks like the retention file has all the downtimes/comment data duplicated many times ...
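A rough way to quantify that (a sketch; var/naemon/retention.dat is the usual OMD location, adjust if the site stores it elsewhere):
# Count downtime/comment blocks in the retention file; for a single
# downtime on one host these should be small numbers
OMD[default@OMD002]:~$ grep -c '^hostdowntime {' var/naemon/retention.dat
OMD[default@OMD002]:~$ grep -c '^hostcomment {' var/naemon/retention.dat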
There is something going wrong... I just updated the patch, since it wasn't the latest version of that patch anyway. You could try tomorrow's daily. Btw, this is the PR in question: https://github.com/naemon/naemon-core/pull/420