OMD 5.11.20230318-labs-edition seems to freeze/block the livestatus socket
This is unfortunately a little vague at the moment, but it seems that when we put the host PG001 into downtime on 5.11.20230318, naemon ends up locking up or breaking in some way.
Under normal circumstances, watching lsof on the livestatus socket returns this:
Every 2.0s: lsof /omd/sites/default/tmp/run/live OMD002: Wed Mar 29 10:43:51 2023
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
naemon 444013 default 12u unix 0x000000003aced7f1 0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
naemon 444027 default 12u unix 0x000000003aced7f1 0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
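For reference, the header above suggests this was captured with watch at its default 2-second interval, i.e. something like:
OMD[default@OMD002]:~$ watch lsof /omd/sites/default/tmp/run/live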
However, when it is broken (i.e. Thruk is timing out communicating with the socket), lsof shows:
OMD[default@OMD002]:~$ lsof /omd/sites/default/tmp/run/live
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
naemon 348464 default 12u unix 0x00000000106e0b91 0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
naemon 348464 default 19u unix 0x000000003ea4cdc5 0t0 1772423 /omd/sites/default/tmp/run/live type=STREAM
naemon 348477 default 12u unix 0x00000000106e0b91 0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
It looks much the same, except naemon has opened an additional file handle (fd 19) on the socket.
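As a side note, ss can show a bit more about the state of the unix socket than lsof (a sketch, assuming ss from iproute2 is installed; the listening socket's queue counters may reveal whether client connections are piling up unaccepted):
# All unix sockets bound or connected to the live socket, with owning processes
OMD[default@OMD002]:~$ ss -x -p | grep /omd/sites/default/tmp/run/live
# Listening unix sockets only, with Recv-Q/Send-Q counters
OMD[default@OMD002]:~$ ss -x -l | grep /omd/sites/default/tmp/run/live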
Thruk logs the following errors:
[2023/03/29 10:31:06][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.
[2023/03/29 10:31:07][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.
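To rule Thruk out, the socket can be queried directly (a sketch, assuming the unixcat helper shipped with livestatus is available on the site); if this also hangs or returns "Resource temporarily unavailable", the problem is on the naemon/livestatus side:
# Minimal livestatus query; the trailing blank line terminates the request
OMD[default@OMD002]:~$ printf 'GET status\nColumns: program_version program_start\n\n' | unixcat /omd/sites/default/tmp/run/live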
There is nothing significant, and nothing that looks like an error, in naemon.log or livestatus.log.
Our resolution for the problem is:
killall -9 naemon; omd restart naemon
We are not convinced that the downtime action is actually what causes it; it may simply have correlated with the event multiple times.
Hmm,
We've just had this happen again, sporadically, with only the two sockets showing in lsof:
naemon 444013 default 12u unix 0x0000000086c5cff7 0t0 9310587 /omd/sites/default/tmp/run/live type=STREAM
naemon 444027 default 12u unix 0x000000003aced7f1 0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
strace of the two PIDs; one is very busy, the other is not:
@OMD002:~$ sudo strace --attach=444027
strace: Process 444027 attached
restart_syscall(<... resuming interrupted read ...>) = 0
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500) = 0 (Timeout)
kill(444013, 0) = 0
poll([{fd=13, events=POLLIN}], 1, 500^Cstrace: Process 444027 detached
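To see what the quiet process is actually waiting on, two things might be worth trying (a sketch; PID 444027 and fd 13 are taken from the output above):
# Resolve what fd 13 of the polling process points at
@OMD002:~$ sudo ls -l /proc/444027/fd/13
# Re-attach, follow all threads, and limit to the interesting syscalls
@OMD002:~$ sudo strace -f -p 444027 -e trace=poll,accept,read,write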
This might be linked to the recent changes in naemon comment/downtime handling, but it needs more investigation.
Yeah, the other thread is absolutely hammering away with something like this, and it looks like it's the same data over and over:

Do you have any suggestions of how we can gather more information?
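One idea we could try ourselves next time it hangs (assuming gdb is installed and naemon debug symbols are available) is to grab per-thread backtraces from the busy process before killing it:
# Dump backtraces of all threads in the stuck naemon process, then detach
@OMD002:~$ sudo gdb -batch -p 444013 -ex 'thread apply all bt'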
Also, this really doesn't look right:

It looks like the retention file has all the downtimes/comment data duplicated many times ...
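A rough way to quantify that (a sketch; var/naemon/retention.dat is the usual OMD location, adjust if the site stores it elsewhere):
# Count downtime/comment blocks in the retention file; for a single
# downtime on one host these should be small numbers
OMD[default@OMD002]:~$ grep -c '^hostdowntime {' var/naemon/retention.dat
OMD[default@OMD002]:~$ grep -c '^hostcomment {' var/naemon/retention.dat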
There is something going wrong... I just updated the patch, since it wasn't the latest version of that patch anyway. You could try tomorrow's daily. Btw, this is the PR in question: https://github.com/naemon/naemon-core/pull/420