omd OMD 5.10 shows very different gearman worker status

Hello,

Since our OMD 4.40 -> OMD 5.10 upgrade we've been seeing occasions where our gearman server appears to have large numbers of running or waiting checks. On investigation, the behaviour of service checks through gearman is very different under OMD 5.10. To do some diagnostics we've downgraded one of our OMD boxes to OMD 4.60, but we have "transplanted" the version of mod_gearman_worker-go and the epn into the 4.60 box so we're not running into https://github.com/ConSol/mod-gearman-worker-go/issues/19. This has the added benefit of exonerating mod_gearman_worker-go, which is nice. I'm leaning towards there being a change in naemon-core.

OMD config:

    omd config set GEARMAND on
    omd config set GEARMAND_PORT 0.0.0.0:4730
    omd config set GEARMAN_WORKER on
    omd config set LIVESTATUS_TCP on
    omd config set LIVESTATUS_TCP_PORT 6557
    omd config set MOD_GEARMAN on
    omd config set PNP4NAGIOS gearman
    omd config set THRUK_COOKIE_AUTH off
    omd config set GRAFANA on

Graph of /omd/sites/default/lib/monitoring-plugins/check_gearman -H OMD101.man.cwserverfarm.local -W 501 -C 750 -w 501 -c 750 where we can see the differing behaviour.
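
For anyone who wants the raw numbers behind a graph like this without going through the plugin, here is a minimal sketch that asks gearmand directly via its text admin command `status` on the port configured above (4730). The function names `parse_status` and `fetch_status` are made up for this example, and "waiting" is assumed to be total minus running.

```python
import socket


def parse_status(text):
    """Parse the reply to gearmand's text admin command `status`.

    Each line is: function<TAB>total<TAB>running<TAB>available_workers,
    and the reply ends with a line containing a single '.'.
    """
    queues = {}
    for line in text.splitlines():
        if line == ".":
            break
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip anything that is not a status row
        name = parts[0]
        total, running, workers = (int(p) for p in parts[1:])
        queues[name] = {
            "waiting": total - running,  # assumption: queued but not yet running
            "running": running,
            "workers": workers,
        }
    return queues


def fetch_status(host, port=4730, timeout=5.0):
    """Send `status` to gearmand and return the parsed queues."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        buf = b""
        # Read until the terminating ".\n" line (or the peer closes).
        while not (buf == b".\n" or buf.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return parse_status(buf.decode("utf-8", "replace"))
```

Calling `fetch_status("OMD101.man.cwserverfarm.local")` once a minute and logging the `service` queue should give roughly the same picture as the graph, split into waiting and running.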

Mar 02 '23 08:03 infraweavers

The Load Average (not that it means much) is also significantly higher under 5.10.

I'll keep digging and see what else shows up. We did notice that the core scheduling graph also looks "weird" under 5.10 compared to 4.60 (much spikier and not as even), however it's difficult to get a side-by-side comparison on that. I'll see what turns up.

Mar 02 '23 10:03 infraweavers

Try disabling embedded perl in etc/mod-gearman/worker.cfg. I noticed an issue yesterday in the epn connector if the plugin output exceeds 8 KB.

Mar 02 '23 11:03 sni

> try disabling embedded perl in the etc/mod-gearman/worker.cfg. I noticed an issue yesterday in the epn connector if the plugin output exceeds 8kb.

Cool, we'll give that a shot on an un-touched 5.10

Mar 02 '23 15:03 infraweavers

yeah, but wait till tomorrow, still working on that fix.

Mar 02 '23 15:03 sni

Hmm, I disabled embedded perl yesterday (about where the red line is); can't really see a difference so far:

Mar 03 '23 07:03 infraweavers

Today's daily looks fine; epn should run much smoother now.

Mar 03 '23 08:03 sni

Cool, I'll build one of our boxes onto that and give it a test

Mar 03 '23 08:03 infraweavers

Hmm, I would say it doesn't look massively different at "big scale":

On the 1 week scale you can see where we upgraded to the nightly build (red line), it does arguably look a little bit better maybe?

Mar 06 '23 12:03 infraweavers

OK, so we've downgraded one of them to OMD 4.60 as well to see if we can narrow it down. It looks like the change in behaviour is between 4.60 and 5.10.

Mar 09 '23 14:03 infraweavers

Could you try the latest OMD daily? It should work quite well now. I also added something in the gearman NEB module to flatten out the number of concurrently started checks.

Mar 10 '23 16:03 sni

> could you try the latest OMD daily, it should work quite well now. I also added something in the gearman neb module to flatten out the number of concurrent started checks.

Yep we'll do that on Monday

Mar 10 '23 17:03 infraweavers

We've just rolled out omd-5.11.20230314-labs-edition onto one of the servers to test that now

Mar 14 '23 13:03 infraweavers

So from what we can see, it seems to be improved but not really back to where it was in 4.60. I think we will have to increase the workers to see if that removes some of the noise and pressure that we're seeing. We also keep getting pnp4nagios errors about the interval between updates being too short (similar to https://github.com/ConSol/omd/pull/156, but for other checks). We have decreased the pnp_gearman_worker down to 1 to eliminate a race condition there and it still does it, so we're thinking that something is running the same check back-to-back, as it were.

It sort of feels to us that checks aren't being run at regular intervals under 5+ (most of our checks are once per minute). We're going to investigate whether we have evidence to support that assertion, but it certainly feels like that's what's going on.
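
One way to gather that kind of evidence, sketched here under assumptions rather than as the method actually used: poll `last_check` for every service over the Livestatus TCP port enabled above (6557), keep the timestamps per service, and flag gaps that stray too far from the expected 60 seconds. The helper names `fetch_last_checks` and `irregular_gaps` and the 50% tolerance are invented for this example.

```python
import json
import socket

# Livestatus query: one row per service, JSON output, no header row.
QUERY = (
    "GET services\n"
    "Columns: host_name description last_check\n"
    "OutputFormat: json\n"
    "\n"
)


def fetch_last_checks(host, port=6557, timeout=10.0):
    """Return {(host_name, description): last_check} from one Livestatus query."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(QUERY.encode())
        sock.shutdown(socket.SHUT_WR)  # signal end of query, then read to EOF
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    rows = json.loads(b"".join(chunks).decode())
    return {(h, d): int(t) for h, d, t in rows}


def irregular_gaps(timestamps, interval=60, tolerance=0.5):
    """Return (previous, current, gap) for every gap between consecutive
    distinct timestamps that deviates from `interval` by more than
    `tolerance * interval` seconds."""
    ordered = sorted(set(timestamps))
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if abs(gap - interval) > tolerance * interval:
            flagged.append((prev, cur, gap))
    return flagged
```

Polling needs to happen more often than the check interval (say every 10 to 15 seconds), otherwise executions in between are missed and the gaps look longer than they were.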

Mar 17 '23 09:03 infraweavers

So we looked into the naemon suspicions there and have found absolutely no evidence to support the idea that checks are being run more frequently than they should be. So we have bumped our thresholds up from 500 to 2500 for the time being whilst we try to ascertain whether the change is actually a problem for gearman/OMD etc. or not.

Mar 17 '23 16:03 infraweavers

Btw, load average might seem to increase if you use check_load in the scaled-by-CPU mode. check_load now has a scaled_load perf counter, and the previous "scaled" metric is now the absolute value. So it might be that the CPU usage did not increase at all, but check_load now reports different numbers.
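
To make the difference concrete, a small sketch assuming "scaled" simply means the load average divided by the number of CPUs; `scale_load` and `current_loads` are illustrative names, not part of check_load.

```python
import os


def scale_load(loads, cpus):
    """Divide each load average by the CPU count (assumed meaning of 'scaled')."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return tuple(load / cpus for load in loads)


def current_loads():
    """Return (absolute, scaled) 1/5/15-minute load averages for this host."""
    absolute = os.getloadavg()  # Unix only
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    return absolute, scale_load(absolute, cpus)
```

On a 4-core box an absolute load of 8.0 corresponds to a scaled load of 2.0, so a graph that switches from one metric to the other jumps by a factor of four without any change in actual CPU usage.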

Jun 22 '23 14:06 sni
