OMD 5.10 shows very different gearman worker status
Hello,
Since our OMD 4.40 -> OMD 5.10 upgrade, we've been seeing occasions where our gearman server appears to have large numbers of running or waiting checks. On investigation, we can see that the behaviour of service checks through gearman is very different under OMD 5.10. To do some diagnostics, we've downgraded one of our OMD boxes to OMD 4.60, but we have "transplanted" the version of mod_gearman_worker-go and the epn into the 4.60 box so that we don't run into https://github.com/ConSol/mod-gearman-worker-go/issues/19. This has the added benefit of exonerating mod_gearman_worker-go, which is nice. I'm leaning towards there being a change in naemon-core.
OMD config:
omd config set GEARMAND on
omd config set GEARMAND_PORT 0.0.0.0:4730
omd config set GEARMAN_WORKER on
omd config set LIVESTATUS_TCP on
omd config set LIVESTATUS_TCP_PORT 6557
omd config set MOD_GEARMAN on
omd config set PNP4NAGIOS gearman
omd config set THRUK_COOKIE_AUTH off
omd config set GRAFANA on
Graph of /omd/sites/default/lib/monitoring-plugins/check_gearman -H OMD101.man.cwserverfarm.local -W 501 -C 750 -w 501 -c 750 where we can see the differing behaviour.
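For readers who want to reproduce the numbers behind that graph without the plugin, here is a minimal sketch. It assumes the gearmand admin `status` command, which prints one tab-separated line per queue (function name, total jobs, running jobs, available workers) followed by a lone dot, and it assumes that "waiting" means total minus running. The function name and the sample values are made up for illustration.

```shell
# Summarise gearmand admin "status" output read from stdin.
# Expected columns (tab-separated): function, total jobs, running jobs, workers.
summarise_gearman_status() {
  awk -F'\t' '
    $0 == "." { next }              # a lone dot ends the status listing
    NF >= 4 {
      waiting = $2 - $3             # assumption: total includes running jobs
      if (waiting < 0) waiting = 0
      total_waiting += waiting
      total_running += $3
    }
    END { printf "waiting=%d running=%d\n", total_waiting, total_running }
  '
}

# Sample input; queue names and numbers are hypothetical.
printf 'service\t620\t120\t150\nhost\t10\t10\t150\n.\n' | summarise_gearman_status
```

Against a live server the input could come from something like `printf 'status\n' | nc -w 1 localhost 4730`, assuming a netcat that supports `-w` and the port configured above.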

The Load Average (not that it means much) is also significantly higher under 5.10.

I'll keep digging and see what else shows up. We did notice that the core scheduling graph also looks "weird" under 5.10 compared to 4.60 (much spikier and not as even), but it's difficult to get a side-by-side comparison on that. I'll see what turns up.
Try disabling embedded Perl in etc/mod-gearman/worker.cfg. I noticed an issue yesterday in the epn connector when the plugin output exceeds 8kb.
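A minimal sketch of that change, assuming the worker config uses the `enable_embedded_perl` option name (worth checking against your own worker.cfg before relying on it). The helper name `disable_embedded_perl` is hypothetical.

```shell
# Hypothetical helper: set enable_embedded_perl=off in a worker config file.
# Assumes the option is spelled enable_embedded_perl in the file.
disable_embedded_perl() {
  cfg="$1"
  tmp="$(mktemp)" || return 1
  # Rewrite an existing (uncommented) setting, leaving all other lines untouched.
  sed 's/^[[:space:]]*enable_embedded_perl[[:space:]]*=.*/enable_embedded_perl=off/' "$cfg" > "$tmp" || return 1
  # Append the setting if the file did not contain an active one.
  grep -q '^enable_embedded_perl=' "$tmp" || echo 'enable_embedded_perl=off' >> "$tmp"
  # Write back through cat so the original file keeps its owner and mode.
  cat "$tmp" > "$cfg" && rm -f "$tmp"
}
```

As the site user this would be called as `disable_embedded_perl etc/mod-gearman/worker.cfg`, followed by a restart of the worker so the setting takes effect.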
Cool, we'll give that a shot on an untouched 5.10.
yeah, but wait till tomorrow, still working on that fix.
Hmm, I disabled embedded perl yesterday (about where the red line is); can't really see a difference so far:

Today's daily looks fine. epn should run much smoother now.
Cool, I'll build one of our boxes onto that and give it a test
Hmm, I would say it doesn't look massively different at "big scale":

On the 1 week scale you can see where we upgraded to the nightly build (red line); it arguably looks a little bit better, maybe?

OK, so we've downgraded one of them to OMD 4.60 as well to see if we can narrow it down. It looks like the change in behaviour is between 4.60 and 5.10.

Could you try the latest OMD daily? It should work quite well now. I also added something in the gearman neb module to flatten out the number of concurrently started checks.
Yep we'll do that on Monday
We've just rolled out omd-5.11.20230314-labs-edition onto one of the servers to test that now
So from what we can see, it seems to be improved but not really back to where it was in 4.60. I think we will have to increase the workers to see if that removes some of the noise and pressure that we're seeing. We also keep getting pnp4nagios errors about the interval between updates being too short (similar to https://github.com/ConSol/omd/pull/156, but for other checks). We have decreased the pnp_gearman_worker down to 1 to eliminate a race condition there and it still happens, so we're thinking that something is running the same check back-to-back, as it were.
This sort of feels to us like checks aren't being run at regular intervals under 5+ (most of our checks are once per minute). We're going to investigate whether we have evidence to support that assertion, but it certainly feels like that's what's going on.
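One way to test that suspicion is to collect the execution timestamps of a single check and look for gaps that are much shorter than its interval. The sketch below is only an illustration: it assumes the timestamps are already available as epoch seconds, one per line (how they are extracted from logs or livestatus is left open), and the helper name `find_short_gaps` is made up.

```shell
# Print every gap between consecutive timestamps (epoch seconds on stdin,
# one per line) that is shorter than the given minimum number of seconds.
find_short_gaps() {
  min_gap="$1"
  sort -n | awk -v min="$min_gap" '
    NR > 1 && ($1 - prev) < min {
      printf "%d -> %d (%ds)\n", prev, $1, $1 - prev
    }
    { prev = $1 }
  '
}

# Hypothetical timestamps for a once-per-minute check with one back-to-back run.
printf '1678800000\n1678800060\n1678800061\n1678800120\n' | find_short_gaps 30
```

For a once-per-minute check, any line printed with a minimum of, say, 30 seconds would point at the same check being executed twice in quick succession.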
So we looked into the naemon suspicions there and have found absolutely no evidence to support the idea that checks are being run more frequently than they should be. We have therefore bumped our thresholds up from 500 to 2500 for the time being whilst we try to ascertain whether the change is actually a problem for gearman/OMD or not.
Btw, the load average might seem to increase if you use check_load in its scaled-by-CPU mode. check_load now has a scaled_load perf counter, and the previous "scaled" metric now reports the absolute (unscaled) load. So it might be that the CPU usage did not increase at all, but check_load now reports different numbers.
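To compare old and new graphs on the same footing, it may help to compute the per-CPU value by hand. This is a minimal sketch assuming that "scaled" means the load average divided by the number of CPUs; the function name `scaled_load` is hypothetical.

```shell
# Divide a load average by the CPU count and print it with two decimals.
scaled_load() {
  awk -v load="$1" -v cpus="$2" 'BEGIN {
    if (cpus <= 0) { print "cpu count must be positive" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", load / cpus
  }'
}

# Example: an absolute load of 8.00 on a 16-CPU box.
scaled_load 8.00 16
```

On Linux the live value could be obtained with `scaled_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`. If the absolute metric rose after the upgrade while this scaled value stayed flat, the difference is in the reporting rather than in the actual load.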