mod_gearman
mod_gearman copied to clipboard
sending job to gearmand failed since 5.1.2 (still OK in 5.1.0)
Hi,
Could there be a regression between 5.1.0 and 5.1.2 in submitting jobs to gearmand?
I'm using a clean Debian 11 install with the packages from OBS (as you recommend). This is from /var/log/mod-gearman/mod_gearman_neb.log with debug=4
[...]
[2023-08-09 01:15:37][26182][TRACE] move_results_to_core()
[2023-08-09 01:15:37][26182][TRACE] handle_svc_check(6, data)
[2023-08-09 01:15:37][26182][TRACE] handle_perfdata(6)
[2023-08-09 01:15:37][26182][TRACE] handle_svc_check(6, data)
[2023-08-09 01:15:37][26182][TRACE] got target queue from service custom variable: servicegroup_somegroup
[2023-08-09 01:15:37][26182][DEBUG] received job for queue servicegroup_somegroup: some.host.tld - MEMUSED, check_options: 0 latency so far: 0.286s
[2023-08-09 01:15:37][26182][TRACE] cmd_line: /usr/local/mon/libexec/check_nrpe -H 1.2.3.4 -t "15" -p "7070" -c check-memused -a "-" "-" "-" "95" "97"
[2023-08-09 01:15:37][26182][TRACE] add_job_to_queue(servicegroup_somegroup, F5699DEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEF, 1, 1, 1, 0, 60)
[2023-08-09 01:15:37][26182][TRACE] 280 --->type=service
result_queue=check_results
host_name=some.host.tld
service_description=MEMUSED
core_time=1691536537.000288
timeout=120
command_line=/usr/local/mon/libexec/check_nrpe -H 1.2.3.4 -t "15" -p "7070" -c check-memused -a "-" "-" "-" "95" "97"
<---
[2023-08-09 01:15:37][26182][TRACE] 384 +++>
12Py+XD5K<REDACTED>iW
<+++
[2023-08-09 01:15:42][26182][TRACE] create_client()
[2023-08-09 01:15:42][26182][INFO ] add_job_to_queue() retrying... 0
[2023-08-09 01:15:42][26182][TRACE] add_job_to_queue(servicegroup_somegroup, F5699DEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEF, 1, 0, 1, 0, 60)
[2023-08-09 01:15:42][26182][TRACE] 280 --->type=service
result_queue=check_results
host_name=some.host.tld
service_description=MEMUSED
core_time=1691536537.000288
timeout=120
command_line=/usr/local/mon/libexec/check_nrpe -H 1.2.3.4 -t "15" -p "7070" -c check-memused -a "-" "-" "-" "95" "97"
<---
[2023-08-09 01:15:42][26182][TRACE] 384 +++>
12Py+XD5K<REDACTED>iW
<+++
[2023-08-09 01:15:47][26182][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(26182)
[2023-08-09 01:15:47][26182][TRACE] create_client()
[2023-08-09 01:15:47][26182][TRACE] add_job_to_queue() finished with errors: 0
[2023-08-09 01:15:47][26182][ERROR] failed to send service check to gearmand
[2023-08-09 01:15:47][26182][TRACE] move_results_to_core()
[2023-08-09 01:15:47][26182][TRACE] handle_progam_status_data_events(11, data)
[2023-08-09 01:15:47][26182][TRACE] log file rotated: 0
[2023-08-09 01:15:47][26182][TRACE] handle_svc_check(6, data)
[2023-08-09 01:15:47][26182][TRACE] handle_perfdata(6)
[...]
By chance I had the 5.1.0 .deb still in the /var/cache/apt of a different testing VM. (note: this deb is from the labs.consol.de repository as I just recently switched, not sure if that matters).
If I dpkg -i the 5.1.0 one without changing my conffiles, it works again:
[...]
[2023-08-09 15:30:03][36755][TRACE] move_results_to_core()
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check(6, data)
[2023-08-09 15:30:03][36755][TRACE] handle_perfdata(6)
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check(6, data)
[2023-08-09 15:30:03][36755][TRACE] got target queue from service custom variable: servicegroup_somegroup
[2023-08-09 15:30:03][36755][DEBUG] received job for queue servicegroup_somegroup: some.host.tld - SSH, check_options: 0 latency so far: 0.513s
[2023-08-09 15:30:03][36755][TRACE] cmd_line: /usr/local/mon/libexec/check_ssh -t "10" -p "22" 1.2.3.4
[2023-08-09 15:30:03][36755][TRACE] add_job_to_queue(servicegroup_somegroup, 0D169733BEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFE68, 1, 1, 1, 1, 60)
[2023-08-09 15:30:03][36755][TRACE] 227 --->type=service
result_queue=check_results
host_name=some.host.tld
service_description=SSH
core_time=1691587803.000045
timeout=120
command_line=/usr/local/mon/libexec/check_ssh -t "10" -p "22" 1.2.3.4
<---
[2023-08-09 15:30:03][36755][TRACE] 320 +++>
12Py+XD5Kr8<REDACTED>kPddZBa552c6cpucUT0
<+++
[2023-08-09 15:30:03][36755][INFO ] gearmand submission statistics: jobs: 3 errors: 0 submit_rate: 0.0/s avg_submit_duration: 0.000098s max_submit_duration: 0.000145s
[2023-08-09 15:30:03][36755][TRACE] add_job_to_queue() finished successfully
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check() finished successfully
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check() finished successfully -> 206
[2023-08-09 15:30:03][36755][TRACE] got result H:com-dmcore001:47
[2023-08-09 15:30:03][36755][TRACE] 492 +++>
O8eJdV+43h55e<REDACTED>zXhzaSc=
<+++
[2023-08-09 15:30:03][36755][TRACE] 368 --->
type=active
host_name=some.host.tld
service_description=SSH
core_start_time=1691587803.000045
start_time=1691587804.909941
finish_time=1691587804.964980
return_code=0
exited_ok=1
source=Mod-Gearman Worker @ some-worker.tld
output=SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) | time=0.053162s;;;0.000000;10.000000
<---
[2023-08-09 15:30:03][36755][DEBUG] service job completed: some.host.tld SSH: exit 0, latency: 0.518, exec_time: 0.055
[2023-08-09 15:30:04][36755][TRACE] move_results_to_core()
[...]
So, for the record, this is a working combination.
# dpkg -l |grep -e naemon -e gearm
ii gearman-job-server 1.1.19.1+ds-2+b2 amd64 Job server for the Gearman distributed job queue
ii gearman-tools 1.1.19.1+ds-2+b2 amd64 Tools for the Gearman distributed job queue
ii libgearman8:amd64 1.1.19.1+ds-2+b2 amd64 Library providing Gearman client and worker functions
ii libnaemon:amd64 1.4.1-1 amd64 Library for Naemon - common data files
ii mod-gearman-module 5.1.0 amd64 Event broker module to distribute service checks.
ii mod-gearman-tools 5.1.2-1 amd64 Tools for mod-gearman
ii naemon-core 1.4.1-1 amd64 host/service/network monitoring and management system
ii naemon-livestatus 1.4.1-1 amd64 contains the Naemon livestatus eventbroker module
Same on Debian 12 'bookworm' with gearman-job-server 1.1.20 by the way (which is where I noticed the issue first).
I did some manual compiles. 5.1 was still fine for me, 5.1.1 was broken. The problem starts right after this commit:
https://github.com/sni/mod_gearman/commit/87e22207c67e1e3305362b46e183a3130e8925ae
Does it fail immediately?
Yes. 0 jobs arrive in gearmand.
FYI I did a reinstall on Debian 12 today using
ii naemon-core 1.4.2-1 amd64 host/service/network monitoring and management system
ii mod-gearman-module 5.1.2-1 amd64 Event broker module to distribute service checks.
and I got the same result.
I was keeping an old copy of the 5.1 binary to work around it, I'll have to recompile for the new v6 event broker API.
Error: Module '/usr/lib/x86_64-linux-gnu/mod_gearman/mod_gearman_naemon.o' is using an incompatible version (v5) of the event broker API (current version: v6). Module will be unloaded.
Or should I look into something else that could be wrongly configured on my system, as I find it hard to believe no-one else has been using mod-gearman-module on Debian the last few months. :thinking:
or those packages from OBS? I assumed they automatically rebuild if there is an update in naemon-core.
I'm using
$ cat /etc/apt/sources.list.d/naemon-stable.list
deb [signed-by=/etc/apt/trusted.gpg.d/naemon.asc] http://download.opensuse.org/repositories/home:/naemon/Debian_12/ ./
apt show confirms it.
i will look into this. There was a new naemon release some days ago, i assume that's the issue here. NEB modules have to be rebuild against that version. I will trigger a rebuild for mod-gearman.
The build is fine for all I know, there's no NEB error starting naemon after installing. But using those packages still results in the problem described in the issue here: check jobs don't arrive in gearmand.
I only get that NEB v6 error when I try to replace /usr/lib/x86_64-linux-gnu/mod_gearman/mod_gearman_naemon.o with an older 5.1 version that still worked for me on Debian 12 with gearman-job-server 1.1.20+ds-1.
In other words: the issue is still the same with naemon 1.4.2 and the 5.1.2-1 version of mod-gearman-module on Debian 12, fetched from download.opensuse.org/repositories/home:/naemon/Debian_12
At least for me :slightly_smiling_face:
I find it hard to believe no-one else has been using mod-gearman-module on Debian
I use it on several debian 12 machines, but i am using it with OMD-Labs which comes with its own gearmand. Had no issues so far.
I'm currently digging a bit deeper as I found that a plain vanilla install of all packages results in a working configuration.
Still a bit at a loss though as in the failing environment I now have identical naemon-core mod-gearman-module packages with identical
/etc/naemon/naemon.cfg
/etc/naemon/module-conf.d/mod-gearman.cfg
/etc/mod-gearman/module.conf
but no fix. I'll update this issue as soon as I find out more.
Turns out it's because I've been running gearmand for ages with --listen=0.0.0.0 to let remote mod-gearman-worker instances connect.
So I had no reason to suspect anything wrong with the gearmand setup (that I had not touched for a long time) and the mod-gearman-module suddenly stopped working properly with version 5.1.2 / this commit: https://github.com/sni/mod_gearman/commit/87e22207c67e1e3305362b46e183a3130e8925ae
I.e. these errors appeared out of nowhere:
sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available
But by default in the Debian gearman-job-server package, gearmand is started with --listen=localhost, and when I revert to that, version 5.1.2 starts working properly again.
What had me puzzled is that the check_results queue was created and had naemon in it. But the host/notification/service queues etc. were never created because that part of mod-gearman-module 5.1.2 apparently no longer works if gearmand listens on 0.0.0.0 instead of localhost .
5.1.0 still worked with gearmand on 0.0.0.0 though. :shrug:
Anyway, the fix for me is to just remove the --listen=... thing, by default it listens on everything then.
Sorry to bother you, maybe this helps someone else later :slightly_smiling_face: