shinken
shinken copied to clipboard
Poller is not picking up the Queued tasks, the Host and Service checks are getting timed out.
Hi All, I'm blocked due to failure with Shinken monitoring.., someone please help me on this... I can provide all information's and logs for sorting out this issue.
with OS upgrade, we are planning to run container based Shinken components for monitoring Windows 2008, 2012 and 2019 nodes. individual containers are been created for all the components say Arbiter, Broker, Poller, Scheduler...etc
Configured Webui and Thruk for UI
All the configurations are good, and connectivity is there, but I could notice that the Hosts and Service checks are getting timed out.., after been in Pending state for quite some time... moreover these checks are not executing on time as well...
Enabled Debug log level and I could see queued up tasks are not getting picked by Poller.. Could not understand the reason behind it..
But when I logged in to the Poller-container and executed the command manually it got executed instantly..
System Details
OS - Alma 9.3 CPU - 4 RAM - 8GB Shinken Version - 2.4.3
Even though the system has sufficient resources and only a few simple services are there to check, still the Poller service is consuming most of the CPU but the tasks are not getting done.
In the Poller log Queued up tasks are not getting picked up
[1710406970] DEBUG: [Shinken] Debug perf: ping [args:5.00679016113e-06] [aqu_lock:0.0][calling:3.09944152832e-06] [json:8.10623168945e-06] [global:1.62124633789e-05]
[1710406970] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406970] DEBUG: [Shinken] Debug perf: get_broks [args:4.00543212891e-05] [aqu_lock:0.000130891799927][calling:5.50746917725e-05] [json:7.86781311035e-06] [global:0.000233888626099]
[1710406971] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.00135803223e-05] [global:1.71661376953e-05]
[1710406971] DEBUG: [Shinken] Debug perf: get_external_commands [args:5.00679016113e-06] [aqu_lock:9.53674316406e-07][calling:2.40802764893e-05] [json:9.77516174316e-06] [global:3.981590271e-05]
[1710406971] DEBUG: [Shinken] Debug perf: ping [args:3.09944152832e-06] [aqu_lock:9.53674316406e-07][calling:1.90734863281e-06] [json:8.10623168945e-06] [global:1.4066696167e-05]
[1710406971] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406971] DEBUG: [Shinken] Debug perf: get_broks [args:5.48362731934e-05] [aqu_lock:0.0001540184021][calling:5.50746917725e-05] [json:1.00135803223e-05] [global:0.000273942947388]
[1710406972] DEBUG: [Shinken] ========================
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:0 (Queued:154 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:1 (Queued:0 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:2 (Queued:4 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:3 (Queued:16 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] I decide to up wait ratio
[1710406972] DEBUG: [Shinken] Wait ratio: 2.040918
[1710406972] DEBUG: [Shinken] [shinken-poller-1] Trying to adjust worker number. Actual number : 4, min per module : 4, max per module : 4
[1710406972] DEBUG: [Shinken] Ask actions to 0, got 0
[1710406972] DEBUG: [Shinken] Loop turn
[1710406972] DEBUG: [Shinken] Debug perf: ping [args:5.00679016113e-06] [aqu_lock:9.53674316406e-07][calling:3.09944152832e-06] [json:8.82148742676e-06] [global:1.78813934326e-05]
[1710406972] DEBUG: [Shinken] Debug perf: get_external_commands [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.59876251221e-05] [json:9.05990600586e-06] [global:4.00543212891e-05]
[1710406972] DEBUG: [Shinken] Debug perf: ping [args:3.81469726562e-06] [aqu_lock:0.0][calling:2.14576721191e-06] [json:1.78813934326e-05] [global:2.38418579102e-05]
[1710406972] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406972] DEBUG: [Shinken] Debug perf: get_broks [args:3.91006469727e-05] [aqu_lock:0.000120878219604][calling:5.50746917725e-05] [json:8.10623168945e-06] [global:0.000223159790039]
[1710406973] DEBUG: [Shinken] Debug perf: ping [args:2.86102294922e-06] [aqu_lock:7.15255737305e-06][calling:1.90734863281e-06] [json:1.00135803223e-05] [global:2.19345092773e-05]
[1710406973] DEBUG: [Shinken] Debug perf: get_external_commands [args:5.96046447754e-06] [aqu_lock:9.53674316406e-07][calling:3.09944152832e-05] [json:1.19209289551e-05] [global:4.98294830322e-05]
[1710406973] DEBUG: [Shinken] Debug perf: ping [args:3.81469726562e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.09672546387e-05] [global:1.78813934326e-05]
[1710406973] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406973] DEBUG: [Shinken] Debug perf: get_broks [args:5.07831573486e-05] [aqu_lock:0.000138998031616][calling:5.50746917725e-05] [json:1.00135803223e-05] [global:0.00025486946106]
[1710406974] DEBUG: [Shinken] ========================
Scheduler DEBUG logs
[1710407544] DEBUG: [Shinken] Debug perf: ping [args:5.00679016113e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.09672546387e-05] [global:1.90734863281e-05]
[1710407544] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710407544] DEBUG: [Shinken] Debug perf: get_broks [args:3.88622283936e-05] [aqu_lock:0.0001380443573][calling:5.19752502441e-05] [json:7.15255737305e-06] [global:0.000236034393311]
[1710407545] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:0.0
2024-03-14 05:12:27,044 DEBG 'shinken-scheduler' stdout output:
][calling:2.86102294922e-06] [json:9.05990600586e-06] [global:1.59740447998e-05]
[1710407545] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407545] DEBUG: [Shinken] Debug perf: get_checks [args:8.58306884766e-05] [aqu_lock:0.00012993812561][calling:4.50611114502e-05] [json:5.96046447754e-06] [global:0.000266790390015]
[1710407545] DEBUG: [Shinken] Debug perf: put_results [args:0.000179052352905] [aqu_lock:0.0][calling:1.4066696167e-05] [json:1.50203704834e-05] [global:0.000208139419556]
[1710407545] DEBUG: [Shinken] Load: (sleep) 1.00 (average: 1.00) -> 0%
[1710407545] DEBUG: [Shinken] Time to send 0 broks (after 0 secs)
[1710407545] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407545] DEBUG: [Shinken] Latency (avg/min/max): 1435.98/1.88/2870.07
[1710407545] DEBUG: [Shinken] Check average = 0 checks/s
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.86102294922e-06] [json:1.00135803223e-05] [global:1.78813934326e-05]
[1710407546] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710407546] DEBUG: [Shinken] Debug perf: get_broks [args:3.69548797607e-05] [aqu_lock:0.000128984451294][calling:4.81605529785e-05] [json:5.96046447754e-06] [global:0.000220060348511]
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.00135803223e-05] [global:1.71661376953e-05]
[1710407546] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407546] DEBUG: [Shinken] Debug perf: get_checks [args:7.89165496826e-05] [aqu_lock:0.000149011611938][calling:4.81605529785e-05] [json:6.91413879395e-06] [global:0.000283002853394]
[1710407546] DEBUG: [Shinken] Debug perf: put_results [args:0.000194072723389] [aqu_lock:9.53674316406e-07][calling:1.50203704834e-05] [json:1.59740447998e-05] [global:0.000226020812988]
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:1.90734863281e-06] [json:1.09672546387e-05] [global:1.78813934326e-05]
[1710407546] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407546] DEBUG: [Shinken] Debug perf: get_checks [args:9.67979431152e-05] [aqu_lock:0.0001380443573][calling:6.60419464111e-05] [json:8.10623168945e-06] [global:0.000308990478516]
[1710407546] DEBUG: [Shinken] Debug perf: put_results [args:0.000180959701538] [aqu_lock:0.0][calling:1.4066696167e-05] [json:1.50203704834e-05] [global:0.000210046768188]
[1710407546] DEBUG: [Shinken] Load: (sleep) 1.00 (average: 1.00) -> 0%
[1710407546] DEBUG: [Shinken] Time to send 0 broks (after 0 secs)
[1710407546] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407546] DEBUG: [Shinken] Latency (avg/min/max): 1435.98/1.88/2870.07
[1710407546] DEBUG: [Shinken] Check average = 0 checks/s
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:2.86102294922e-06] [aqu_lock:1.19209289551e-06][calling:2.86102294922e-06] [json:1.09672546387e-05] [global:1.78813934326e-05]
[1710407546] DEBUG: [Shinken] The arbiter asked me what I manage. It's {0: 128736}
[1710407546] DEBUG: [Shinken] Debug perf: what_i_managed [args:3.81469726562e-06] [aqu_lock:1.19209289551e-06][calling:0.000140905380249] [json:2.09808349609e-05] [global:0.000166893005371]
[1710407547] DEBUG: [Shinken] Debug perf: ping [args:6.91413879395e-06] [aqu_lock:1.90734863281e-06][calling:3.09944152832e-06] [json:1.78813934326e-05] [global:2.98023223877e-05]
[1710407547] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407547] DEBUG: [Shinken] Debug perf: get_checks [args:8.70227813721e-05] [aqu_lock:0.000164985656738][calling:4.81605529785e-05] [json:5.96046447754e-06] [global:0.000306129455566]
[1710407547] DEBUG: [Shinken] Debug perf: ping [args:4.
In the Scheduler log I could see these WARNING
docker logs shinken-scheduler-1 |grep WARNING
[1710405731] WARNING: [Shinken] 4 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405791] WARNING: [Shinken] 4 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405851] WARNING: [Shinken] 4 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405911] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405971] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406031] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406091] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406151] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406211] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406932] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406992] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710407052] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710407112] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710407173] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
Scheduler log
docker logs shinken-scheduler-1 |grep 'Checks: total'
[1710407422] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407423] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407424] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407425] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407426] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407427] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[root@localhost docker-shinken]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3ab294d19b53 thews/shinken-receiver:latest "/bin/sh -c 'supervi…" 2 hours ago Up 2 hours 7773/tcp shinken-receiver-1
cd1b61bf032e thews/shinken-thruk:latest "/bin/sh -c 'supervi…" 2 hours ago Up 2 hours 0.0.0.0:8000->80/tcp, :::8000->80/tcp docker-shinken-shinken-thruk-1
42b05cd36e91 thews/shinken-poller:latest "/bin/sh -c 'supervi…" 2 hours ago Up 2 hours 7771/tcp shinken-poller-1
cbf7ae4065a8 thews/shinken-arbiter:latest "/bin/sh -c 'supervi…" 2 hours ago Up 2 hours 7770/tcp shinken-arbiter-1
33d919544257 thews/shinken-reactionner:latest "/bin/sh -c 'supervi…" 2 hours ago Up 2 hours 7769/tcp shinken-reactionner-1
1b4e9f25fcec thews/shinken-scheduler:latest "/bin/sh -c 'supervi…" 2 hours ago Up 2 hours 7768/tcp shinken-scheduler-1
b91b9d835c0a thews/shinken-broker:latest "/bin/sh -c 'supervi…" 2 hours ago Up 2 hours 7772/tcp, 50000/tcp, 0.0.0.0:8001->7767/tcp, :::8001->7767/tcp shinken-broker-1
docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
3ab294d19b53 shinken-receiver-1 0.80% 42.84MiB / 7.493GiB 0.56% 46.4MB / 42.4MB 0B / 62.7MB 6
cd1b61bf032e docker-shinken-shinken-thruk-1 0.02% 133.7MiB / 7.493GiB 1.74% 301kB / 839kB 0B / 1.2MB 16
42b05cd36e91 shinken-poller-1 390.03% 75.05MiB / 7.493GiB 0.98% 52.3MB / 50.8MB 0B / 66.4MB 51
cbf7ae4065a8 shinken-arbiter-1 0.51% 45.05MiB / 7.493GiB 0.59% 94.3MB / 107MB 0B / 1.77MB 8
33d919544257 shinken-reactionner-1 0.96% 49.39MiB / 7.493GiB 0.64% 64.4MB / 68MB 0B / 71.3MB 10
1b4e9f25fcec shinken-scheduler-1 0.74% 44.74MiB / 7.493GiB 0.58% 70.4MB / 59.4MB 0B / 86.3MB 6
b91b9d835c0a shinken-broker-1 1.19% 77.4MiB / 7.493GiB 1.01% 110MB / 109MB 0B / 76.7MB 13
@geektophe Could you please help me on this.. ;)
I could locate the issue as in, the docker is not support multiprocessing... Poller uses multiprocessing for running worker.. need to figure out how to mitigate this
Workers are running, but work.start() is getting stuck.. from there on no progress