Dispatcher watchdog is always disabled

Open haydenseitz opened this issue 3 years ago • 5 comments

Behaviour

Dispatcher watchdog service (service_watchdog_enabled) is disabled in the local config

Steps to reproduce this issue

  1. Set service_watchdog_enabled to enabled in the global config
  2. Start container with dispatcher enabled
  3. service_watchdog_enabled is set to False

Expected behaviour

The container should follow the configured service_watchdog_enabled setting.

Is there a reason the dispatcher watchdog should be disabled in the container? I'm seeing an issue where my dispatchers lose their connection to my Redis container and the dispatcher completely stops polling. I think the watchdog would help with this, but I'm wondering whether enabling it would have a bigger impact.

haydenseitz avatar Dec 08 '20 20:12 haydenseitz

@haydenseitz The watchdog scheduler is disabled in the Docker image because the polling service is already handled by Docker itself.

crazy-max avatar Dec 10 '20 02:12 crazy-max

Service recovery is not handled by Docker unless there's a container health check. When the poller disconnects from Redis, the polling threads die, but the main dispatcher thread stays alive. The result is a "healthy" container that has stopped polling.

Thoughts on a health check to verify that the service is still polling? If not, I can submit a PR to enable the watchdog.
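
To illustrate the failure mode described above, here is a minimal Python sketch. It is not the actual LibreNMS dispatcher code: `poll_loop`, `alive_pollers`, and `demo` are hypothetical names, and the failing `connect` callable stands in for a lost Redis connection. It only shows that worker threads can exit while the main thread, and therefore the container's main process, keeps running.

```python
import threading


def poll_loop(stop_event, connect):
    """Hypothetical polling thread: it ends as soon as the connection fails."""
    while not stop_event.is_set():
        try:
            connect()
        except ConnectionError:
            return  # the thread exits here and nothing restarts it
        stop_event.wait(0.01)


def alive_pollers(threads):
    """Return the worker threads that are still running."""
    return [t for t in threads if t.is_alive()]


def demo():
    stop = threading.Event()

    def broken_connect():
        # Assumption for illustration: every connection attempt fails.
        raise ConnectionError("redis unreachable")

    workers = [
        threading.Thread(target=poll_loop, args=(stop, broken_connect), name=f"poller-{i}")
        for i in range(2)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join(timeout=1)

    # The main thread is still alive at this point, but no poller is left.
    return threading.main_thread().is_alive(), len(alive_pollers(workers))
```

A process-level check would still report this process as running, which is why the check needs to look at polling activity rather than process liveness.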

haydenseitz avatar Dec 10 '20 03:12 haydenseitz

@haydenseitz

> When the poller disconnects from redis, the polling threads die, but the main dispatcher thread stays alive.

OK, then that's an issue with the dispatcher service itself.

> If not I can submit PR to enable watchdog

I don't think the watchdog is the proper way to handle this for the Docker image, as it relies on a log file. I think a Docker healthcheck instruction for the dispatcher service would be the right enhancement here.

> I'm seeing an issue that my dispatchers are losing connection with my redis container, and the dispatcher completely stops polling.

Do you have some logs?

crazy-max avatar Dec 10 '20 03:12 crazy-max

Is there an existing HTTP endpoint on the dispatcher that reports whether it's healthy? I'm running LibreNMS in Kubernetes and would like either an HTTP endpoint I can hit or a command I can run to check whether the dispatcher is healthy, so it can be restarted if not.

chancez avatar Feb 22 '21 18:02 chancez

@chancez No endpoint that I know of. My way around this is to copy a health check script into the container image. The script runs a SQL query to make sure the dispatcher in question has polled more than X devices in the last poll period.

Here's the SQL query:

    SELECT pc.node_id, devices
    FROM poller_cluster pc
    JOIN poller_cluster_stats pcs ON pc.id = pcs.parent_poller
    WHERE poller_type = 'poller' AND node_id = '$NODE_ID'

where NODE_ID is sourced from the librenms .env file.
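
A rough Python sketch of what such a health check could look like follows. It is not the script referred to above. The function names, the `min_devices` threshold, and the tuple-shaped rows are assumptions, and the `%s` placeholder assumes a MySQL driver that uses the DB-API "format" parameter style (for example PyMySQL or mysqlclient). The caller is expected to supply an open DB-API cursor.

```python
# Same query as above, with the node id passed as a bound parameter
# instead of being interpolated into the string.
QUERY = (
    "SELECT pc.node_id, devices "
    "FROM poller_cluster pc "
    "JOIN poller_cluster_stats pcs ON pc.id = pcs.parent_poller "
    "WHERE poller_type = 'poller' AND node_id = %s"
)


def is_polling(rows, min_devices=0):
    """True if any (node_id, devices) row reports more than min_devices devices."""
    return any(
        devices is not None and int(devices) > min_devices
        for _node_id, devices in rows
    )


def check(cursor, node_id, min_devices=0):
    """Run the query on a DB-API cursor (assumed to return tuples) and evaluate it."""
    cursor.execute(QUERY, (node_id,))
    return is_polling(cursor.fetchall(), min_devices)


def health_exit_code(cursor, node_id, min_devices=0):
    """Map the result to a health check exit status: 0 healthy, 1 unhealthy."""
    return 0 if check(cursor, node_id, min_devices) else 1
```

A wrapper script would open a database connection, read the node ID (for example from an environment variable), and pass the value of `health_exit_code(...)` to `sys.exit`, so that a Docker health check or a Kubernetes exec probe can act on the exit status.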

Somewhat related: I will try to get back to the upstream LibreNMS project and fix the current "watchdog" process so that it counts polled devices in the Python dispatcher instead of watching log files. That would be cleaner and should be suitable to enable in the Docker image.

haydenseitz avatar Mar 03 '21 00:03 haydenseitz