
Feature: Setup docker healthchecks as part of installation

Open vasilievip opened this issue 6 years ago • 4 comments

When the docker engine crashes (due to internal bugs), the worker node is lost and requires manual intervention. Health checks are available in kops/gce:

  • https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/health-monitor.sh
  • https://github.com/kubernetes/kops/blob/master/upup/models/nodeup/docker/_systemd/_debian_family/files/opt/kubernetes/helpers/docker-healthcheck
  • https://docs.docker.com/engine/admin/live-restore/

It would be great to see something similar with kismatic.

vasilievip avatar Oct 08 '17 07:10 vasilievip

@vasilievip what caused docker to crash?

We have seen this on a client site and would like to share war stories to see if there is a common pattern.

swade1987 avatar Dec 01 '17 10:12 swade1987

@swade1987, actually, we did not dig into the exact reason. We went over the docker backlog and decided that docker has way too many reasons to crash :) To avoid hitting this issue again, we implemented a python version of this solution: https://github.com/resin-os/healthdog-rs. My teammate says he can share it, but because kismatic supports more than one platform (see https://github.com/apprenda/kismatic/blob/af7d90df1ec6c553ebdd9b948dfa4f5657139096/ansible/roles/packages-docker/tasks/main.yaml#L2), he won't be able to provide the integration with kismatic by himself.

vasilievip avatar Dec 01 '17 12:12 vasilievip

@vasilievip that'd be awesome if you could share it, we can see if we can incorporate it somehow.

swade1987 avatar Dec 01 '17 12:12 swade1987

Here's our solution. It implements systemd watchdog notifications. The systemd unit file should contain these lines:

ExecStart=/usr/local/bin/sd_watchdog.py -c /usr/local/bin/docker-health.sh -- /usr/bin/dockerd -H fd://
WatchdogSec=180

But it is actually a temporary solution, since I think such a feature should be implemented directly in docker. Currently docker supports only the ready notification. Also, the script isn't tied to docker and can be used with other services too, but it might require root privileges, since it spoofs the PID. The sd_pid_notify function actually does the same thing.
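
For illustration, the core idea of a watchdog wrapper can be sketched roughly as below. This is not the actual sd_watchdog.py: the names sd_notify and check_and_notify are hypothetical, the sketch skips the PID spoofing entirely, and it assumes the unit accepts notifications from the sending process (for example with NotifyAccess=all). It only shows the "run health check, ping systemd if it passed" step.

```python
import os
import socket
import subprocess


def sd_notify(message, socket_path=None):
    """Send one datagram to the systemd notification socket.

    Returns False when no socket is configured (e.g. not started by systemd).
    """
    path = socket_path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), path)
    return True


def check_and_notify(health_cmd, socket_path=None):
    """Run the health command; ping the watchdog only if it exits with 0."""
    healthy = subprocess.run(health_cmd).returncode == 0
    if healthy:
        sd_notify("WATCHDOG=1", socket_path)
    return healthy
```

A wrapper would call something like check_and_notify(["/usr/local/bin/docker-health.sh"]) in a loop, at an interval comfortably shorter than WatchdogSec. If the health check keeps failing, no ping is sent and systemd treats the service as hung once the timeout expires.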

bacher09 avatar Dec 01 '17 13:12 bacher09