kismatic
Feature: Set up Docker health checks as part of installation
When the Docker engine crashes (due to internal bugs), the worker node gets lost and requires manual intervention. There are health checks available in kops/gce:
- https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/health-monitor.sh
- https://github.com/kubernetes/kops/blob/master/upup/models/nodeup/docker/_systemd/_debian_family/files/opt/kubernetes/helpers/docker-healthcheck
- https://docs.docker.com/engine/admin/live-restore/
It would be great to see something similar with kismatic.
@vasilievip what caused docker to crash?
We have seen this at a client site and would like to share war stories to see if there is a common pattern.
@swade1987, actually, we did not dig into the exact reason. We went over the Docker backlog and decided that Docker has way too many reasons to crash :) We implemented a Python version of this solution: https://github.com/resin-os/healthdog-rs so we would not hit this issue again. My teammate says he can share it, but because https://github.com/apprenda/kismatic/blob/af7d90df1ec6c553ebdd9b948dfa4f5657139096/ansible/roles/packages-docker/tasks/main.yaml#L2 supports more than one platform, he won't be able to provide the integration with kismatic by himself.
@vasilievip that'd be awesome if you could share it, we can see if we can incorporate it somehow.
Here's our solution. It implements systemd watchdog notifications. The systemd unit file should contain these lines:
ExecStart=/usr/local/bin/sd_watchdog.py -c /usr/local/bin/docker-health.sh -- /usr/bin/dockerd -H fd://
WatchdogSec=180
But it is actually a temporary solution, since I think such a feature should be implemented directly in Docker. Currently Docker supports only the ready notification.
It also isn't tied to Docker and can be used with other services, but it might require root privileges, since it spoofs the PID. The sd_pid_notify function actually does the same thing.
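For anyone wondering what the watchdog ping itself looks like, here is a minimal sketch, not the actual sd_watchdog.py. It assumes the usual systemd notification protocol: a datagram containing WATCHDOG=1 sent to the socket named in NOTIFY_SOCKET. The function names sd_notify and check_and_notify are made up for illustration. It sends from its own PID and does not attempt the PID spoofing mentioned above, so whether systemd accepts the message depends on the unit's NotifyAccess setting.

```python
import os
import socket
import subprocess


def sd_notify(state, socket_path=None):
    """Send one notification datagram to systemd; return True if it was sent."""
    path = socket_path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not started by systemd with notifications enabled
    addr = path
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        addr = b"\0" + path[1:].encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
    return True


def check_and_notify(check_cmd, socket_path=None):
    """Run the health check command; ping the watchdog only if it exits with 0."""
    healthy = subprocess.run(check_cmd).returncode == 0
    if healthy:
        sd_notify("WATCHDOG=1", socket_path)
    return healthy
```

A wrapper would call check_and_notify(["/usr/local/bin/docker-health.sh"]) in a loop, at an interval shorter than WatchdogSec (systemd suggests about half). If the check keeps failing, no ping is sent and systemd treats the unit as hung once the timeout expires.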