Verify bootstrap success and instance health
Instances sometimes fail to bootstrap:
- Start hooks can fail to download [1]
- SSH public keys can fail to download
Instances sometimes become unhealthy in ways that aren't measured by our health checks:
- Certain Docker commands hang forever [2]
- Abusive jobs can hog CPU (to be addressed in https://github.com/travis-ci/worker/pull/366)
We need a way to ensure instance health at bootstrap and on an ongoing basis. I'd like to use this issue as a place to brainstorm on design. (If a similar issue already exists somewhere, please point me to it!)
I think if I needed such a check on a bunch of my own servers, I'd use an approach like the following:
- Create a /tmp/health directory
- Make the cloud-init script write results to this directory, e.g. /tmp/health/cloud-init.ok if everything completed successfully, /tmp/health/cloud-init.nok if any errors were encountered
- Use a cron job to occasionally check the status of required services (docker, travis-worker) and take appropriate action (e.g. restarting Docker, imploding the instance)
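A minimal sketch of the marker-file idea above, assuming the bootstrap steps can be wrapped in a single command. The record_result helper is a hypothetical name, not something in the existing cloud-init script, and HEALTH_DIR simply defaults to the /tmp/health directory suggested above.

```shell
#!/bin/bash
# Hypothetical helper: record the outcome of a bootstrap step as a marker file.
# HEALTH_DIR defaults to the /tmp/health directory proposed in this issue.
HEALTH_DIR="${HEALTH_DIR:-/tmp/health}"

# record_result NAME COMMAND [ARGS...]
# Runs COMMAND and writes NAME.ok on success or NAME.nok on failure.
record_result() {
  local name="$1"
  shift
  mkdir -p "${HEALTH_DIR}"
  # Clear stale markers left by an earlier run.
  rm -f "${HEALTH_DIR}/${name}.ok" "${HEALTH_DIR}/${name}.nok"
  if "$@"; then
    date -u +%Y-%m-%dT%H:%M:%SZ > "${HEALTH_DIR}/${name}.ok"
  else
    local status=$?
    echo "exit status ${status}" > "${HEALTH_DIR}/${name}.nok"
    return "${status}"
  fi
}
```

The cloud-init script could then wrap its work in a call such as record_result cloud-init /path/to/bootstrap-steps.sh (the path is a placeholder), and the cron job would only need to look for cloud-init.nok.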
One problem: the only way I know to confirm that docker isn't working as expected is to try a command, e.g. docker ps, and observe that it just hangs forever. I'm not sure how to check this in a script without making the script hang forever, too. Maybe we could:
- run docker ps &, wait a few seconds, then check if a process with that PID is still running?
- check the modification date on the docker log file?
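The first option might look like the sketch below. It assumes the check runs as root (for example from cron), so sudo is not needed and kill -0 is permitted to probe the process. stalled_after is a hypothetical helper name.

```shell
#!/bin/bash
# stalled_after SECONDS COMMAND [ARGS...]
# Starts COMMAND in the background, waits SECONDS, and returns 0 if the
# command is still running (stalled) or 1 if it has already exited.
stalled_after() {
  local wait_secs="$1"
  shift
  "$@" > /dev/null 2>&1 &
  local pid=$!
  sleep "${wait_secs}"
  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  # It needs permission to signal the process, hence the root assumption.
  if kill -0 "${pid}" 2> /dev/null; then
    # Still alive: try to stop it so the check does not leak processes.
    kill "${pid}" 2> /dev/null || true
    wait "${pid}" 2> /dev/null || true
    return 0
  fi
  return 1
}
```

A caller could use it as: if stalled_after 3 docker ps; then echo "[NOK] docker ps is stalled"; fi. There is a small window in which the PID could be reused by another process, which seems acceptable for a coarse health check.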
Thoughts?
I would love to add another layer of health monitoring. Based on what you're writing above, I believe this would be specific to the instances brought up in autoscaling groups on EC2, and that we would add such a layer of health monitoring at the terraform-config level.
With regard to the specific case of docker ps not coming back, I am in favor of treating a timeout as a failure condition and imploding the host. I think it's great if we can do this with bash, but I'm also happy to use a different programming language that's already present on the system, such as python.
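One way to treat a timeout as a failure in bash is the coreutils timeout command, assuming it is installed on the instances. The sketch below is not taken from terraform-config, and check_with_timeout is a hypothetical helper. It relies on GNU timeout exiting with 124 when the time limit is hit, or 137 when the follow-up KILL was needed.

```shell
#!/bin/bash
# check_with_timeout SECONDS COMMAND [ARGS...]
# Prints "ok", "timeout", or "failed (exit N)" and returns COMMAND's status
# as reported by timeout (0 on success).
check_with_timeout() {
  local limit="$1"
  shift
  local status=0
  # Send TERM after ${limit} seconds, then KILL 2 seconds later if needed.
  timeout -k 2 "${limit}" "$@" > /dev/null 2>&1 || status=$?
  case "${status}" in
    0) echo "ok" ;;
    124 | 137) echo "timeout" ;;
    *) echo "failed (exit ${status})" ;;
  esac
  return "${status}"
}
```

A caller could run check_with_timeout 10 docker ps and implode the host on a non-zero return. How the implosion is triggered is left out here, since it depends on the autoscaling setup.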
I think that some of the other health checks mentioned could be integrated into the worker "prestart hook" script, too, so that workers that are unhealthy never come into service: https://github.com/travis-infrastructure/terraform-config/blob/master/modules/aws_asg/prestart-hook.bash
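As a rough sketch of what such a prestart check could look like, the hook could refuse to start the worker when bootstrap artifacts such as the start hook or SSH public keys are missing or empty. This is not the content of the linked prestart-hook.bash, and require_files is a hypothetical helper.

```shell
#!/bin/bash
# require_files FILE...
# Returns 0 only if every FILE exists and is non-empty; reports each
# problem on stderr so it ends up in the hook's log output.
require_files() {
  local missing=0
  local path
  for path in "$@"; do
    if [ ! -s "${path}" ]; then
      echo "[NOK] missing or empty: ${path}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

The hook would call it with whatever paths the bootstrap is expected to produce and exit non-zero on failure, so an instance with a failed download never comes into service.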
As a temporary workaround to locate dead workers, here's a script I'm using:
#!/bin/bash
# Usage:
# get instance ips: $HOME/git/travis/bin/private-ec2-ips.sh > ips.txt
# Then, from a bastion:
# parallel-scp -h ips.txt check-health.sh /tmp/check-health.sh
# parallel-ssh -x '-tt' -O RequestTTY=force -h ips.txt -o outdir -e errdir bash -c /tmp/check-health.sh
# grep NOK outdir/*
# In some cases, the services can be successfully restarted via SSH to the instance:
# service travis-worker restart
# sudo restart docker
my_ip="$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)"
# Sometimes, the docker service will be running, but certain commands (docker ps) will hang indefinitely.
sudo docker ps > /dev/null 2>&1 &
sleep 3
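jobs=$(jobs -l)
# Match only jobs that are still running. A job that exited with an error is
# listed as "Exit N" rather than "Done", so filtering out "Done" alone would
# misreport a fast failure as a stall.
docker_ps_pid=$(echo "${jobs}" | grep Running | grep "sudo docker ps" | awk '{print $2}')
if [ -n "${docker_ps_pid}" ]; then
  echo "[NOK] $my_ip 'docker ps' is stalled; 'docker ps' PID is ${docker_ps_pid}"
  exit 1
else
  echo " [OK] 'docker ps'"
fi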
# Check the status of required services
services="
travis-worker
docker
"
for service in $services; do
  # Dirty hack: sometimes I test this on my own machine.
  # Use systemctl where it exists; otherwise fall back to the Upstart status command.
  if type -a systemctl > /dev/null 2>&1; then
    status_cmd="systemctl show $service --property=SubState --value"
    expected_result="running"
  else
    status_cmd="status $service"
    expected_result="running"
  fi
  service_status="$(eval "$status_cmd")"
  service_ok=$(echo "${service_status}" | grep "$expected_result")
  if [ -z "${service_ok}" ]; then
    echo "[NOK] $my_ip $service. status is: ${service_status}"
    exit 1
  else
    echo " [OK] $service"
  fi
done