worker icon indicating copy to clipboard operation
worker copied to clipboard

Verify bootstrap success and instance health

Open soulshake opened this issue 6 years ago • 3 comments

Instances sometimes fail to bootstrap:

  • Start hooks can fail to download [1]
  • SSH public keys can fail to download

Instances sometimes become unhealthy in ways that aren't measured by our health checks:

  • Certain Docker commands hang forever [2]
  • Abusive jobs can hog CPU (to be addressed in https://github.com/travis-ci/worker/pull/366)

We need a way to ensure instance health at bootstrap and on an ongoing basis. I'd like to use this issue as a place to brainstorm on design. (If a similar issue already exists somewhere, please point me to it!)

I think if I needed such a check on a bunch of my own servers, I'd use an approach like the following:

  • Create a /tmp/health directory
  • make the cloud init script write results to this directory, e.g. /tmp/health/cloud-init.ok if everything completed successfully, /tmp/health/cloud-init.nok if any errors were encountered
  • Use a cron job to occasionally check the status of required services (docker, travis-worker) and take appropriate action (e.g. restarting Docker, imploding the instance)

One problem: The only way I know to confirm that docker isn't working as expected is to try a command, e.g. docker ps, and observe that it just hangs forever. I'm not sure how to check this in a script without making the script hang forever, too. Maybe we could:

  • run docker ps&, wait a few seconds, then check if a process with that PID is still running?
  • check the modification date on docker log file?

Thoughts?

soulshake avatar Sep 11 '17 11:09 soulshake

I would love to add another layer of health monitoring. Based on what you're writing above, I believe this would be specific to the instances brought up in autoscaling groups on EC2, and that we would add such a layer of health monitoring at the terraform-config level.

With regard to the specific case of docker ps not coming back, I am in favor of treating a timeout as a failure condition and imploding the host. I think it's great if we can do this with bash, but I'm also happy to use a different programming language that's already present on the system such as python.

meatballhat avatar Sep 11 '17 14:09 meatballhat

I think that some of the other health checks mentioned could be integrated into the worker "prestart hook" script, too, so that workers that are unhealthy never come into service: https://github.com/travis-infrastructure/terraform-config/blob/master/modules/aws_asg/prestart-hook.bash

meatballhat avatar Sep 11 '17 14:09 meatballhat

As a temporary workaround to locate dead workers, here's a script I'm using:

#!/bin/bash
# Usage:
# get instance ips: $HOME/git/travis/bin/private-ec2-ips.sh > ips.txt
# Then, from a bastion:
# parallel-scp -h ips.txt check-health.sh /tmp/check-health.sh
# parallel-ssh -x '-tt' -O RequestTTY=force -h ips.txt -o outdir -e errdir bash -c /tmp/check-health.sh
# grep NOK outdir/*

# In some cases, the services can be successfully restarted via SSH to the instance:
# service travis-worker restart
# sudo restart docker

my_ip="$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)"

# Sometimes, the docker service will be running, but certain commands (docker ps) will hang indefinitely.
sudo docker ps  > /dev/null 2>&1 &
sleep 3
jobs=$(jobs -l)

docker_ps_pid=$(echo "${jobs}" | grep -v Done | grep "sudo docker ps" | awk '{printf $2}')

if [ ! -z "${docker_ps_pid}" ]; then
    echo "[NOK] $my_ip 'docker ps' is stalled; 'docker ps' PID is ${docker_ps_pid}"
    exit 1
else
    echo " [OK] 'docker ps'"
fi


# Check the status of required services
services="
    travis-worker
    docker
"

for service in $services; do
    # Dirty hack sometimes I test this on my own machine
    type -a systemctl > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        status_cmd="systemctl show $service --property=SubState --value"
        expected_result="running"
    else
        status_cmd="status $service"
        expected_result="running"
    fi

    service_status="$(eval $status_cmd)"
    service_ok=$(echo "${service_status}" | grep "$expected_result")

    if [ -z "${service_ok}" ]; then
        echo "[NOK] $my_ip $service. status is: ${service_status}"
        exit 1
    else
        echo " [OK] $service"
    fi
done

soulshake avatar Sep 12 '17 14:09 soulshake