Full event bus causing job timeouts.
Background
We have uncovered what might be a bug with the job event bus overflowing when running jobs on an interval. In our case, we have scripts running on 30s intervals over a long period. Over time, as data has flowed into the system, these scripts have sometimes taken longer than the interval to execute, which we countered by setting a timeout greater than the interval. We have multiple containers with different job configurations, and we have seen that after a varying amount of time (usually a couple of hours) the jobs start to always time out even though they appear to execute successfully. This causes ContainerPilot to, for example, mark health checks as failed even though the health check command actually succeeded.
Containerpilot version
3.6.1
Problem location
We think we have pinpointed the problem to https://github.com/joyent/containerpilot/blob/7208e025644e9d9b0ca06d26ee97a4c612af7788/commands/commands.go#L158 where somehow the event never gets published. ContainerPilot waits for that line to finish executing within the configured deadline; when it does not, the Context expires and the job "times out".
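For illustration, here is a minimal, self-contained Go sketch of the failure mode described above (this is not ContainerPilot's actual code; the channel size and event names are made up). The command has already exited, but publishing its exit event blocks on a full buffered channel, so the configured deadline fires first and the job is reported as timed out:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// A tiny buffered channel standing in for a job's event queue.
	events := make(chan string, 2)
	events <- "TimerExpired"
	events <- "TimerExpired" // the queue is now full and nothing drains it

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	done := make(chan struct{})
	go func() {
		// The command already exited successfully; publishing its
		// ExitSuccess event blocks because the queue is full.
		events <- "ExitSuccess"
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("event published")
	case <-ctx.Done():
		// This branch fires: the deadline expires while the publish
		// is still blocked, so the job is reported as timed out.
		fmt.Println("timed out, even though the command already succeeded")
	}
}
```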
Reproduction
To reproduce the problem I have created a rather extreme and unrealistic configuration so the problem shows up quickly. Instead of a job that takes 20-40 seconds and runs every 30 seconds, I have a job that takes 10 seconds and runs every 10 milliseconds.
containerpilot.json5:
{
  consul: "consul:8500",
  logging: {
    level: "DEBUG",
    format: "default",
    output: "stdout"
  },
  jobs: [
    {
      name: "test",
      restarts: "unlimited",
      port: 80,
      health: {
        exec: "/test.py",
        interval: 20,
        ttl: 20,
        timeout: "20s",
      },
    },
    {
      name: "interval",
      restarts: "unlimited",
      exec: ["/test.py"],
      timeout: "20s",
      when: {
        interval: "10ms"
      }
    },
  ],
  control: {
    socket: "/var/run/containerpilot.socket"
  }
}
test.py:
#!/usr/bin/env python3
import time
import sys
import logging

logger = logging.getLogger(__name__)


def _setup_logging(logger, loglevel=logging.INFO):
    root = logging.getLogger()
    root.setLevel(loglevel)
    formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    logger.addHandler(handler)


def main():
    logger.warn('FUNNY OUTPUT')
    time.sleep(10)
    sys.exit(0)


if __name__ == '__main__':
    sys.exit(main())
On my machine the above configuration starts hitting the problem in about 10-15 seconds, producing the following logs when the health check and the interval job run.
2018-02-20T18:26:42.77342415+01:00 interval.Run start
2018-02-20T18:26:42.773593279+01:00 timer: {TimerExpired interval.run-every}
2018-02-20T18:26:42.864558348+01:00 FUNNY OUTPUT
2018-02-20T18:26:52.901450783+01:00 interval exited without error
2018-02-20T18:27:02.77381208+01:00 interval timeout after 20s: '[]'
2018-02-20T18:27:02.773881327+01:00 interval.kill
2018-02-20T18:27:02.773908902+01:00 killing command 'interval' at pid: 60
2018-02-20T18:28:22.80610908+01:00 check.test.Run start
2018-02-20T18:28:22.847114529+01:00 FUNNY OUTPUT
2018-02-20T18:28:32.88316921+01:00 check.test exited without error
2018-02-20T18:28:42.757582579+01:00 timer: {TimerExpired test.heartbeat}
2018-02-20T18:28:42.806392321+01:00 check.test timeout after 20s: '[]'
2018-02-20T18:28:42.806441683+01:00 check.test.kill
2018-02-20T18:28:42.806461393+01:00 killing command 'check.test' at pid: 71
The logs clearly show that the jobs actually execute, as seen from the "FUNNY OUTPUT" and "* exited without error" lines. However, the events are never sent, which can be seen from the missing event: {ExitSuccess interval} and event: {ExitSuccess check.test} log entries. Those entries should be the last thing logged before a job finishes.
Workaround
While it might be unwise to run interval jobs on a shorter interval than their actual execution time, it can be hard to pick an appropriate interval for a job with varying execution time. We think we have found a workaround: instead of using the interval functionality, we use regular jobs with `restarts: unlimited` and put the sleeps inside the scripts themselves.
We suspect the event bus is getting full because when I tested increasing the buffer size by 10x at https://github.com/joyent/containerpilot/blob/7208e025644e9d9b0ca06d26ee97a4c612af7788/jobs/jobs.go#L23 the above configuration was able to run roughly 10x longer before the jobs started timing out.
This seems to be because https://github.com/joyent/containerpilot/blob/7208e025644e9d9b0ca06d26ee97a4c612af7788/events/bus.go#L137 tries to send an Event to the same event channel that is filled up by all the events.TimerExpired events, which makes it wait until a slot frees up on the channel.
Looking at the subscribers on the bus, the two jobs seem to be subscribed to each other, which means the events.ExitSuccess of the health check command is published to the interval job, whose event queue is full, which in turn makes the health check command "time out". So one job with a full event queue can block the execution of other jobs.
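To illustrate that cross-job coupling, here is a toy fan-out bus in Go (purely illustrative; this is not ContainerPilot's actual events.Bus, and the names and buffer sizes are made up). Publishing delivers to each subscriber's channel in turn, so one subscriber whose queue is full of TimerExpired events stalls delivery of another job's ExitSuccess:

```go
package main

import (
	"fmt"
	"time"
)

// A toy fan-out bus standing in for the real event bus.
type bus struct {
	subscribers []chan string
}

// publish delivers an event to every subscriber in turn. If any one
// subscriber's channel is full, the send blocks, stalling delivery to
// everyone else as well.
func (b *bus) publish(event string) {
	for _, ch := range b.subscribers {
		ch <- event
	}
}

func main() {
	healthCheck := make(chan string, 4)
	intervalJob := make(chan string, 4)
	b := &bus{subscribers: []chan string{healthCheck, intervalJob}}

	// The interval job is still busy running its exec, so nothing drains
	// its queue while TimerExpired events keep arriving and fill it up.
	for i := 0; i < cap(intervalJob); i++ {
		intervalJob <- "TimerExpired"
	}

	published := make(chan struct{})
	go func() {
		// The health check finished fine; publishing its ExitSuccess
		// now hangs on the interval job's full queue.
		b.publish("ExitSuccess check.test")
		close(published)
	}()

	select {
	case <-published:
		fmt.Println("ExitSuccess delivered")
	case <-time.After(2 * time.Second):
		fmt.Println("ExitSuccess still undelivered: one full queue blocks other jobs")
	}
}
```

Run as-is, the sketch prints the second message after two seconds, because the ExitSuccess send is still parked on the interval job's full queue.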
First off, thank you for reporting this issue. It is definitely worth receiving "in the field" experiences because the variables involved are wildly different between use cases. There are a lot of situations where ContainerPilot can be useful and others where it might not perform well or make sense at all.
Generally, I feel like interval jobs are always going to be handled better within your own application, as a long running task. You have more freedom to decide what and where things are processed and what failure modes you might need to respond to.
The processing of health checks is definitely something we'd need to keep our eye on.
The "event bus" makes up two unique buffers in this part of the code, a ring buffer and a Go channel buffer. The channel is used to enqueue events into the ring buffer. It sounds like some sort of process is leaking and/or not completing because a ring buffer should be processed in constant time.
I'm reluctant to add much else to the conversation at the moment. I'll have to verify and observe the behavior myself, possibly outputting the state of the buffer in order to monitor things.
We replaced all of our interval jobs and now handle the waiting inside the jobs themselves.
Having a job that runs on an interval with a timeout longer than the interval might eventually lead to the above once a container has been running for a long time. It might be worth considering enforcing the timeout to be less than or equal to the interval? Having a job that might take 60 seconds running every 30 seconds doesn't make sense anyway, and in the case of varying execution time there are other options, such as handling the delay inside the job itself.
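If such a constraint were added, the check could be as simple as the following sketch (the struct and field names here are illustrative and do not match ContainerPilot's actual config types):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative only: reject a timeout longer than the scheduling interval
// at validation time instead of letting the event queue back up at runtime.
type intervalJobConfig struct {
	Name     string
	Interval time.Duration
	Timeout  time.Duration
}

func validate(cfg intervalJobConfig) error {
	if cfg.Timeout > cfg.Interval {
		return fmt.Errorf(
			"job %q: timeout %s exceeds interval %s; the job could still be running when the next TimerExpired event fires",
			cfg.Name, cfg.Timeout, cfg.Interval)
	}
	return nil
}

func main() {
	cfg := intervalJobConfig{Name: "interval", Interval: 30 * time.Second, Timeout: 60 * time.Second}
	if err := validate(cfg); err != nil {
		fmt.Println("config error:", err)
	}
}
```

Whether this should be a hard error or just a warning is a design choice, since a hard error would break configs that currently rely on timeout being greater than interval.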
Hello,
I believe we just hit this issue in our production deployment of https://github.com/autopilotpattern/mysql. We have the MySQL health check running with an interval of 10s and a timeout of 30s. We don't have lots of scheduled jobs running like you did, only one snapshot task every hour. I see the script exit with status 0 (check.mysql-primary exited without error), but then ContainerPilot doesn't log the event: {ExitSuccess check.mysql-primary} entry like it usually does.
Running /etc/containerpilot.json
{
  "jobs":[
    {
      "restarts":0,
      "stopTimeout":"10s",
      "name":"preStart",
      "timeout":"120s",
      "exec":"python /usr/local/bin/manage.py"
    },
    {
      "port":3306,
      "health":{
        "ttl":120,
        "interval":10,
        "timeout":30,
        "exec":"python /usr/local/bin/manage.py health"
      },
      "when":{
        "source":"preStart",
        "once":"stopped"
      },
      "name":"mysql-primary",
      "exec":[
        "mysqld",
        "--console",
        "--log-bin=mysql-bin",
        "--log_slave_updates=ON",
        "--gtid-mode=ON",
        "--enforce-gtid-consistency=ON"
      ]
    },
    {
      "when":{
        "source":"watch.mysql-primary",
        "each":"changed"
      },
      "name":"onChange",
      "exec":"python /usr/local/bin/manage.py on_change"
    },
    {
      "when":{
        "interval":"5m"
      },
      "name":"snapshot-check",
      "timeout":"10m",
      "exec":"python /usr/local/bin/manage.py snapshot_task"
    }
  ],
  "consul":"localhost:8500",
  "telemetry":{
    "port":9090
  },
  "logging":{
    "level":"DEBUG"
  },
  "watches":[
    {
      "interval":10,
      "name":"mysql-primary"
    }
  ]
}
Log causing failures, note the lines marked with **
2018-04-29T16:48:39.248046797Z DEBUG manage [health] consul.mark_as_primary end
2018-04-29T16:48:39.248084674Z DEBUG manage [health] consul.unlock_failover start
2018-04-29T16:48:39.248146903Z DEBUG manage [health] consul.unlock_failover end
2018-04-29T16:48:39.24817468Z DEBUG manage [health] node.health end
**2018-04-29T16:48:39.285789194Z check.mysql-primary exited without error**
2018-04-29T16:48:43.759939183Z timer: {TimerExpired watch.mysql-primary.poll}
2018-04-29T16:48:53.75972545Z timer: {TimerExpired watch.mysql-primary.poll}
**2018/04/29 16:48:53 [WARN] agent: Check 'service:mysql-primary-95c9db29f0fe' missed TTL, is now critical**
2018/04/29 16:48:53 [INFO] agent: Synced check 'service:mysql-primary-95c9db29f0fe'
2018/04/29 16:48:55 [INFO] agent: Synced check 'service:containerpilot-95c9db29f0fe'
2018-04-29T16:48:58.760633869Z timer: {TimerExpired containerpilot.heartbeat}
2018-04-29T16:49:03.759480033Z timer: {TimerExpired watch.mysql-primary.poll}
Healthy log
2018-04-29T19:25:53.951816463Z DEBUG manage [health] node.health end
**2018-04-29T19:25:54.017636715Z check.mysql-primary exited without error**
**2018-04-29T19:25:54.017671577Z event: {ExitSuccess check.mysql-primary}**
2018-04-29T19:25:54.0176981Z check.mysql-primary.Run end
2018-04-29T19:25:54.017727709Z check.mysql-primary.term
2018-04-29T19:25:54.017733556Z terminating command 'check.mysql-primary' at pid: 39072
2018-04-29T19:25:54.01775097Z event: {StatusHealthy mysql-primary}