
Allow to start a job after n other jobs have started

Open mterron opened this issue 7 years ago • 8 comments

Sometimes, we need to define job dependencies that are non-linear. Given jobs A,B & C, job C might depend on A & B being healthy, however A doesn't depend on B or B on A.

At the moment, the only way I could find to express this dependency graph was to create an artificial dependency between A & B and then make C depend on B. This serializes A and B and slows down startup.

I suggest that something like this could be implemented:

jobs: [
  {
    name: "A",
    exec: "A.sh",
  },
  {
    name: "B",
    exec: "B.sh",
  },
  {
    name: "C",
    exec: "C.sh",
    when: {
      source: ["A","B"],
      once: "healthy"
    }
  }
]

mterron avatar Jul 10 '17 21:07 mterron

The big-picture need for this seems sound. The details look complicated. I think we need to explore the edge cases, particularly around each vs once and some of the non-health-related events. I also want to make sure that adding the flexibility doesn't make it much more difficult for an end-user to understand what's going on. Here are three general cases that I have concerns about, but I'd love it if we could explore any further cases:


Case 1: multiple sources, once healthy

when: {
  source: ["A", "B"],
  once: "healthy"
}

This was your original example. Note that there's an implicit AND here: we're saying execute one time, after both A and B are healthy. One corner case: what should we expect to happen if A becomes healthy, then A becomes unhealthy, and then B becomes healthy? We respond to events, not state, so each job would have to track not just its own state but the state of its triggering events as well.

It looks like the case of exitSuccess, exitFailed, and changed all have the same set of state behaviors.
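
To make the Case 1 corner case concrete, here is a minimal shell sketch of what "tracking the state of the triggering events" could look like. It is not ContainerPilot code: on_event, SOURCES, healthy, and fired are hypothetical names, and the sketch assumes one particular answer to the question, namely that the job fires only when the most recent event from every source is healthy.

```sh
#!/bin/sh
# Hypothetical sketch, not ContainerPilot code. Models
#   when: { source: ["A", "B"], once: "healthy" }
# under the assumption that all sources must be healthy at the same time.

SOURCES="A B"   # sources the job waits on
healthy=""      # space-separated sources whose last event was "healthy"
fired=0         # "once": set to 1 the first time the job fires, never reset

# on_event SOURCE EVENT: update the tracked state, then fire if possible.
on_event() {
    src=$1
    event=$2
    case "$event" in
        healthy)
            case " $healthy " in
                *" $src "*) ;;                  # already recorded
                *) healthy="$healthy $src" ;;
            esac
            ;;
        unhealthy)
            kept=""
            for s in $healthy; do
                [ "$s" = "$src" ] || kept="$kept $s"
            done
            healthy=$kept
            ;;
    esac

    [ "$fired" -eq 0 ] || return 0              # already fired once
    for s in $SOURCES; do
        case " $healthy " in
            *" $s "*) ;;
            *) return 0 ;;                      # still waiting on $s
        esac
    done
    fired=1
}
```

Replaying A healthy, A unhealthy, B healthy leaves fired at 0 under this reading, and a later A healthy sets it to 1. An implementation that only remembers having seen a healthy event from each source would instead fire after B healthy, which is exactly the ambiguity raised above.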


Case 2: multiple sources, each healthy

when: {
  source: ["A", "B"],
  each: "healthy"
}

This case takes the previous case and complicates it. The language of "each" kind of implies that we're now OR'ing the health states rather than AND'ing them, but it explicitly means that we run the job on each healthy event.

Like case 1, it looks like the case of exitSuccess, exitFailed, and changed all have the same set of state behaviors.


Case 3: multiple sources, once stopping

when: {
  source: ["A", "B"],
  once: "stopping"
}

We have state tracking again as per case 1. In this case we're responding to an event, but that event signals that we've entered an implicit "stopping state" that exists until we receive the stopped event. So even if we track state as per case 1 and 2 above, what would be the expected behavior if A fires stopping, A fires stopped, and then B fires stopping?
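
The question can be made concrete with a small shell sketch that replays that exact sequence under two possible readings. This is an illustration only: replay and the policy names strict and sticky are hypothetical and do not correspond to anything in ContainerPilot.

```sh
#!/bin/sh
# Hypothetical sketch, not ContainerPilot code. Replays
#   A stopping, A stopped, B stopping
# for a job with: when: { source: ["A", "B"], once: "stopping" }

SOURCES="A B"

# replay POLICY: sets $fired to 1 if the job would run, else 0.
#   strict: a source counts as stopping only until its "stopped" event
#   sticky: once a source has fired "stopping", it counts from then on
replay() {
    policy=$1
    stopping=""
    fired=0
    for ev in A:stopping A:stopped B:stopping; do
        src=${ev%%:*}
        name=${ev#*:}
        case "$name" in
            stopping) stopping="$stopping $src" ;;
            stopped)
                if [ "$policy" = strict ]; then
                    kept=""
                    for s in $stopping; do
                        [ "$s" = "$src" ] || kept="$kept $s"
                    done
                    stopping=$kept
                fi
                ;;
        esac
        # Fire once every source is currently counted as stopping.
        all=1
        for s in $SOURCES; do
            case " $stopping " in
                *" $s "*) ;;
                *) all=0 ;;
            esac
        done
        if [ "$all" -eq 1 ]; then
            fired=1
        fi
    done
    return 0
}
```

Under strict the job never runs for this sequence, because A has left the stopping state before B enters it. Under sticky it runs when B fires stopping. Either behavior is defensible, which is why the expected result needs to be decided explicitly.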

tgross avatar Jul 11 '17 17:07 tgross

Curious how the state tracking will take place. Isn't the event bus already holding this state and you just need this type of job's event to observe subsequent events in order to fire?

I'd have to dig but I'm unsure if the bus was designed in that way. My hope would be that you could remove the hard dependency tracking out of some sort of global state manager and into already existing behavior.

jwreagor avatar Jul 19 '17 13:07 jwreagor

The bus is a dumb publisher. Each job tracks its own state (via things like restartsRemain or startEvent), which is why we did things like set the start event to NonEvent in #438.
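
As a rough illustration of that split, here is a shell sketch in which the bus only forwards events and the job holds all of the state. The names publish, job_c, c_start_event, and c_runs are hypothetical, and clearing the start event after the first run is only loosely analogous to the NonEvent approach mentioned above.

```sh
#!/bin/sh
# Hypothetical sketch; names are illustrative, not ContainerPilot's.

# The bus keeps no history: it forwards each event to every subscriber.
SUBSCRIBERS="job_c"
publish() {
    for sub in $SUBSCRIBERS; do
        "$sub" "$1" "$2"
    done
}

# Job C owns its state: the event it is waiting for and a run counter.
c_start_event="A:healthy"
c_runs=0
job_c() {
    if [ "$1:$2" = "$c_start_event" ]; then
        c_runs=$((c_runs + 1))
        c_start_event=""    # stop listening, so "once" really means once
    fi
}
```

Publishing A healthy twice runs the job a single time, and nothing about that decision lives in the bus. Supporting multiple sources would mean growing the per-job state rather than the bus.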

tgross avatar Jul 19 '17 14:07 tgross

Of course, right where it was yesterday. I consistently overthink the utility of that bus.

jwreagor avatar Jul 19 '17 14:07 jwreagor

I see this is more complicated than I thought. Is there any other initiative to add state tracking to CP? I'm happy to keep using my "solution" if that's the way it is. I just thought it was a valid use case.

As an MVP, would it be simpler if there was only support for once: healthy or once: exitSuccess, as in "after these n things are healthy/started, launch", and then it is up to the app to react to events and other dependencies going down?

mterron avatar Jul 25 '17 00:07 mterron

Is there any other initiative to add state tracking to CP? I'm happy to keep using my "solution" if that's the way it is. I just thought it was a valid use case.

It does seem valid, for sure. But yeah it's just complicated. We don't have any other initiative doing state tracking other than the state of the job itself.

As an MVP, would it be simpler if there was only support for once: healthy or once: exitSuccess, as in "after these n things are healthy/started, launch", and then it is up to the app to react to events and other dependencies going down?

That might be plausible. I do worry such a restriction on having multiple each or multiple stopping event handlers might seem arbitrary to users, but we have other places where we've had to say "we just don't support that because supporting it will be even more confusing".

tgross avatar Jul 26 '17 18:07 tgross

Noting for myself that there's a lot of under-the-hood implementation overlap between the issues in #435, #416, and #396.

tgross avatar Aug 03 '17 13:08 tgross

Hi,

We have a case that is related to this issue and also #416 and #518, where we hit a race condition between an on-change job and a pre-start job. Given the following containerpilot jobs:

    {
      name: 'pre-start',
      exec: '/usr/local/bin/app-manage preStart',
      when: {
        source: 'watch.squid-gcp-proxy',
        once: 'healthy'
      }
    },
    {
      name: 'on-change-squid-gcp-proxy',
      exec: '/usr/local/bin/app-manage reload',
      when: {
        source: 'watch.squid-gcp-proxy',
        each: 'changed'
      }
    },
    {
      name: 'apache-fwdproxy',
      exec: '/usr/local/apache/bin/apachectl -Xf /etc/apache-fwdproxy/httpd.conf -k start -D APACHE-FWDPROXY',
      restarts: 3,
      port: '33000',
      health: {
        exec: '/usr/local/bin/app-manage health',
        interval: 10,
        ttl: 30,
        timeout: 3,
      },
      tags: [
        'apache',
        'googleproxy'
      ],
      consul: {
        enableTagOverride: true,
        deregisterCriticalServiceAfter: '10m'
      },
      when: {
        source: 'pre-start',
        once: 'exitSuccess'
      }
    }

...And the script functions as follows:

preStart() {
    _log "Configuring application"
    touch /usr/local/apache/htdocs/health
    configureApp
}


health() {
    msg=$(curl --fail -sS http://localhost:33000/health)
    status=$?
    if [ "${status}" -ne 0 ]; then
        # Surface any response body, then fail the health check
        echo "${msg}"
        exit "${status}"
    else
        return "${status}"
    fi
}

reload() {
    _log "Configuring application"
    configureApp
    _log "reloading application"
    /usr/local/apache/bin/apachectl \
          -f /etc/apache-fwdproxy/httpd.conf \
          -k graceful \
          -D APACHE-FWDPROXY
}

Sometimes the on-change job runs before apache-fwdproxy has started, so Apache is launched with graceful instead of start and then fails to run or reconfigure in a consistent and reliable fashion.

This issue was resolved by changing reload() to:

reload() {
    # Run the check in a subshell: health() calls exit on failure, which
    # would otherwise end this script before the warning below is logged.
    if (health); then
        _log "Configuring application"
        configureApp
        _log "reloading application"
        /usr/local/apache/bin/apachectl \
            -f /etc/apache-fwdproxy/httpd.conf \
            -k graceful \
            -D APACHE-FWDPROXY
    else
        _log "WARNING: application not running. Can't reload"
    fi
}

I totally understand the design decision to emit both a changed and a healthy event, but it would be really nice to be able to handle this through better functionality in when. Failing that, clearer documentation around the flow of event messages would help, in particular how changed and healthy are both emitted together.

gbmeuk avatar Jan 15 '18 13:01 gbmeuk