containerpilot
Allow starting a job after n other jobs have started
Sometimes we need to define job dependencies that are non-linear. Given jobs A, B, and C, job C might depend on both A and B being healthy, while A and B have no dependency on each other.
At the moment, the only way I could find to express this dependency graph was to create an artificial dependency between A and B and then make C depend on B. This slows down startup.
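For reference, a sketch of that serialized workaround, using the same hypothetical jobs (`A.sh`, `B.sh`, and `C.sh` stand in for the real commands):

```json5
jobs: [
  {
    name: "A",
    exec: "A.sh",
  },
  {
    // artificial dependency: B doesn't actually need A,
    // but chaining them is the only way to gate C on both
    name: "B",
    exec: "B.sh",
    when: {
      source: "A",
      once: "healthy"
    }
  },
  {
    name: "C",
    exec: "C.sh",
    when: {
      source: "B",
      once: "healthy"
    }
  }
]
```

The cost is that B waits for A even though they could start in parallel.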
I suggest that something like this could be implemented:
```json5
jobs: [
  {
    name: "A",
    exec: "A.sh",
  },
  {
    name: "B",
    exec: "B.sh",
  },
  {
    name: "C",
    exec: "C.sh",
    when: {
      source: ["A", "B"],
      once: "healthy"
    }
  }
]
```
The big-picture need for this seems sound. The details look complicated. I think we need to explore the edge cases, particularly around `each` vs. `once` and some of the non-health-related events. I also want to make sure that adding the flexibility doesn't make it much more difficult for an end user to understand what's going on. Here are three general cases that I have concerns about, but I'd love it if we can explore any further cases:
Case 1: multiple sources, once healthy
```json5
when: {
  source: ["A", "B"],
  once: "healthy"
}
```
This was your original example. Note that there's an implicit AND here: we're saying execute one time, after both A and B are healthy. One corner case: what would we expect to happen if A becomes healthy, then A becomes unhealthy, and then B becomes healthy? We respond to events, not state, so that implies that each job will have to track not just its own state but the state of its triggering events as well.
It looks like `exitSuccess`, `exitFailed`, and `changed` all have the same set of state behaviors.
Case 2: multiple sources, each healthy
```json5
when: {
  source: ["A", "B"],
  each: "healthy"
}
```
This case takes the previous case and complicates it. The language of "each" kind of implies that we're now OR'ing the health states rather than AND'ing them, but it explicitly means that we run the job on each `healthy` event.
Like case 1, it looks like `exitSuccess`, `exitFailed`, and `changed` all have the same set of state behaviors.
Case 3: multiple sources, once stopping
```json5
when: {
  source: ["A", "B"],
  once: "stopping"
}
```
We have state tracking again as per case 1. In this case we're responding to an event, but that event signals that we've entered an implicit "stopping state" that exists until we receive the `stopped` event. So even if we track state as per cases 1 and 2 above, what would be the expected behavior if A fires `stopping`, A fires `stopped`, and then B fires `stopping`?
Curious how the state tracking will take place. Isn't the event bus already holding this state, so that this type of job only needs to observe subsequent events in order to fire?
I'd have to dig, but I'm unsure whether the bus was designed that way. My hope would be that you could move the hard dependency tracking out of some sort of global state manager and into already-existing behavior.
The bus is a dumb publisher. Each job tracks its own state (via things like `restartsRemain` or `startEvent`), which is why we did things like set the start event to `NonEvent` in #438.
Of course, right where it was yesterday. I consistently overthink the utility of that bus.
I see this is more complicated than I thought. Is there any other initiative to add state tracking to CP? I'm happy to keep using my "solution" if that's the way it is. I just thought it was a valid use case.
As an MVP, would it be simpler if there was only support for `once: Healthy` or `once: exitSuccess`? As in "after" these n things are healthy/started, launch, and then it's up to the app to react to events and other dependencies going down.
Is there any other initiative to add state tracking to CP? I'm happy to keep using my "solution" if that's the way it is. I just thought it was a valid use case.
It does seem valid, for sure. But yeah it's just complicated. We don't have any other initiative doing state tracking other than the state of the job itself.
As an MVP, would it be simpler if there was only support for once: Healthy or once: exitSuccess? As in "after" these n things are healthy/started, launch, and then it's up to the app to react to events and other dependencies going down.
That might be plausible. I do worry that such a restriction on having multiple `each` or multiple `stopping` event handlers might seem arbitrary to users, but we have other places where we've had to say "we just don't support that, because supporting it would be even more confusing".
Noting for myself that there's a lot of under-the-hood implementation overlap between the issues in #435, #416, and #396.
Hi,
We have a case that is related to this issue and also #416 and #518, where we hit a race condition between an on-change job and a pre-start job. Given the following ContainerPilot jobs:
```json5
{
  name: 'pre-start',
  exec: '/usr/local/bin/app-manage preStart',
  when: {
    source: 'watch.squid-gcp-proxy',
    once: 'healthy'
  }
}
{
  name: 'on-change-squid-gcp-proxy',
  exec: '/usr/local/bin/app-manage reload',
  when: {
    source: 'watch.squid-gcp-proxy',
    each: 'changed'
  }
}
{
  name: 'apache-fwdproxy',
  exec: '/usr/local/apache/bin/apachectl -Xf /etc/apache-fwdproxy/httpd.conf -k start -D APACHE-FWDPROXY',
  restarts: 3,
  port: '33000',
  health: {
    exec: '/usr/local/bin/app-manage health',
    interval: 10,
    ttl: 30,
    timeout: 3,
  },
  tags: [
    'apache',
    'googleproxy'
  ],
  consul: {
    enableTagOverride: true,
    deregisterCriticalServiceAfter: '10m'
  },
  when: {
    source: 'pre-start',
    once: 'exitSuccess'
  }
}
```
...and the script's functions are as follows:

```sh
preStart() {
  _log "Configuring application"
  touch /usr/local/apache/htdocs/health
  configureApp
}

health() {
  msg=$(curl --fail -sS http://localhost:33000/health)
  status=$?
  if [ ${status} -ne 0 ]; then
    echo "${msg}"
    exit ${status}
  else
    return ${status}
  fi
}

reload() {
  _log "Configuring application"
  configureApp
  _log "reloading application"
  /usr/local/apache/bin/apachectl \
    -f /etc/apache-fwdproxy/httpd.conf \
    -k graceful \
    -D APACHE-FWDPROXY
}
```
Sometimes Apache is started with `graceful` instead of `start` and then fails to run or reconfigure in a consistent and reliable fashion.
This issue was resolved by changing `reload()` to:
```sh
reload() {
  health
  if [ $? -eq 0 ]; then
    _log "Configuring application"
    configureApp
    _log "reloading application"
    /usr/local/apache/bin/apachectl \
      -f /etc/apache-fwdproxy/httpd.conf \
      -k graceful \
      -D APACHE-FWDPROXY
  else
    _log "WARNING: application not running. Can't reload"
  fi
}
```
I totally understand the design decision to emit both a `changed` and a `healthy` event, so it would be really nice to be able to handle this with better functionality in `when`, or at least clearer documentation around the flow of event messages, in particular how `changed` and `healthy` are both emitted together.