Persisting parent execution state between leader changes

Open dlsuzuki opened this issue 8 years ago • 2 comments

We have a number of scenarios where dependent job C is a child of scheduled jobs A and B, with A and B separated by several hours. Under normal conditions this works fine, but we've observed problems in the case where the Chronos leader changes or is restarted after A runs and before B runs. The leader change causes the in-memory schedule graph to be reinitialized from scratch, with no knowledge of which edges have already been triggered. In the case described here, B runs at its normal time, but C does not think A has run and therefore does not start.
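To make the failure mode concrete, here is a toy model of it. It is not Chronos's actual code: the class `ToyDependencyGraph` and its methods are hypothetical names, and the only assumption carried over from the description above is that parent completions are tracked in memory and are lost when a new leader rebuilds the graph.

```python
class ToyDependencyGraph:
    """Toy model only: tracks parent completions in memory, nothing is persisted."""

    def __init__(self, parents_of):
        # parents_of maps each child job name to the list of its parent job names.
        self.parents_of = parents_of
        self.completed = {child: set() for child in parents_of}

    def parent_finished(self, parent):
        """Record a parent success and return the children that are now ready."""
        ready = []
        for child, parents in self.parents_of.items():
            if parent in parents:
                self.completed[child].add(parent)
                if self.completed[child] == set(parents):
                    ready.append(child)
                    # Reset so the next cycle starts from zero.
                    self.completed[child].clear()
        return ready


edges = {"C": ["A", "B"]}

# Normal case: A and B both finish under the same leader, so C is triggered.
graph = ToyDependencyGraph(edges)
graph.parent_finished("A")
triggered_normally = graph.parent_finished("B")

# Leader change between A and B: the graph is rebuilt and A's completion is lost.
graph = ToyDependencyGraph(edges)
graph.parent_finished("A")
graph = ToyDependencyGraph(edges)  # new leader reinitializes from scratch
triggered_after_failover = graph.parent_finished("B")
```

In this model `triggered_normally` contains C, while `triggered_after_failover` is empty because the rebuilt graph never saw A complete.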

Is there any plan or workaround to persist the edge invocation counts so that the new leader process can pick up right where the old one left off? A kludgy workaround we're considering is to have a post-launch cleanup script scan the cluster schedule and apply the new markSuccessful API call to any jobs that previously succeeded on their most recent runs.
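A rough sketch of what that cleanup script might look like follows. Everything specific in it is an assumption to check against the Chronos version in use: the base URL, the `/v1/scheduler/jobs` listing path, the `PUT /v1/scheduler/job/success/<name>` path for markSuccessful, and the `name`, `lastSuccess`, and `lastError` fields in the job JSON. The helper names are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime
from urllib.parse import quote

CHRONOS_URL = "http://localhost:4400"  # assumption: adjust to your cluster


def parse_ts(value):
    """Parse an ISO 8601 timestamp string; return None for empty or missing values."""
    if not value:
        return None
    # Older Python versions do not accept a trailing "Z", so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def jobs_to_mark(jobs):
    """Return names of jobs whose most recent run succeeded.

    Assumes each job dict carries "name", "lastSuccess" and "lastError".
    """
    names = []
    for job in jobs:
        last_ok = parse_ts(job.get("lastSuccess"))
        last_err = parse_ts(job.get("lastError"))
        if last_ok and (last_err is None or last_ok > last_err):
            names.append(job["name"])
    return names


def fetch_jobs(base_url=CHRONOS_URL):
    """Fetch the job list (endpoint path is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/v1/scheduler/jobs") as resp:
        return json.load(resp)


def mark_successful(name, base_url=CHRONOS_URL):
    """Call markSuccessful for one job (endpoint path is an assumption)."""
    url = f"{base_url}/v1/scheduler/job/success/{quote(name, safe='')}"
    req = urllib.request.Request(url, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status


def main():
    """Run once after a new leader has come up."""
    for name in jobs_to_mark(fetch_jobs()):
        mark_successful(name)
```

Calling `main()` from a post-launch hook would replay the successes. A real script would probably also need to limit itself to jobs that succeeded in the current cycle, so that a success from a previous cycle is not replayed.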

Note that we explicitly schedule single runs for each job every cycle so that each scheduled parent can be expected to succeed once and then be disabled.

dlsuzuki, Feb 27 '17 19:02

It's a known limitation. Currently I have no plans to add this feature, but I'm available for hire.

brndnmtthws, Mar 01 '17 00:03

We might take you up on that. Either that or we'll need to start actively contributing in the areas that are most important to us.

dlsuzuki, Mar 13 '17 18:03