Make leader wait for expected updates to be visible in the job store, or fail the job
This fixes #3841 by adding an equivalent of Snakemake's --latency-wait option, here --jobStoreTimeout, which is the number of seconds we allow the job store to lag behind the scheduler. If this time elapses and we don't see an ostensibly successful job either remove itself or save an updated version of itself (with an incremented version clock), we will fail the job with a new PARTITION reason (since as far as we know there's a netsplit in the job store distributed system).
We should see this kind of failure in cases where the shared filesystem or other job store storage is lagging badly behind the scheduler (in which case you can increase --jobStoreTimeout), or in cases where the the job's writes are genuinely lost. We will also get warning messages for cases where a job already failed but didn't commit any updates back, which can happen when the scheduler can't even start it.
The messages look like:
Job foobarbaz v1 has no new version available after waiting 30 seconds. Either worker updates to the job store are delayed longer than your --jobStoreTimeout, or the worker trying to run the job vanished.
Marking ostensibly successful job foobarbaz v1 that did not report in to the job store before --jobStoreTimeout as having been partitioned from us.
Right now I've only plugged this in for finished jobs; there are two other points where we refresh jobs from the job store, but I'm not confident we actually expect the jobs to have always been modified, so I am leaving them alone.
Changelog Entry
To be copied to the draft changelog by merger:
- Toil will now wait
--jobStoreTimeoutseconds (default: 30) to see an update to/removal of a job that was run, and will not let the job succeed unless it is seen to make progress. - Toil job descriptions no longer have a
commandfield, and track the link to the job body and the command to invoke the Toil worker separately.
Reviewer Checklist
- [ ] Make sure it is coming from
issues/XXXX-fix-the-thingin the Toil repo, or from an external repo.- [ ] If it is coming from an external repo, make sure to pull it in for CI with:
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing - [ ] If there is no associated issue, create one.
- [ ] If it is coming from an external repo, make sure to pull it in for CI with:
- [ ] Read through the code changes. Make sure that it doesn't have:
- [ ] Addition of trailing whitespace.
- [ ] New variable or member names in
camelCasethat want to be insnake_case. - [ ] New functions without type hints.
- [ ] New functions or classes without informative docstrings.
- [ ] Changes to semantics not reflected in the relevant docstrings.
- [ ] New or changed command line options for Toil workflows that are not reflected in
docs/running/{cliOptions,cwl,wdl}.rst - [ ] New features without tests.
- [ ] Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
- [ ] Finish the review with an overall description of your opinion.
Merger Checklist
- [ ] Make sure the PR passes tests.
- [ ] Make sure the PR has been reviewed since its last modification. If not, review it.
- [ ] Merge with the Github "Squash and merge" feature.
- [ ] If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
- [ ] Copy its recommended changelog entry to the Draft Changelog.
- [ ] Append the issue number in parentheses to the changelog entry.
@adamnovak Is this ready for review?
@DailyDreaming Yes, it should be.
Do you want me to always request review from you specifically when something is ready? Usually what I do is mark it draft if it's definitely not ready, and only send a request if I get bored of waiting for you to come by and review it.
@DailyDreaming Yes, it should be.
Do you want me to always request review from you specifically when something is ready? Usually what I do is mark it draft if it's definitely not ready, and only send a request if I get bored of waiting for you to come by and review it.
@adamnovak It would help me to be flagged for review. I generally skip PRs without passing tests unless I'm flagged specifically, so that I know that despite some flaky test, you think that the PR is in a good state, and I'll look for that test(s), rerun or comment as necessary.
@DailyDreaming This should be ready!
@DailyDreaming It's not letting me dismiss your review to say I did the things you wanted. Maybe because you disn't mark it as needing changes? So to merge this I have to override the branch protection rules, which doesn't seem right.
Or am I really supposed to be getting you to say I did the things you wanted?