runbot
runbot copied to clipboard
Migrate non-regular crons to triggers
Currently out of 12 crons only a small minority are time-based:
- weekly maintenance of the repository cache
- reminder of open PR
- removal of merged PR branches
The cron for FW cron could be argued either way (don't think it's currently plugged into the statuses updates), the one to check linked PRs as well.
But pretty much all of the crons on a 1mn timer are really triggered events in response to:
- a new status (PR validation, staging validation)
- a comment (r+)
- feedback or tags to a PR
- a fetch request
- cascading an update on a forward port
It's not like they generate a lot of load (probably) but moving them to triggers would likely clean up the logs (currently quite full of cron checks which do nothing).
Mergebot crons and dependencies I can think of
- [ ]
fetch_prs_cron(check for PRs to fetch) should depend on addition of a PR to the fetch queue - [x]
process_updated_commits(impact statuses on PRs and stagings), converted in 7cd9afe7f2ff2cbaaabe0ecf6eb364b55aaf8d60 - [ ]
merge_cron(check for progress of stagings, merge into branch on success)- success status on staging
- failure status on staging
- [ ]
staging_cron(check for progress of PRs and create stagings)- transition causes PR to become ready
- set merge method (?)
- move out of draft (?)
- staging cancelled
- [ ]
check_linked_prs_status(warn on linked PRs where only one is ready)- transition causes a PR to become ready (I think?)
- [ ]
feedback_cron(send feedback)- feedback was added to the queue
- [ ]
labels_cron(update labels)- tagging record was created (essentially the tagging queue)
Notes from initial (completely failed) attempt:
- some of the crons clearly have non-obvious dependencies, so the cron triggers should probably be evaluated and implemented individually to not get completely lost
- cron-trigger worker (aka how to run triggered crons as opposed to bulk-run a bunch of enumerated crons) is not clear, the goal is to limit / get rid of the enumeration
2fb26f10fb854f2fed0dfafe709772ec4c5f46dc adds a bridge method on the base object which allows running triggered crons during tests, which seems like a good start but this still needs to be invoked during tests, somehow.
Having thought about / fiddled with this a bit more, I think I've gathered why the previous effort died miserably as I did a straightforward conversion then got left with a mess in my hand, there are two reasons one of which is east to resolve and the other not
Cron ordering / dependencies
The ordering in which the test suite triggers crons is carefully curated such that they flow into one another by dependency, and thus a single cron run should be able to apply all the stuff:
- statuses are transferred from commits to PR and stagings, which allows validating (or failing) stagings and ready-ing PRs
- current stagings are checked, leaving room for immediate re-staging if there are ready PRs
- new stagings are created
- feedback from all previous operations is sent
But this sequencing is only encoded in the test suites, so when it got ignored and the crons got run directly the stagings might be checked before statuses leading to the expected staging success being ignored.
This can be fixed by specifying sequence on the crons, as _process_jobs will run them in that order if they're all ready...
Dynamic triggering
"If they're all ready" is the second issue at hand: ideally crons should trigger one another e.g. if there's no update to staging statuses there's no reason to check for the status state, so the transferring of statuses to stagings should be the one to _trigger the stagings cron.
However _process_jobs fetches the list of ready crons then goes through the entire thing then stops, it does not support cascading triggers. This doesn't seem to have a ready solution, the only thing I can think of is to _process_jobs until all cron triggers are resolved. Possibly straight up reimplement the most salient bits of _process_jobs locally to do that and no more, as technically most of the complexity in _process_jobs is for resilience to a bunch of error conditions which don't really make sense when running mergebot tests:
- version mismatches can't happen
- module in transient states can't happen
- there can't be an other cron worker
So all we need to do is loop around until running the first ready job until _get_all_ready_jobs is exhausted...