Scheduling of failed task after SM restart
Currently, after an SM restart, it reads all information about tasks from the DB and schedules them once again. The problem is that SM treats all tasks 'normally' and schedules them according to `startDate`/`interval`/`cron`/`window`, but it ignores things like whether the last task run failed, even though that failure had influenced the task's next activation (it had been rescheduled with a given backoff).
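To make the difference concrete, here is a minimal Go sketch of the two policies; the type, field names, and backoff formula are my own illustration, not Scylla Manager's actual scheduler code:

```go
package main

import (
	"fmt"
	"time"
)

// Task models only the scheduling state relevant to this issue.
// These fields are hypothetical, not SM's actual types.
type Task struct {
	Interval  time.Duration // e.g. 24h for the "1d" repair task below
	LastStart time.Time     // start time of the most recent run
	LastRunOK bool          // whether the most recent run succeeded
	RetryNo   int           // consecutive failed runs so far (>= 1 after a failure)
	RetryWait time.Duration // base backoff between retries
}

// nextAfterRestart mimics the reported behavior: the next activation is
// derived purely from the schedule, so a pending retry backoff is lost.
func nextAfterRestart(t Task) time.Time {
	return t.LastStart.Add(t.Interval)
}

// nextPreservingBackoff is one way the expected behavior could look:
// a failed run keeps its (here: exponential) backoff across restarts.
func nextPreservingBackoff(t Task) time.Time {
	if t.LastRunOK {
		return t.LastStart.Add(t.Interval)
	}
	backoff := t.RetryWait << (t.RetryNo - 1) // 10m, 20m, 40m, ...
	return t.LastStart.Add(backoff)
}

func main() {
	// A daily repair task whose first run failed ~2 minutes ago,
	// mirroring the ERROR (0/3) task in the tables below.
	t := Task{
		Interval:  24 * time.Hour,
		LastStart: time.Now().Add(-2 * time.Minute),
		LastRunOK: false,
		RetryNo:   1,
		RetryWait: 10 * time.Minute,
	}
	fmt.Println("ignoring backoff:  ", time.Until(nextAfterRestart(t)).Round(time.Second))      // ~24h away
	fmt.Println("preserving backoff:", time.Until(nextPreservingBackoff(t)).Round(time.Second)) // ~8m away
}
```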
Example of `sctool tasks` output before SM restart:
╭─────────────────────────────────────────────┬──────────────┬────────┬──────────────────┬─────────┬───────┬──────────────┬────────────┬─────────────┬────────────────╮
│ Task │ Schedule │ Window │ Timezone │ Success │ Error │ Last Success │ Last Error │ Status │ Next │
├─────────────────────────────────────────────┼──────────────┼────────┼──────────────────┼─────────┼───────┼──────────────┼────────────┼─────────────┼────────────────┤
│ healthcheck/alternator │ @every 15s │ │ America/New_York │ 6 │ 0 │ 5s ago │ │ DONE │ in 9s │
│ healthcheck/cql │ @every 15s │ │ America/New_York │ 6 │ 0 │ 10s ago │ │ DONE │ in 4s │
│ healthcheck/rest │ @every 1m0s │ │ America/New_York │ 1 │ 0 │ 40s ago │ │ DONE │ in 19s │
│ repair/all-weekly │ 0 23 * * SAT │ │ America/New_York │ 0 │ 0 │ │ │ NEW │ in 6d15h53m19s │
│ repair/5f2554df-44f6-467d-81e4-3920069a5ee6 │ 1d │ │ Europe/Warsaw │ 0 │ 1 │ │ 1m6s ago │ ERROR (0/3) │ in 8m53s │
╰─────────────────────────────────────────────┴──────────────┴────────┴──────────────────┴─────────┴───────┴──────────────┴────────────┴─────────────┴────────────────╯
And after restart:
╭─────────────────────────────────────────────┬──────────────┬────────┬──────────────────┬─────────┬───────┬──────────────┬────────────┬────────┬────────────────╮
│ Task │ Schedule │ Window │ Timezone │ Success │ Error │ Last Success │ Last Error │ Status │ Next │
├─────────────────────────────────────────────┼──────────────┼────────┼──────────────────┼─────────┼───────┼──────────────┼────────────┼────────┼────────────────┤
│ healthcheck/alternator │ @every 15s │ │ America/New_York │ 7 │ 0 │ 20s ago │ │ DONE │ in 8s │
│ healthcheck/cql │ @every 15s │ │ America/New_York │ 8 │ 0 │ 10s ago │ │ DONE │ in 8s │
│ healthcheck/rest │ @every 1m0s │ │ America/New_York │ 2 │ 0 │ 10s ago │ │ DONE │ in 53s │
│ repair/all-weekly │ 0 23 * * SAT │ │ America/New_York │ 0 │ 0 │ │ │ NEW │ in 6d15h52m49s │
│ repair/5f2554df-44f6-467d-81e4-3920069a5ee6 │ 1d │ │ Europe/Warsaw │ 0 │ 1 │ │ 1m35s ago │ ERROR │ in 11h40m6s │
╰─────────────────────────────────────────────┴──────────────┴────────┴──────────────────┴─────────┴───────┴──────────────┴────────────┴────────┴────────────────╯
Is this the expected behavior? Should SM preserve the next task activation after restart? Or should it maybe schedule a failed task to run right after the restart, since perhaps it will now succeed?
@karol-kokoszka
I think it would be best to get @tzach's input here. Let me copy a link to this issue into the manager planning document.
Good catch. There are risks in starting all pending tasks when the Manager restarts:
- there could be more than one such task, which would overload the system
- the time of day might be wrong; the user would not want to start a task during busy hours

As such, I think the current behavior is safer, but it needs to be documented. Users can start an ad-hoc task if they are not happy with this logic.
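For reference, with the SM 3.x CLI an ad-hoc run can be triggered with something like `sctool start repair/5f2554df-44f6-467d-81e4-3920069a5ee6 --cluster <cluster>` (I am assuming the `start` command and `--cluster` flag here, matching the `sctool tasks` output above).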
What is a good place to store this information in the documentation? I see three candidates:
- the general sctool description
- a new subsection in sctool, e.g. "Scylla Manager server restart"
- the `--retry` and `--retry-wait` flag descriptions (because those flags are being "violated" in this case)

I suggest the first (the general sctool description), with a link from the retry flags.