Scheduling of failed task after SM restart
Currently, after an SM restart, it reads all information about tasks from the DB and schedules them once again. The problem is that SM treats all tasks 'normally' and schedules them according to `startDate`/`interval`/`cron`/`window`, but it ignores things like whether the last task run failed, even though that failure had influenced the task's next activation (it had been rescheduled with a given backoff).
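To make the difference concrete, here is a minimal Go sketch of the two policies; the type, field names, and backoff formula are my own illustration, not Scylla Manager's actual scheduler code:

```go
package main

import (
	"fmt"
	"time"
)

// Task models only the scheduling state relevant to this issue.
// These fields are hypothetical, not SM's actual types.
type Task struct {
	Interval  time.Duration // e.g. 24h for the "1d" repair task below
	LastStart time.Time     // start time of the most recent run
	LastRunOK bool          // whether the most recent run succeeded
	RetryNo   int           // consecutive failed runs so far (>= 1 after a failure)
	RetryWait time.Duration // base backoff between retries
}

// nextAfterRestart mimics the reported behavior: the next activation is
// derived purely from the schedule, so a pending retry backoff is lost.
func nextAfterRestart(t Task) time.Time {
	return t.LastStart.Add(t.Interval)
}

// nextPreservingBackoff is one way the expected behavior could look:
// a failed run keeps its (here: exponential) backoff across restarts.
func nextPreservingBackoff(t Task) time.Time {
	if t.LastRunOK {
		return t.LastStart.Add(t.Interval)
	}
	backoff := t.RetryWait << (t.RetryNo - 1) // 10m, 20m, 40m, ...
	return t.LastStart.Add(backoff)
}

func main() {
	// A daily repair task whose first run failed ~2 minutes ago,
	// mirroring the ERROR (0/3) task in the tables below.
	t := Task{
		Interval:  24 * time.Hour,
		LastStart: time.Now().Add(-2 * time.Minute),
		LastRunOK: false,
		RetryNo:   1,
		RetryWait: 10 * time.Minute,
	}
	fmt.Println("ignoring backoff:  ", time.Until(nextAfterRestart(t)).Round(time.Second))      // ~24h away
	fmt.Println("preserving backoff:", time.Until(nextPreservingBackoff(t)).Round(time.Second)) // ~8m away
}
```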
Example of `sctool tasks` output before SM restart:
╭─────────────────────────────────────────────┬──────────────┬────────┬──────────────────┬─────────┬───────┬──────────────┬────────────┬─────────────┬────────────────╮
│ Task │ Schedule │ Window │ Timezone │ Success │ Error │ Last Success │ Last Error │ Status │ Next │
├─────────────────────────────────────────────┼──────────────┼────────┼──────────────────┼─────────┼───────┼──────────────┼────────────┼─────────────┼────────────────┤
│ healthcheck/alternator │ @every 15s │ │ America/New_York │ 6 │ 0 │ 5s ago │ │ DONE │ in 9s │
│ healthcheck/cql │ @every 15s │ │ America/New_York │ 6 │ 0 │ 10s ago │ │ DONE │ in 4s │
│ healthcheck/rest │ @every 1m0s │ │ America/New_York │ 1 │ 0 │ 40s ago │ │ DONE │ in 19s │
│ repair/all-weekly │ 0 23 * * SAT │ │ America/New_York │ 0 │ 0 │ │ │ NEW │ in 6d15h53m19s │
│ repair/5f2554df-44f6-467d-81e4-3920069a5ee6 │ 1d │ │ Europe/Warsaw │ 0 │ 1 │ │ 1m6s ago │ ERROR (0/3) │ in 8m53s │
╰─────────────────────────────────────────────┴──────────────┴────────┴──────────────────┴─────────┴───────┴──────────────┴────────────┴─────────────┴────────────────╯
And after restart:
╭─────────────────────────────────────────────┬──────────────┬────────┬──────────────────┬─────────┬───────┬──────────────┬────────────┬────────┬────────────────╮
│ Task │ Schedule │ Window │ Timezone │ Success │ Error │ Last Success │ Last Error │ Status │ Next │
├─────────────────────────────────────────────┼──────────────┼────────┼──────────────────┼─────────┼───────┼──────────────┼────────────┼────────┼────────────────┤
│ healthcheck/alternator │ @every 15s │ │ America/New_York │ 7 │ 0 │ 20s ago │ │ DONE │ in 8s │
│ healthcheck/cql │ @every 15s │ │ America/New_York │ 8 │ 0 │ 10s ago │ │ DONE │ in 8s │
│ healthcheck/rest │ @every 1m0s │ │ America/New_York │ 2 │ 0 │ 10s ago │ │ DONE │ in 53s │
│ repair/all-weekly │ 0 23 * * SAT │ │ America/New_York │ 0 │ 0 │ │ │ NEW │ in 6d15h52m49s │
│ repair/5f2554df-44f6-467d-81e4-3920069a5ee6 │ 1d │ │ Europe/Warsaw │ 0 │ 1 │ │ 1m35s ago │ ERROR │ in 11h40m6s │
╰─────────────────────────────────────────────┴──────────────┴────────┴──────────────────┴─────────┴───────┴──────────────┴────────────┴────────┴────────────────╯
Is this the expected behavior? Should SM preserve the next task activation after restart? Or should it maybe schedule a failed task to run right after the restart, since perhaps it will now succeed?
@karol-kokoszka
I think it would be best to get @tzach's input here. Let me copy a link to this issue into the manager planning document.
Good catch. There are risks in starting all pending tasks when the Manager restarts:
- there could be more than one such task, which would overload the system
- the time of day might be wrong; the user would not want to start a task during busy hours

As such, I think the current behavior is safer, but it needs to be documented. Users can start an ad-hoc task if they are not happy with this logic.
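For reference, with the SM 3.x CLI an ad-hoc run can be triggered with something like `sctool start repair/5f2554df-44f6-467d-81e4-3920069a5ee6 --cluster <cluster>` (I am assuming the `start` command and `--cluster` flag here, matching the `sctool tasks` output above).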
What is a good place to store this information in the documentation? I see three candidates:
- the general sctool description
- a new subsection in sctool, e.g. "Scylla Manager server restart"
- the `--retry` and `--retry-wait` flag descriptions (because those flags are being "violated" in this case)

I suggest the first (the general sctool description), with a link from the retry flags.