xen-orchestra feat: retry if VM backup fails

Screenshots

Capture d’écran de 2024-02-09 16-39-45 Capture d’écran de 2024-02-22 15-13-16

Description

Do not squash

Fixes #2139

Note that if a backup succeeds after at least one retry, its status will be considered "failed" rather than "success". (due do one bug already present on the code but is now visible with the retry). No change occurs if the backup succeeds on the first attempt.

Checklist

Commit
- Title follows commit conventions
- Reference the relevant issue (Fixes #007, See xoa-support#42, See https://...)
- If bug fix, add Introduced by
Changelog
- If visible by XOA users, add changelog entry
- Update "Packages to release" in CHANGELOG.unreleased.md
PR
- If UI changes, add screenshots
- If not finished or not tested, open as Draft

Jan 15 '24 08:01 MathieuRA

I would like to ensure the retry code doesn't create any spare snapshot. Note that this problem, if confirmed, is already present, and would only be amplified by the retry strategy

@MathieuRA could you test this : a job with rolling snapshot and incremental backup, that works 3 time Then you throw an error during transfer and let the retry fails 5 times. Ideally, the older rolling snapshot shouldn't be overwritten by the 5 of the failing transfer

If confirmed, I think we can either try to reuse the snapshot ( more coupling, but better performance) or delete the snapshot on fail during _run() (decoupled but slower)

Jan 16 '24 16:01 fbeauchamp

if we add the retry around one VM backup :

it will be easy to implement
it won't retry if at least one remote succeeded ( could be alleviated by passing a flag to the vm runner)
if will lead to incorrect snapshot, backup and replication retention depending on which remote failed
it will do a full snapshot and transfer cycle
logs will be marked as failed
_isAlreadyTransferred should not retransfer a successfull incremental vhd to a backup , may be extended to handle a successful replication ?

if we add a retry in callWriter :

it will retry a failing healthcheck
the transfer won't be able to be restarted since the stream will already be consumed ( and only NBD could be modified to restart a stream)

i think we should add it around _copy (for incremental runner) _run (for full runner) :

retry n time only with the failed writer,
on success reset the writers list to the list of writers that has succeed at least once
if there is no more writers : throw

Jan 23 '24 09:01 fbeauchamp

xen-orchestra xen-orchestra copied to clipboard

feat: retry if VM backup fails

Screenshots

Description

Checklist

xen-orchestra
xen-orchestra copied to clipboard