xen-orchestra icon indicating copy to clipboard operation
xen-orchestra copied to clipboard

feat: retry if VM backup fails

Open MathieuRA opened this issue 1 year ago • 2 comments

Screenshots

Capture d’écran de 2024-02-09 16-39-45 Capture d’écran de 2024-02-22 15-13-16

Description

Do not squash

Fixes #2139

Note that if a backup succeeds after at least one retry, its status will be considered "failed" rather than "success". (due do one bug already present on the code but is now visible with the retry). No change occurs if the backup succeeds on the first attempt.

Checklist

  • Commit
    • Title follows commit conventions
    • Reference the relevant issue (Fixes #007, See xoa-support#42, See https://...)
    • If bug fix, add Introduced by
  • Changelog
    • If visible by XOA users, add changelog entry
    • Update "Packages to release" in CHANGELOG.unreleased.md
  • PR
    • If UI changes, add screenshots
    • If not finished or not tested, open as Draft

MathieuRA avatar Jan 15 '24 08:01 MathieuRA

I would like to ensure the retry code doesn't create any spare snapshot. Note that this problem, if confirmed, is already present, and would only be amplified by the retry strategy

@MathieuRA could you test this : a job with rolling snapshot and incremental backup, that works 3 time Then you throw an error during transfer and let the retry fails 5 times. Ideally, the older rolling snapshot shouldn't be overwritten by the 5 of the failing transfer

If confirmed, I think we can either try to reuse the snapshot ( more coupling, but better performance) or delete the snapshot on fail during _run() (decoupled but slower)

fbeauchamp avatar Jan 16 '24 16:01 fbeauchamp

  1. if we add the retry around one VM backup :
  • it will be easy to implement
  • it won't retry if at least one remote succeeded ( could be alleviated by passing a flag to the vm runner)
  • if will lead to incorrect snapshot, backup and replication retention depending on which remote failed
  • it will do a full snapshot and transfer cycle
  • logs will be marked as failed
  • _isAlreadyTransferred should not retransfer a successfull incremental vhd to a backup , may be extended to handle a successful replication ?
  1. if we add a retry in callWriter :
  • it will retry a failing healthcheck
  • the transfer won't be able to be restarted since the stream will already be consumed ( and only NBD could be modified to restart a stream)
  1. i think we should add it around _copy (for incremental runner) _run (for full runner) :
  • retry n time only with the failed writer,
  • on success reset the writers list to the list of writers that has succeed at least once
  • if there is no more writers : throw

fbeauchamp avatar Jan 23 '24 09:01 fbeauchamp