[Backup | Job] Configurable number of retries

Open julien-f opened this issue 7 years ago • 9 comments

When a job call fails, it could be retried automatically a configurable number of times (with a configurable delay between attempts).

  • [ ] server
  • [ ] GUI
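
For illustration only, a minimal sketch of what "a configurable number of retries (and delay)" could look like. It is written in Python to keep it short; `run_with_retries`, `retries`, `delay_seconds` and `sleep` are hypothetical names, not existing Xen Orchestra settings or APIs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    call: Callable[[], T],
    retries: int = 3,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying up to `retries` extra times after a failure."""
    attempt = 0
    while True:
        try:
            return call()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(delay_seconds)  # wait before the next attempt
```

Passing `sleep` as a parameter is only a convenience so the behaviour can be checked without real waiting.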

julien-f avatar May 05 '17 12:05 julien-f

What about retrying a failed VM inside a job (e.g. one VM failed out of 100)? Is it related to this issue, or should I create a new one?

olivierlambert avatar Mar 26 '18 12:03 olivierlambert

Perhaps a bit too obvious, but some things that come to mind here:

  • Only retry jobs that have a chance to succeed on a new run, such as timeouts (VDI chain protection failures should not be retried, I suppose)
  • Do expose the errors that occurred in reports/notifications (or make this optional), so you stay aware of possible bottlenecks that impact the backup schedules
  • Retry jobs after all normal jobs (append to end of the queue)
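
For illustration only, the three points above could be sketched as follows, assuming a simple in-memory queue. The error codes (`TIMEOUT`, `VDI_CHAIN`), `TaskError` and `run_queue` are hypothetical names chosen for this sketch, not Xen Orchestra identifiers, and which codes belong in `RETRYABLE` is exactly the open question.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

# Hypothetical error codes; real Xen Orchestra error names may differ.
RETRYABLE = {"TIMEOUT"}


class TaskError(Exception):
    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def run_queue(
    tasks: Dict[str, Callable[[], None]], max_retries: int = 1
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run tasks in order; failed retryable tasks go to the end of the queue.

    Returns (execution order, every error seen as (task name, error code)).
    """
    queue: Deque[Tuple[str, int]] = deque((name, 0) for name in tasks)
    order: List[str] = []
    errors: List[Tuple[str, str]] = []
    while queue:
        name, attempt = queue.popleft()
        order.append(name)
        try:
            tasks[name]()
        except TaskError as error:
            errors.append((name, error.code))  # kept even if a retry succeeds
            if error.code in RETRYABLE and attempt < max_retries:
                queue.append((name, attempt + 1))  # retry after normal tasks
    return order, errors
```

With tasks `a` (times out once), `b` (succeeds) and `c` (fails with a non-retryable code), the execution order is `a, b, c, a`, and both errors stay in the returned list for the report.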

hkraal avatar Mar 26 '18 12:03 hkraal

  1. VDI chain protection could work if the coalesce finished just a bit after the first attempt. There is no way to guess when it's done, BTW
  2. I'm not sure I get this one. You mean that, despite a retry, we should hide the initial error, is that correct? If so, that makes sense indeed
  3. Good point, we'll see how to manage this

olivierlambert avatar Mar 26 '18 13:03 olivierlambert

> I'm not sure I get this one. You mean that, despite a retry, we should hide the initial error, is that correct? If so, that makes sense indeed

A backup is intended to succeed the first time and should be treated as such. If a backup fails initially but succeeds on a retry, the first error should still be reported, as it might indicate deeper problems.

hkraal avatar Mar 26 '18 18:03 hkraal

I have no idea what to do with this:

  • should we retry for all errors?
  • should we retry the whole VM backup or only the failed step?
  • should we wait before retrying?
  • should we stop retrying if the job is too old?
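
For illustration only, one way to turn most of these questions into explicit settings, sketched in Python. Every field of `RetryPolicy` is a hypothetical option, not an existing Xen Orchestra one, and the exponential backoff is just one possible answer to "should we wait". The sketch does not address whether to retry the whole VM backup or only the failed step.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RetryPolicy:
    # All fields are hypothetical settings, not existing Xen Orchestra options.
    retryable_codes: FrozenSet[str] = frozenset({"TIMEOUT"})
    max_retries: int = 3
    base_delay_seconds: float = 60.0
    max_job_age_seconds: float = 6 * 3600.0

    def next_delay(
        self, code: str, retries_done: int, job_age_seconds: float
    ) -> Optional[float]:
        """Return how long to wait before retrying, or None to give up."""
        if code not in self.retryable_codes:
            return None  # not every error is worth retrying
        if retries_done >= self.max_retries:
            return None  # retry budget spent
        delay = self.base_delay_seconds * 2 ** retries_done  # exponential backoff
        if job_age_seconds + delay > self.max_job_age_seconds:
            return None  # the job would be too old by the time we retry
        return delay
```

With the defaults above, `RetryPolicy().next_delay("TIMEOUT", 0, 0.0)` gives a 60 second wait, while an unknown error code or a job that is already six hours old gives `None`.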

julien-f avatar Feb 27 '19 13:02 julien-f

@julien-f I think this issue predates the "retry all failed jobs" functionality which is now present in XOA. Unless you have a clear reason against it, I would be inclined to close this issue.

hkraal avatar Feb 27 '19 13:02 hkraal

Closing, we'll reopen if necessary 🙂

julien-f avatar Feb 27 '19 14:02 julien-f

Is this still relevant? I know we do some retrying somewhere, but IDK if it's exactly what we meant originally.

olivierlambert avatar Sep 14 '23 08:09 olivierlambert

It would be nice to have an automatic retry when a VM fails to back up. A variety of issues could be resolved automatically after some time. Looking at our own experience, I would like to see an automatic retry of a VM when:

  • A coalesce process is still running (it might be caused by a manually removed snapshot that is still being processed at the start of the backup schedule)
  • A merge worker is still busy on its files
  • A timeout to xapi has occurred during the backup window

This would fix 99% of the job failures (if we see them at all) automatically. I would expect the job to be successful after any of the above occurrences, but I do need to know which errors were automatically resolved (e.g. I expect to see details in a backup report).
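
For illustration only, a sketch of how a backup report could list the errors that a retry resolved. `Attempt` and `summarize` are hypothetical names, and the error strings in the usage note are placeholders rather than real Xen Orchestra error codes.

```python
from typing import Dict, List, NamedTuple


class Attempt(NamedTuple):
    vm: str
    error: str  # empty string means the attempt succeeded


def summarize(attempts: List[Attempt]) -> Dict[str, Dict[str, List[str]]]:
    """Split VMs into those recovered by a retry and those still failing.

    Attempts are expected in chronological order.
    """
    errors: Dict[str, List[str]] = {}
    last_ok: Dict[str, bool] = {}
    for attempt in attempts:
        errors.setdefault(attempt.vm, [])
        if attempt.error:
            errors[attempt.vm].append(attempt.error)
        last_ok[attempt.vm] = not attempt.error
    report: Dict[str, Dict[str, List[str]]] = {"recovered": {}, "failed": {}}
    for vm, seen in errors.items():
        if not seen:
            continue  # succeeded the first time: nothing to report
        report["recovered" if last_ok[vm] else "failed"][vm] = seen
    return report
```

A VM that timed out once and then succeeded ends up under `recovered` with its original error, so the job can count as successful while the report still shows what went wrong on the first attempt.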

hkraal avatar Sep 14 '23 08:09 hkraal