xen-orchestra [Backup | Job] Configurable number of retries

When a job call fails, it could be retried automatically a configured number of times (with a configured delay between attempts).

[ ] server
[ ] GUI
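
To make the request concrete, here is a minimal sketch of a retry wrapper with a configurable retry count and a fixed delay. It is written in Python for brevity and is not xen-orchestra code: `call_with_retries`, `flaky_job_call`, and the injectable `sleep` parameter are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    retries: int,
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying up to `retries` extra times with a fixed delay."""
    attempt = 0
    while True:
        try:
            return call()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(delay_seconds)


# Usage: a hypothetical job call that fails twice, then succeeds.
calls = []


def flaky_job_call() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "done"


waits = []  # record the delays instead of really sleeping
result = call_with_retries(flaky_job_call, retries=3, delay_seconds=60, sleep=waits.append)
```

Passing `sleep` as a parameter is only a convenience so the sketch can be exercised without waiting; a real implementation would presumably schedule the retry rather than block.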

May 05 '17 12:05 julien-f

What about retrying a failed VM inside a job (e.g. one VM failed out of 100)? Is it related to this issue, or should I create a new one?

Mar 26 '18 12:03 olivierlambert

Perhaps just a bit too obvious, but here are some things that come to mind:

Only retry jobs that have a chance of succeeding on a new run, such as timeouts (VDI chain protection failures should not be retried, I suppose)
Expose the errors that occurred in reports/notifications (or make this optional) so you stay aware of possible bottlenecks that impact the backup schedules
Retry jobs after all normal jobs (append them to the end of the queue)
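
The three points above can be sketched together as a small queue runner. This is a Python illustration under stated assumptions, not how xen-orchestra schedules backups: `TransientError`, `QueueResult`, and `run_queue` are hypothetical names, and which errors count as transient is exactly the open question.

```python
from collections import deque
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. timeouts)."""


@dataclass
class QueueResult:
    succeeded: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    # Every error is kept, including those later fixed by a retry.
    errors: list = field(default_factory=list)


def run_queue(jobs, run, max_retries=1):
    queue = deque((job, 0) for job in jobs)
    result = QueueResult()
    while queue:
        job, attempt = queue.popleft()
        try:
            run(job, attempt)
            result.succeeded.append(job)
        except Exception as error:
            result.errors.append((job, attempt, str(error)))
            if isinstance(error, TransientError) and attempt < max_retries:
                # Retry only after all remaining normal jobs have run.
                queue.append((job, attempt + 1))
            else:
                result.failed.append(job)
    return result


# Usage with simulated jobs: vm-a times out once, vm-b fails permanently.
order = []


def run(job, attempt):
    order.append(job)
    if job == "vm-a" and attempt == 0:
        raise TransientError("timeout")
    if job == "vm-b":
        raise RuntimeError("not retryable")


outcome = run_queue(["vm-a", "vm-b", "vm-c"], run)
```

In this run `vm-a` is retried only after `vm-c` has had its turn, and its initial timeout stays in `outcome.errors` even though the retry succeeds.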

Mar 26 '18 12:03 hkraal

A retry after a VDI chain protection failure could work if the coalesce finished just after the first attempt. There is no way to guess when it's done, BTW
I'm not sure I get this one. You mean that, despite a retry, we should hide the initial error, is that correct? If yes, that makes sense, indeed
Good point, we'll see how to manage this

Mar 26 '18 13:03 olivierlambert

I'm not sure I get this one. You mean that, despite a retry, we should hide the initial error, is that correct? If yes, that makes sense, indeed

A backup is intended to succeed on the first attempt and should be treated as such. If a backup fails initially but succeeds on a retry, the first error should still be reported, as it might indicate deeper problems.

Mar 26 '18 18:03 hkraal

I have no idea what to do with this:

Should we retry for all errors?
Should we retry the whole VM backup or only the failed step?
Should we wait before retrying?
Should we stop retrying if the job is too old?
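
One way to frame these questions is as settings on a policy object, so each answer becomes a parameter rather than a hard-coded choice. The Python sketch below is purely illustrative: `RetryPolicy`, `next_attempt_at`, and all default values are assumptions, and it does not address whether to retry the whole VM backup or only the failed step.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 3                         # total attempts, including the first
    delay: timedelta = timedelta(minutes=5)       # wait before retrying
    max_job_age: timedelta = timedelta(hours=6)   # stop retrying once the job is this old
    retry_on: tuple = (TimeoutError,)             # error types considered retryable


def next_attempt_at(policy, error, attempts_made, job_started_at, now):
    """Return when to retry, or None if the policy says to give up."""
    if not isinstance(error, policy.retry_on):
        return None  # not every error is retried
    if attempts_made >= policy.max_attempts:
        return None  # attempt budget exhausted
    retry_time = now + policy.delay
    if retry_time - job_started_at > policy.max_job_age:
        return None  # the job would be too old by then
    return retry_time


# Usage: a timeout ten minutes into the job, after the first attempt.
policy = RetryPolicy()
start = datetime(2019, 2, 27, 1, 0)
retry_time = next_attempt_at(policy, TimeoutError(), 1, start, start + timedelta(minutes=10))
```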

Feb 27 '19 13:02 julien-f

@julien-f I think this issue predates the "retry all failed jobs" functionality which is now present in XOA. Unless you have a clear reason against it, I would be inclined to close this issue.

Feb 27 '19 13:02 hkraal

Closing, we'll reopen if necessary :slightly_smiling_face:

Feb 27 '19 14:02 julien-f

Is it still relevant? I know we do some retry somewhere, but IDK if it's exactly what we meant originally?

Sep 14 '23 08:09 olivierlambert

It would be nice to have an automatic retry when a VM fails to back up. A variety of issues could be resolved automatically after some time. Based on our own experience, I would like to see an automatic retry of a VM when:

A coalesce process is still running (this might be caused by a manually removed snapshot that is still being processed at the start of the backup schedule)
A merge worker is still busy with its files
A timeout to xapi has occurred during the backup window

This would automatically fix 99% of the job failures (if we see them at all). I would expect the job to be successful after any of the above occurrences, but I do need to know which errors were automatically resolved (e.g. I expect to see details in a backup report)
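
As an illustration of the reporting requirement, here is a small Python sketch that keeps the errors from failed attempts and shows them even when a retry succeeded. `VmRun` and `format_report` are hypothetical names and the report layout is invented; this is not the format of XO backup reports.

```python
from dataclasses import dataclass, field


@dataclass
class VmRun:
    vm: str
    errors: list = field(default_factory=list)  # one message per failed attempt
    succeeded: bool = False


def format_report(runs):
    lines = []
    for run in runs:
        if run.succeeded and not run.errors:
            lines.append(f"{run.vm}: success")
        elif run.succeeded:
            # The retry worked, but the earlier errors stay visible.
            resolved = "; ".join(run.errors)
            lines.append(f"{run.vm}: success after retry (resolved: {resolved})")
        else:
            failures = "; ".join(run.errors)
            lines.append(f"{run.vm}: failure ({failures})")
    return "\n".join(lines)


# Usage with simulated results.
report = format_report([
    VmRun("vm-a", succeeded=True),
    VmRun("vm-b", errors=["coalesce still running"], succeeded=True),
    VmRun("vm-c", errors=["xapi timeout", "xapi timeout"]),
])
```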

Sep 14 '23 08:09 hkraal
