xen-orchestra
[Backup | Job] Configurable number of retries
When a job call fails, it could be retried automatically a configurable number of times, with a configurable delay between attempts.
- [ ] server
- [ ] GUI
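To make the request concrete, here is a minimal sketch of the server-side behaviour, assuming a hypothetical `RetrySettings` shape (`retries`, `delayMs`) and a hypothetical `callWithRetry` helper. None of these names come from the xen-orchestra codebase; they only illustrate "retry N times with a delay".

```typescript
// Hypothetical settings: the names are illustrative, not xen-orchestra's API.
interface RetrySettings {
  retries: number; // extra attempts allowed after the first failure
  delayMs: number; // pause between two attempts, in milliseconds
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `call` and retries it on failure, up to `settings.retries` times.
// `wait` is injectable so the delay can be stubbed out in tests.
async function callWithRetry<T>(
  call: () => Promise<T>,
  settings: RetrySettings,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let retriesDone = 0;
  while (true) {
    try {
      return await call();
    } catch (error) {
      if (retriesDone >= settings.retries) {
        throw error; // out of retries: surface the last error
      }
      retriesDone += 1;
      await wait(settings.delayMs);
    }
  }
}
```

With `{ retries: 2, delayMs: 60000 }`, a call would run at most three times, one minute apart. This sketch retries on every error; whether that is desirable is a separate question.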
What about retrying a single failed VM inside a job (e.g. 1 VM failed out of 100)? Is that related to this issue, or should I create a new one?
Perhaps a bit too obvious, but here are the things that come to mind:
- Only retry jobs that have a chance of succeeding on a new run, such as timeouts (VDI chain protection failures should probably not be retried)
- Expose the errors that occurred in reports/notifications (or make this optional) so you stay aware of possible bottlenecks that affect the backup schedules
- Run retries after all normal jobs (append them to the end of the queue)
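The three points above can be sketched as a small queue. This is only an illustration: the `ErrorKind` values, the `Task` and `Failure` shapes, and `runQueue` are hypothetical names, and treating timeouts as the only retryable error is an assumption. The runner is synchronous to keep the example short.

```typescript
// Hypothetical error kinds; the real error codes are not listed in this thread.
type ErrorKind = "TIMEOUT" | "VDI_CHAIN_PROTECTION" | "OTHER";

interface Task {
  vm: string;
  attempt: number; // 0 for the first run, 1 for the first retry, ...
}

interface Failure extends Task {
  kind: ErrorKind;
}

// Assumption: only timeouts are worth a second run.
const RETRYABLE: ReadonlySet<ErrorKind> = new Set<ErrorKind>(["TIMEOUT"]);

function runQueue(
  vms: string[],
  run: (task: Task) => ErrorKind | undefined, // undefined means success
  maxRetries: number,
): { order: string[]; failures: Failure[] } {
  const queue: Task[] = vms.map((vm) => ({ vm, attempt: 0 }));
  const order: string[] = [];
  const failures: Failure[] = [];

  while (queue.length > 0) {
    const task = queue.shift();
    if (task === undefined) break;

    order.push(task.vm);
    const kind = run(task);
    if (kind === undefined) continue;

    // Every error is recorded, even if a later retry succeeds.
    failures.push({ vm: task.vm, attempt: task.attempt, kind });

    if (RETRYABLE.has(kind) && task.attempt < maxRetries) {
      // Retries go to the end of the queue, after all normal tasks.
      queue.push({ vm: task.vm, attempt: task.attempt + 1 });
    }
  }
  return { order, failures };
}
```

The `failures` list is what a report or notification would be built from, so an error stays visible regardless of how the retry ends.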
- VDI chain protection could work if the coalesce finishes just after the first attempt. There is no way to guess when it will be done, BTW
- I'm not sure I get this one. You mean that, despite a retry, we should hide the initial error, is that correct? If so, that makes sense indeed
- Good point, we'll see how to manage this
> I'm not sure I get this one. You mean that, despite a retry, we should hide the initial error, is that correct? If so, that makes sense indeed
A backup is intended to succeed on the first attempt and should be treated as such. If a backup fails initially but succeeds on a retry, the first error should still be reported, as it might indicate deeper problems.
I have no idea what to do with this?
- should we retry for all errors?
- should we retry the whole VM backup or only the failed step?
- should we wait before retrying?
- should we stop retrying if the job is too old?
@julien-f I think this issue predates the "retry all failed jobs" functionality which is nowadays present in XOA. Unless you have a clear reason against it I would be inclined to close this issue.
Closing, we'll reopen if necessary :slightly_smiling_face:
Is it still relevant? I know we do some retry somewhere, but IDK if it's exactly what we meant originally?
It would be nice to have an automatic retry when a VM fails to back up. A variety of issues could resolve themselves after some time. Based on our own experience, I would like to see an automatic retry of a VM when:
- A coalesce process is still running (it might be caused by a manually removed snapshot that is still being processed at the start of the backup schedule)
- A merge worker is still busy on its files
- A timeout to xapi has occurred during the backup window
This would automatically fix 99% of the job failures (if we see them at all). I would expect the job to be successful after any of the above occurrences, but I do need to know which errors were automatically resolved (e.g. I expect to see details in the backup report).
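As a rough illustration of the report I have in mind, the sketch below folds a flat list of attempts into one entry per VM and keeps the errors met along the way, even when a retry succeeds. `Attempt`, `VmReport` and `buildReport` are hypothetical names, not the existing backup report format.

```typescript
// Hypothetical shapes, for illustration only.
interface Attempt {
  vm: string;
  error?: string; // absent when the attempt succeeded
}

interface VmReport {
  vm: string;
  status: "success" | "failure"; // outcome of the latest attempt
  attempts: number;
  errors: string[]; // every error seen, kept even after a successful retry
}

function buildReport(attempts: Attempt[]): VmReport[] {
  const reports = new Map<string, VmReport>();
  for (const { vm, error } of attempts) {
    const report: VmReport =
      reports.get(vm) ?? { vm, status: "failure", attempts: 0, errors: [] };
    report.attempts += 1;
    if (error === undefined) {
      report.status = "success";
    } else {
      report.status = "failure";
      report.errors.push(error);
    }
    reports.set(vm, report);
  }
  return Array.from(reports.values());
}
```

A VM with `status: "success"` and a non-empty `errors` list is one that was recovered by a retry, which is exactly the case I want to see in the report.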