velero
velero copied to clipboard
error information enhancement for backup/restore
Describe the problem/challenge you have
Through velero cli or CR, there is high level error message, such as PartiallyFailed
when a backup or a restore fails.
However for end users, it's still difficult to understand the actual reasons.
For example, it may be caused by some of node network restriction to access S3 or snapshot problem for some PVC backup.
Describe the solution you'd like
It's better to have some more concrete error messages which user can describe it from cli or CRs for next action.
Anything else you would like to add:
NA
Environment:
- Velero version (use
velero version
): - Kubernetes version (use
kubectl version
): - Kubernetes installer & version:
- Cloud provider or hardware configuration:
- OS (e.g. from
/etc/os-release
):
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.
- :+1: for "The project would be better with this feature added"
- :-1: for "This feature will not enhance the project in a meaningful way"
Thanks for your information.
The actual failure reasons can be found in logs. So a possible solution could be to extract error message from logs for now.
But It is convinent to acquire all the information from one place. So I think it's a good idea to include the specific failure reason in backup/restore CR. To this end, The error message need to be collected and recorded together with backup/restore phase at the same time. In current code, some error messages are scattered so it might takes some time to organize them. Besides, there should be a field in backup/restore CR used for storing these error messages.
@allenxu404 @jerry-jibu What you're looking for is already there. velero describe restore <restore-name>
will list every error and warning. We don't add the full list with messages to the CR itself since there are kubernetes limits on how much we can add to status, and a large backup with a general failure that affects every resource would be too much. The per-backup/per-restore error list is stored in the BSL, and the describe command pulls it in and prints it out with the velero CR informaton.
I agree that we don't want to put the errors directly into CR due to the concern over size, and storing error list in the BSL requires some design.
We'll move this out of v1.10 scope and put it in the backlog.
@reasonerjt We already have the errors in the BSL -- velero describe
lists the errors and warnings.
@sseago I go through the code quickly and find seems only restore stores the error list in the BSL, backup doesn't. @allenxu404 Could you double-check this?
@sseago I go through the code quickly and find seems only restore stores the error list in the BSL, backup doesn't. @allenxu404 Could you double-check this?
Yes, the same as you metioned. I only find errors and warnings which cause partially fail are stored in restore but not in backup.
@ywk253100 Ahh, you are correct. I'm remembering this better now. At one point in the past we had an inconsistency where for backups, errors and warnings appeared in velero backup logs
but not in velero backup describe
, while for restores, errors and warnings appeared in velero restore describe
but not in velero restore logs
. I put a fix in to resolve the problem on the restore side -- adding the warnings/errors to logs so that they appeared in both logs and describe output, but the corresponding fix is still needed on the backup side. So yes, it looks like this issue can be resolved by adding the errors and warnings to the describe output.
Part of the challenge is that the design of backup and restore are different. Restore keeps an explicit error list throughout the process, so we can identify warnings and errors with namespaces, cluster-wide, etc. For backup, the various functions called do not return a structured warnings/errors list. The way velero determines the number of warnings/errors is to look at the number of warning/error messages sent to the backup log.
So I think there's a relatively easy fix that we should probably do first, and a more involved fix that we could consider in the future:
- The easy fix: when we get the error/warning messages from the log to update the Backup CR with the number of warnings/errors, we can take those messages and put them in the BSL like we do for restore, except we won't have them called out by namespace/etc. right now -- it will just be a flat list.
- The longer-term fix for the future: refactor the functions in the backup controller call stack to add explicit warnings/errors and return them, as we do for restore -- i.e. everywhere we have "log.error" or "log.warning", we also include "errs.add" and "warnings.add" -- and use a similar structured err/warning list to what's done for restore.
So I think there's a relatively easy fix that we should probably do first, and a more involved fix that we could consider in the future:
1. The easy fix: when we get the error/warning messages from the log to update the Backup CR with the number of warnings/errors, we can take those messages and put them in the BSL like we do for restore, except we won't have them called out by namespace/etc. right now -- it will just be a flat list. 2. The longer-term fix for the future: refactor the functions in the backup controller call stack to add explicit warnings/errors and return them, as we do for restore -- i.e. everywhere we have "log.error" or "log.warning", we also include "errs.add" and "warnings.add" -- and use a similar structured err/warning list to what's done for restore.
@sseago This is exactly what we were thinking!
Thanks @sseago @allenxu404 . Yes, I think the short term easy fix
will be very helpful for users to do self debug even without deep velero experience :)