velero icon indicating copy to clipboard operation
velero copied to clipboard

error information enhancement for backup/restore

Open jerry-jibu opened this issue 2 years ago • 9 comments

Describe the problem/challenge you have

Through velero cli or CR, there is high level error message, such as PartiallyFailed when a backup or a restore fails. However for end users, it's still difficult to understand the actual reasons.

For example, it may be caused by some of node network restriction to access S3 or snapshot problem for some PVC backup.

Describe the solution you'd like

It's better to have some more concrete error messages which user can describe it from cli or CRs for next action.

Anything else you would like to add:

NA

Environment:

  • Velero version (use velero version):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • :+1: for "The project would be better with this feature added"
  • :-1: for "This feature will not enhance the project in a meaningful way"

jerry-jibu avatar Aug 11 '22 05:08 jerry-jibu

Thanks for your information.

The actual failure reasons can be found in logs. So a possible solution could be to extract error message from logs for now.

But It is convinent to acquire all the information from one place. So I think it's a good idea to include the specific failure reason in backup/restore CR. To this end, The error message need to be collected and recorded together with backup/restore phase at the same time. In current code, some error messages are scattered so it might takes some time to organize them. Besides, there should be a field in backup/restore CR used for storing these error messages.

allenxu404 avatar Sep 02 '22 12:09 allenxu404

@allenxu404 @jerry-jibu What you're looking for is already there. velero describe restore <restore-name> will list every error and warning. We don't add the full list with messages to the CR itself since there are kubernetes limits on how much we can add to status, and a large backup with a general failure that affects every resource would be too much. The per-backup/per-restore error list is stored in the BSL, and the describe command pulls it in and prints it out with the velero CR informaton.

sseago avatar Sep 02 '22 13:09 sseago

I agree that we don't want to put the errors directly into CR due to the concern over size, and storing error list in the BSL requires some design.

We'll move this out of v1.10 scope and put it in the backlog.

reasonerjt avatar Sep 07 '22 08:09 reasonerjt

@reasonerjt We already have the errors in the BSL -- velero describe lists the errors and warnings.

sseago avatar Sep 07 '22 13:09 sseago

@sseago I go through the code quickly and find seems only restore stores the error list in the BSL, backup doesn't. @allenxu404 Could you double-check this?

ywk253100 avatar Sep 08 '22 01:09 ywk253100

@sseago I go through the code quickly and find seems only restore stores the error list in the BSL, backup doesn't. @allenxu404 Could you double-check this?

Yes, the same as you metioned. I only find errors and warnings which cause partially fail are stored in restore but not in backup.

allenxu404 avatar Sep 08 '22 01:09 allenxu404

@ywk253100 Ahh, you are correct. I'm remembering this better now. At one point in the past we had an inconsistency where for backups, errors and warnings appeared in velero backup logs but not in velero backup describe, while for restores, errors and warnings appeared in velero restore describe but not in velero restore logs. I put a fix in to resolve the problem on the restore side -- adding the warnings/errors to logs so that they appeared in both logs and describe output, but the corresponding fix is still needed on the backup side. So yes, it looks like this issue can be resolved by adding the errors and warnings to the describe output.

Part of the challenge is that the design of backup and restore are different. Restore keeps an explicit error list throughout the process, so we can identify warnings and errors with namespaces, cluster-wide, etc. For backup, the various functions called do not return a structured warnings/errors list. The way velero determines the number of warnings/errors is to look at the number of warning/error messages sent to the backup log.

So I think there's a relatively easy fix that we should probably do first, and a more involved fix that we could consider in the future:

  1. The easy fix: when we get the error/warning messages from the log to update the Backup CR with the number of warnings/errors, we can take those messages and put them in the BSL like we do for restore, except we won't have them called out by namespace/etc. right now -- it will just be a flat list.
  2. The longer-term fix for the future: refactor the functions in the backup controller call stack to add explicit warnings/errors and return them, as we do for restore -- i.e. everywhere we have "log.error" or "log.warning", we also include "errs.add" and "warnings.add" -- and use a similar structured err/warning list to what's done for restore.

sseago avatar Sep 08 '22 13:09 sseago

So I think there's a relatively easy fix that we should probably do first, and a more involved fix that we could consider in the future:

1. The easy fix: when we get the error/warning messages from the log to update the Backup CR with the number of warnings/errors, we can take those messages and put them in the BSL like we do for restore, except we won't have them called out by namespace/etc. right now -- it will just be a flat list.

2. The longer-term fix for the future: refactor the functions in the backup controller call stack to add explicit warnings/errors and return them, as we do for restore -- i.e. everywhere we have "log.error" or "log.warning", we also include "errs.add" and "warnings.add" -- and use a similar structured err/warning list to what's done for restore.

@sseago This is exactly what we were thinking!

allenxu404 avatar Sep 09 '22 02:09 allenxu404

Thanks @sseago @allenxu404 . Yes, I think the short term easy fix will be very helpful for users to do self debug even without deep velero experience :)

jerry-jibu avatar Sep 18 '22 09:09 jerry-jibu