Backups of pods with emptyDir volumes get stuck on an OpenShift cluster when using Velero restic
What steps did you take and what happened: I followed the official Velero (v1.5.2) documentation. Below is the command used to install Velero on OpenShift:
velero install --provider aws --bucket
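For reference, the full form of that command (the bucket name is redacted above) would look roughly like the following; the plugin version, bucket, region, and credentials file here are placeholders, not the values actually used:

# illustrative only; --use-restic deploys the restic DaemonSet used for pod volume backups
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.1.0 \
  --bucket <bucket-name> \
  --backup-location-config region=<aws-region> \
  --snapshot-location-config region=<aws-region> \
  --secret-file ./credentials-velero \
  --use-restic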
Then I applied a patch to the restic DaemonSet:
oc adm policy add-scc-to-user privileged -z velero -n velero
oc patch ds/restic --namespace velero --type json -p '[{"op":"add","path":"/spec/template/spec/containers/0/securityContext","value": { "privileged": true}}]'
I followed https://velero.io/docs/main/restic/.
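(Those docs' default opt-in approach annotates each pod whose volumes restic should back up; a minimal sketch, with the namespace, pod, and volume names as placeholders:)

# opt a pod's volumes into restic backup (names below are placeholders)
kubectl -n <namespace> annotate pod/<pod-name> backup.velero.io/backup-volumes=<volume-1>,<volume-2>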
OpenShift is deployed on AWS.
OpenShift version: Client Version: 4.7.19, Server Version: 4.7.19
The backup is stuck InProgress: 11 items were backed up, and then it stops progressing.
velero backup describe openshift-full-cluster-backup
Name:         openshift-full-cluster-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.20.0+87cc9a4
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=20

Phase:  InProgress

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  velero

Resources:
  Included:        *
  Excluded:
  Cluster-scoped:  auto

Label selector:

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:

Backup Format Version:  1.1.0

Started:    2022-07-08 14:09:59 +0000 UTC
Completed:  <n/a>

Expiration:  2022-08-07 14:09:59 +0000 UTC

Estimated total items to be backed up:  5237
Items backed up so far:                 11

Resource List:  <error getting backup resource list: timed out waiting for download URL>

Velero-Native Snapshots:

Restic Backups:
  Completed:
    openshift-adp/openshift-adp-controller-manager-56dc9468b7-nqcgh: bound-sa-token
  New:
    openshift-cloud-credential-operator/pod-identity-webhook-7fdfd9b5d8-5j6qv: webhook-certs
In the OpenShift cluster I have 3 worker nodes:
NAME                      READY   STATUS    RESTARTS   AGE
restic-ddcbk              1/1     Running   0          7m17s
restic-kcxgt              1/1     Running   0          7m15s
restic-rmd7m              1/1     Running   0          7m14s
velero-57fbc78b8c-f5gbn   1/1     Running   0          7m44s
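(Which node each of these pods landed on can be checked with the wide output; this is an illustrative command, not part of the original report, but it becomes relevant later in the thread:)

# the NODE column shows where each restic pod (and each workload pod) is scheduled
kubectl -n velero get pods -o wide
kubectl -n openshift-cloud-credential-operator get pods -o wide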
Logs from the Velero pod:
time="2022-07-08T14:10:25Z" level=info msg="Processing item" backup=velero/openshift-full-cluster-backup logSource="pkg/backup/backup.go:378" name=pod-identity-webhook-7fdfd9b5d8-5j6qv namespace=openshift-cloud-credential-operator progress= resource=pods time="2022-07-08T14:10:25Z" level=info msg="Backing up item" backup=velero/openshift-full-cluster-backup logSource="pkg/backup/item_backupper.go:121" name=pod-identity-webhook-7fdfd9b5d8-5j6qv namespace=openshift-cloud-credential-operator resource=pods time="2022-07-08T14:10:25Z" level=info msg="Executing custom action" backup=velero/openshift-full-cluster-backup logSource="pkg/backup/item_backupper.go:327" name=pod-identity-webhook-7fdfd9b5d8-5j6qv namespace=openshift-cloud-credential-operator resource=pods time="2022-07-08T14:10:25Z" level=info msg="Executing podAction" backup=velero/openshift-full-cluster-backup cmd=/velero logSource="pkg/backup/pod_action.go:51" pluginName=velero time="2022-07-08T14:10:25Z" level=info msg="Done executing podAction" backup=velero/openshift-full-cluster-backup cmd=/velero logSource="pkg/backup/pod_action.go:77" pluginName=velero time="2022-07-08T14:10:25Z" level=info msg="Initializing restic repository" controller=restic-repository logSource="pkg/controller/restic_repository_controller.go:158" name=openshift-cloud-credential-operator-default-76lrp namespace=velero time="2022-07-08T14:10:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:10:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:10:29Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120" time="2022-07-08T14:11:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:11:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:11:29Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120" time="2022-07-08T14:12:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:12:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:12:29Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120" I0708 14:13:28.752015 1 request.go:621] 
Throttling request took 1.044197433s, request: GET:https://145.32.0.5:443/apis/console.openshift.io/v1alpha1?timeout=32s time="2022-07-08T14:13:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:13:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:13:29Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120" time="2022-07-08T14:14:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58"
What did you expect to happen: All backups should complete successfully.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
No
If you are using earlier versions:
Version 1.5
-
velero backup describe <backupname>
velero backup describe openshift-full-cluster-backup
(Same velero backup describe output as shown above.)
-
velero backup logs <backupname>
- The backup is still in progress, so the logs are not available yet.
-
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
- No restore was performed.
-
velero restore logs <restorename>
No restore was performed.
Anything else you would like to add: The backups of a few namespaces completed, but the backups of the other namespaces are stuck.
Environment:
- Velero version (use velero version): Client v1.5.2, Server v1.5.2
- Velero features (use velero client config get features):
- Kubernetes version (use kubectl version): 1.20
- Kubernetes installer & version:
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release):
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.
- :+1: for "I would like to see this bug fixed as soon as possible"
- :-1: for "There are more important bugs to focus on right now"
#restic #openshift
How long was the backup running? From the logs it looks like the backup was waiting for the BackupStorageLocation to be valid. Maybe an issue with your s3 bucket?
In any case, 1.5 is an old version of Velero. 1.9 was just released. It's probably better to try again with a newer version and see if you're still having problems.
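(A rough sketch of such an upgrade, assuming the standard in-place image bump; the v1.9.0 tag and the restic container name are assumptions, so check the upgrade docs for your exact versions:)

# refresh the CRDs using a v1.9 client, then bump the server and restic images
velero install --crds-only --dry-run -o yaml | kubectl apply -f -
kubectl -n velero set image deployment/velero velero=velero/velero:v1.9.0
kubectl -n velero set image daemonset/restic restic=velero/velero:v1.9.0   # container name assumed to be "restic"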
@sseago I have checked the OpenShift namespaces: most of them back up fine. Only the namespaces whose pods have an emptyDir volume get stuck, and those backups never finish; the status stays InProgress.
I have configured the S3 bucket with --bucket
velero get backup-location -n velero
NAME PROVIDER BUCKET/PREFIX PHASE LAST VALIDATED ACCESS MODE
default aws
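(To confirm the location is actually marked Available rather than just listed, something like this should work; illustrative only:)

# print the phase of the default BackupStorageLocation (expected: Available)
kubectl -n velero get backupstoragelocation default -o jsonpath='{.status.phase}'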
Below is the output where you can see an OpenShift-namespace backup completed successfully. (If the storage location were the issue, every backup would fail; here only the namespaces whose pods have emptyDir volumes get stuck.)
Namespaces:
Included: openshift-sdn, openshift-service-ca, openshift-vsphere-infra, openshift-kube-apiserver, openshift-etcd
Excluded:
Resources:
Included: *
Excluded:
Label selector:
Storage Location: default
Velero-Native Snapshot PVs: auto
TTL: 720h0m0s
Hooks:
Backup Format Version: 1.1.0
Started:    2022-07-12 10:11:12 +0000 UTC
Completed:  2022-07-12 10:12:07 +0000 UTC
Expiration: 2022-08-11 10:11:12 +0000 UTC
Total items to be backed up: 425
Items backed up: 425
Velero-Native Snapshots:
Below is the backup that is stuck:
Estimated total items to be backed up:  5237
Items backed up so far:                 11
Resource List: <error getting backup resource list: timed out waiting for download URL>
Velero-Native Snapshots:
Restic Backups:
  Completed:
    openshift-adp/openshift-adp-controller-manager-56dc9468b7-jkdhfjksdjd: bound-sa-token
  New:
    openshift-cloud-credential-operator/pod-identity-webhook-hjyfgttttddd: webhook-certs
This webhook-certs volume is of type emptyDir.
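(For reference, the volume type can be confirmed directly on the pod spec; the pod name below is a placeholder:)

# print the webhook-certs volume definition; an emptyDir volume shows up as {"emptyDir":{},"name":"webhook-certs"}
kubectl -n openshift-cloud-credential-operator get pod <pod-identity-webhook-pod> \
  -o jsonpath='{.spec.volumes[?(@.name=="webhook-certs")]}'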
I have also upgraded to Velero 1.9 with AWS plugin 1.5 and the issue is the same: the Velero pod showed no errors until it reached the same emptyDir volume, at which point it started showing the error below:
level=error msg="Error updating download request" controller=download-request downloadRequest=velero/backup-1-6fd8a471-1235-494g-237f-6dd312267829 error="downloadrequests.velero.io "backup-1-6fd8a471-1235-494g-237f-6dd312267829" not found"
I'm not really sure what that downloadrequest is referring to. Is that the name of your backup that it seems like it can't find? Also, you mentioned that the backup was stuck, but above I'm seeing a start and completion timestamp on the backup, so I'm not really sure what's going on. Restic should support emptydir backup, though. In any case, it looks like there's one restic volume that completed, and one is still in a New state. Look at the restic pod logs for the restic pod that's on the same node that the pod mounting the volume is on -- it could be that the pod is unhealthy. Also look at the PodVolumeBackup for that pod and volume.
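For example, something along these lines should list and describe the PodVolumeBackups for this backup; the velero.io/backup-name label is the one Velero normally applies, but verify it on your cluster:

# list pod volume backups created for this backup, then inspect the one stuck in New
kubectl -n velero get podvolumebackups -l velero.io/backup-name=openshift-full-cluster-backup
kubectl -n velero describe podvolumebackup <name-from-the-list-above>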
@sseago Yes, as I mentioned, a few namespaces' backups completed successfully, but whichever pod has an emptyDir volume, the backup gets stuck there.
I have shared both: a backup that completed with a few namespaces included, and the full backup, which gets stuck on the pod with the emptyDir volume.
And I can see the pod status is Running.
Were any of the pods that succeeded restic backups running on the same node as the failing pod? The PVB is in a "new" state still, which seems to indicate that Restic isn't even trying to back it up. If you look at the pod logs for the restic pod that should have processed the PVB, maybe there is some indication of what went wrong.
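A rough way to find that restic pod (assuming the DaemonSet pods carry the usual name=restic label):

# find the node the stuck pod runs on, then the restic pod on that same node
NODE=$(kubectl -n openshift-cloud-credential-operator get pod <stuck-pod-name> -o jsonpath='{.spec.nodeName}')
kubectl -n velero get pods -l name=restic -o wide --field-selector spec.nodeName=$NODE
kubectl -n velero logs <restic-pod-from-the-previous-output>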
Hi @sseago, yes, the pods from both the succeeded and the stuck backups are running on the same nodes.
The logs of the pod whose backup is stuck are clean; I don't see any errors:
ubuntu@:~$ kubectl logs -f pod-identity-webhook-7fdfd9b5d8-t6s5q -n openshift-cloud-credential-operator
W0714 04:14:02.115927 1 client_config.go:551] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0714 04:14:02.129849 1 store.go:61] Fetched secret: openshift-cloud-credential-operator/pod-identity-webhook
I0714 04:14:02.130220 1 main.go:174] Creating server
I0714 04:14:02.130331 1 main.go:194] Listening on :9999 for metrics and healthz
I0714 04:14:02.130446 1 main.go:188] Listening on :6443
These are the logs from the Velero 1.9 pod; once the backup gets stuck, it starts showing the entries below:
time="2022-07-14T05:40:05Z" level=debug msg="waiting for stdio data" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio time="2022-07-14T05:40:05Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130" time="2022-07-14T05:40:05Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115" time="2022-07-14T05:40:05Z" level=debug msg="received EOF, stopping recv loop" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio time="2022-07-14T05:40:05Z" level=debug msg="plugin process exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws pid=1018 time="2022-07-14T05:40:05Z" level=debug msg="plugin exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75"
The same setup works fine on Azure, AWS, and GCP; I am only facing this issue with the OpenShift cluster.
@Ankita5892 Could you also supply the restic pod logs for the pod that failed backup? You need to find the restic pod that's running on the same node as the pod with failed pod volume backups. It's not clear from the above whether there were successful and failed volumes on the same node. It might be worth also looking at the restic pod logs for the restic pod on the same node as a successful volume backup, if it's not the same restic pod.
@sseago It is an OpenShift cluster, so we have 3 master and 3 worker nodes. Many pods run on both the worker and the master nodes, but there are only 3 restic pods, and they all run on the worker nodes.
And the pod whose backup is stuck (InProgress) is running on a master node.
OK. So yes, this is starting to make some more sense now. DaemonSets can't be scheduled on master nodes by default. In the OADP context, this is not normally a problem since the default openshift operators that run on master nodes are considered part of the control plane rather than user workloads, which is out-of-scope of the supported OADP use cases. You might be able to succeed in a restic backup of master-node volumes by modifying node taints or using a custom node selector to force the restic DaemonSet onto master nodes, but it's not a scenario that we've tested, and there's no guarantee you won't hit other problems when doing this.
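(If you do want to experiment with that despite the caveats above, adding a toleration for the master taint to the DaemonSet would look roughly like this; untested for this scenario, and the taint key may differ on your cluster:)

# allow the restic DaemonSet to schedule onto master nodes (untested for this use case)
oc -n velero patch ds/restic --type json -p '[{"op":"add","path":"/spec/template/spec/tolerations","value":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]}]'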
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the stale issue.