GCP: Backup marked completed, even though snapshots are not ready yet
What steps did you take and what happened: In our automation, I create backups on one cluster and then restore them on another within the same GCP project. To create the backup, I use:
velero backup create mybackup --storage-location gcp-s3 --include-namespaces mynamespace,myothernamespace --include-resources persistentvolumes --include-cluster-resources --snapshot-volumes --ttl 24h --wait
After some time, the backup completes. However, when I check GCP -> Compute -> Snapshots, I see that not all snapshots are ready yet. Restoring this 'completed' backup produces error messages, so in my automation I included a step that checks all snapshots from the backup and only continues when every snapshot is marked as ready.
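For illustration, that readiness check can be sketched with the google.golang.org/api/compute/v1 Go client. This is a minimal sketch, not my exact automation; the project and snapshot name are placeholders taken from the error below:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	compute "google.golang.org/api/compute/v1"
)

// waitForSnapshotReady polls the GCP Compute API until the named snapshot
// leaves the CREATING/UPLOADING states and reports READY.
func waitForSnapshotReady(ctx context.Context, svc *compute.Service, project, name string) error {
	for {
		snap, err := svc.Snapshots.Get(project, name).Do()
		if err != nil {
			return err
		}
		if snap.Status == "READY" {
			return nil
		}
		log.Printf("snapshot %s is %s; waiting...", name, snap.Status)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Second):
		}
	}
}

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx) // uses Application Default Credentials
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder project and snapshot name.
	if err := waitForSnapshotReady(ctx, svc, "my-gcp-project", "cluster-pvc-ab9b674c-601f-411e-b565-4b244f1c9ce1"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("snapshot is ready")
}
```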
What did you expect to happen:
When I pass --wait to the backup command, I expect the backup to be usable and all involved steps to be done before it reports a 'Completed' status.
The output of the following commands will help us better understand what's going on:
velero restore describe myrestore --details

Name:         myrestore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs myrestore' for more information)

Errors:
  Velero:      <none>
  Cluster:     error executing PVAction for persistentvolumes/pvc-442c6854-971d-4c0b-acff-ca4e840ccf0e: rpc error: code = Unknown desc = googleapi: Error 400: The resource 'projects/my-gcp-project/global/snapshots/cluster-pvc-ab9b674c-601f-411e-b565-4b244f1c9ce1' is not ready, resourceNotReady
  Namespaces:  <none>

Backup:  mybackup

Namespaces:
  Included:  mynamespace, myothernamespace
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto
Environment:
- Velero version (use velero version):
  Client:
    Version: v1.2.0
    Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
  Server:
    Version: v1.2.0
- Velero features (use velero client config get features): None
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.7-gke.23", GitCommit:"06e05fd0390a51ea009245a90363f9161b6f2389", GitTreeState:"clean", BuildDate:"2020-01-17T23:10:45Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes installer & version: GKE
- Cloud provider or hardware configuration: GKE
- OS (e.g. from /etc/os-release): Ubuntu 18.04
@boxcee yeah we're aware of this issue - see https://github.com/vmware-tanzu/velero/issues/1799 for another report (on AWS, but same issue).
Ideally, we'd model this as an additional phase on the backup to indicate that the snapshots have been created but are not yet ready.
Why not flag it as 'BackupInProgress'?
I think it'd be useful to differentiate between "we're actively scraping the API to create this backup" vs. "we're waiting for the storage system to finish moving the snapshot data to durable storage" - more clear for users as to what's going on, and also likely makes some things easier on the back end (e.g. not blocking the backup controller queue if we're just waiting for snapshots to be ready).
But, all of this probably needs some more thought and design. If you're interested in working on this, we're happy to provide feedback.
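For illustration only, the split described above could look roughly like this in the Backup API types. BackupPhaseWaitingForSnapshots is a hypothetical name, not part of Velero's API; InProgress and Completed are Velero's existing phases:

```go
// Hypothetical sketch of what the discussion above could look like in
// Velero's Backup API types (pkg/apis/velero/v1). Only the
// WaitingForSnapshots value is new, and its name is illustrative.
package v1

type BackupPhase string

const (
	// Existing phase: Velero is actively reading resources from the API
	// server and writing them to object storage.
	BackupPhaseInProgress BackupPhase = "InProgress"

	// Hypothetical new phase: all resources are backed up and snapshot
	// creation has been requested, but the storage provider has not yet
	// marked every snapshot as ready.
	BackupPhaseWaitingForSnapshots BackupPhase = "WaitingForSnapshots"

	// Existing phase: the backup, including snapshots, is fully done and
	// safe to restore from; --wait would return only at this point.
	BackupPhaseCompleted BackupPhase = "Completed"
)
```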
Hm, my concern is specifically with the --wait flag. From a user's point of view, I expect the backup to be finished when I explicitly add --wait, but it isn't: the backup is not really 'Completed', only parts of it are.
Anyway, I am interested in finding a solution for this and will spend some time looking into it.
Ah, another thing I experienced, which is indirectly related:
When the storage quota is reached while a backup and its snapshots are being created, the snapshots fail silently.
I am trying to create a setup to test this further.
> When the storage quota is reached while a backup and its snapshots are being created, the snapshots fail silently.
Possibly related to #2212 or #2255?
@skriss btw regarding https://github.com/vmware-tanzu/velero/issues/1799 -- while we initially found it on AWS, the same problem showed up on Azure and GCP.
@skriss I know we'd discussed options around doing this before, but we ended up fixing it in our fork (temporarily). I'm happy to remove our local commits if an upstream solution is found. Here's what we put in place: https://github.com/konveyor/velero-plugin-for-aws/pull/2 https://github.com/konveyor/velero-plugin-for-gcp/pull/2 https://github.com/konveyor/velero-plugin-for-microsoft-azure/pull/2
Nice @sseago! Unfortunately, we don't want to maintain our own fork, so we want to place the code outside of Velero until Velero supports this.
@skriss do you know how we could get the Volume Snapshot ID to check in GKE without hooking into Velero directly?
If we run:
backup, err := b.vc.VeleroV1().Backups(namespace).Get(backupName, metav1.GetOptions{})
I think we only get the Backup object, and I don't think its spec contains the volume snapshot IDs for GKE. Can you point me in the right direction with the API to get the IDs so we can monitor them ourselves before running a restore?
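From poking around (this is my assumption, not confirmed): Velero seems to persist per-backup snapshot metadata to the backup storage location as <backup-name>-volumesnapshots.json.gz, where each entry carries the provider snapshot ID. Would decoding that file, roughly like the sketch below, be the right direction? The file path and the github.com/vmware-tanzu/velero/pkg/volume types are what I'm assuming here:

```go
package main

import (
	"compress/gzip"
	"encoding/json"
	"fmt"
	"log"
	"os"

	"github.com/vmware-tanzu/velero/pkg/volume"
)

func main() {
	// Assumes mybackup-volumesnapshots.json.gz was already downloaded from
	// the backup storage location (under backups/mybackup/ in the bucket).
	f, err := os.Open("mybackup-volumesnapshots.json.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	var snapshots []*volume.Snapshot
	if err := json.NewDecoder(gz).Decode(&snapshots); err != nil {
		log.Fatal(err)
	}

	// Print the cloud-provider snapshot ID for each PV in the backup;
	// these should be the names to check for READY status in GCP.
	for _, s := range snapshots {
		fmt.Printf("%s -> %s\n", s.Spec.PersistentVolumeName, s.Status.ProviderSnapshotID)
	}
}
```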
Should be covered by #3533.
Any update on this, now that 1.7.0 has been released?
@son-la Just released yesterday! https://github.com/vmware-tanzu/velero/releases/tag/v1.7.0
This issue depends on the upload progress monitoring work, which is targeting v1.8. Updating the milestone to v1.8.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is this to be scheduled for 1.10?
Hello, not sure if it's related, but I noticed something similar with GCP and the snapshot quotas. By default, the quota is 25k snapshots. If you hit this quota, Velero will continue to back up and snapshot disks, but nothing happens on the GCP side. Even worse, all the snapshot IDs that are present in the velero describe command are wrong and don't exist on GCP.
@Pea-owkin, #6438 sounds different from this issue. There, the snapshot operation should fail, while in this issue we are talking about the snapshot operation taking time to complete (based on my quick read).