
GCP: Backup marked completed, even though snapshots are not ready yet

boxcee opened this issue 5 years ago • 16 comments

What steps did you take and what happened: In our automation, I create backups on one cluster and then restore them on another within the same GCP project. To create the backup, I use:

velero backup create mybackup --storage-location gcp-s3 --include-namespaces mynamespace,myothernamespace --include-resources persistentvolumes --include-cluster-resources --snapshot-volumes --ttl 24h --wait

After some time, the backup completes. However, when I check GCP -> Compute -> Snapshots, I see that not all snapshots are ready yet. Restoring this 'completed' backup then produces error messages, so in my automation I added a step that checks every snapshot from the backup and only continues once they are all marked as ready.
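For reference, a rough sketch of the readiness check I added to our automation, using the google.golang.org/api/compute/v1 Go client directly (the project ID and snapshot names below are placeholders; the real snapshot names are the ones the backup references in GCP):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	compute "google.golang.org/api/compute/v1"
)

// waitForSnapshots polls each GCE snapshot until it reports READY,
// and fails fast if any snapshot reports FAILED.
func waitForSnapshots(ctx context.Context, project string, names []string) error {
	svc, err := compute.NewService(ctx) // uses application default credentials
	if err != nil {
		return err
	}
	for _, name := range names {
		for {
			snap, err := svc.Snapshots.Get(project, name).Context(ctx).Do()
			if err != nil {
				return fmt.Errorf("looking up snapshot %s: %w", name, err)
			}
			// GCE reports CREATING, UPLOADING, READY, FAILED or DELETING.
			if snap.Status == "READY" {
				break
			}
			if snap.Status == "FAILED" {
				return fmt.Errorf("snapshot %s failed", name)
			}
			time.Sleep(10 * time.Second)
		}
	}
	return nil
}

func main() {
	ctx := context.Background()
	snapshots := []string{"cluster-pvc-ab9b674c-601f-411e-b565-4b244f1c9ce1"}
	if err := waitForSnapshots(ctx, "my-gcp-project", snapshots); err != nil {
		log.Fatal(err)
	}
	fmt.Println("all snapshots ready, safe to restore")
}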

What did you expect to happen: When I pass --wait to the backup command, I expect the backup to be usable and all involved steps to be finished before Velero reports a 'completed' status.

The output of the following commands will help us better understand what's going on: velero restore describe myrestore --details

Name:         myrestore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs myrestore' for more information)

Errors:
  Velero:     <none>
  Cluster:  error executing PVAction for persistentvolumes/pvc-442c6854-971d-4c0b-acff-ca4e840ccf0e: rpc error: code = Unknown desc = googleapi: Error 400: The resource 'projects/my-gcp-project/global/snapshots/cluster-pvc-ab9b674c-601f-411e-b565-4b244f1c9ce1' is not ready, resourceNotReady
  Namespaces: <none>

Backup:  mybackup

Namespaces:
  Included:  mynamespace, myothernamespace
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Environment:

  • Velero version (use velero version):
Client:
	Version: v1.2.0
	Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
	Version: v1.2.0 
  • Velero features (use velero client config get features): None
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.7-gke.23", GitCommit:"06e05fd0390a51ea009245a90363f9161b6f2389", GitTreeState:"clean", BuildDate:"2020-01-17T23:10:45Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: GKE
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): Ubuntu 18.04

boxcee avatar Feb 21 '20 08:02 boxcee

@boxcee yeah we're aware of this issue - see https://github.com/vmware-tanzu/velero/issues/1799 for another report (on AWS, but same issue).

skriss avatar Feb 21 '20 16:02 skriss

Ideally we'd model this as an additional phase on the backup, to indicate that the snapshots have been created but are not yet ready.

skriss avatar Feb 21 '20 16:02 skriss

Why not flag it as 'BackupInProgress'?

boxcee avatar Feb 21 '20 17:02 boxcee

I think it'd be useful to differentiate between "we're actively scraping the API to create this backup" vs. "we're waiting for the storage system to finish moving the snapshot data to durable storage" - clearer for users as to what's going on, and it also likely makes some things easier on the back end (e.g. not blocking the backup controller queue if we're just waiting for snapshots to be ready).

But, all of this probably needs some more thought and design. If you're interested in working on this, we're happy to provide feedback.
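To make this concrete, something along these lines; the extra phase name is purely hypothetical and not part of the current Velero API:

// Purely illustrative sketch; today Velero only has phases like
// InProgress, Completed, PartiallyFailed and Failed.
type BackupPhase string

const (
	BackupPhaseInProgress          BackupPhase = "InProgress"          // actively scraping the API, creating snapshots
	BackupPhaseWaitingForSnapshots BackupPhase = "WaitingForSnapshots" // hypothetical: snapshots created, provider still uploading
	BackupPhaseCompleted           BackupPhase = "Completed"           // everything durable and restorable
)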

skriss avatar Feb 21 '20 18:02 skriss

Hm, I am only concerned when it comes to using the --wait flag. From a user's point of view, I expect the backup to be finished when I explicitly add the --wait flag, but it isn't. The backup is not really 'Completed', only parts of it are.

Anyway, I am interested in finding a solution for this. I will spend some time looking into it.

boxcee avatar Feb 24 '20 11:02 boxcee

Ah, another thing I experienced, which is indirectly related.

When the storage quotas are reached and a backup with snapshots is being created, the snapshots will silently fail.

I am trying to create a setup to test this further.

boxcee avatar Feb 24 '20 11:02 boxcee

When the storage quotas are reached and a backup with snapshots is being created, the snapshots will silently fail.

Possibly related to #2212 or #2255?

skriss avatar Feb 24 '20 15:02 skriss

@skriss btw regarding https://github.com/vmware-tanzu/velero/issues/1799 -- while we initially found it on AWS, the same problem showed up on Azure and GCP.

sseago avatar Mar 02 '20 16:03 sseago

@skriss I know we'd discussed options around doing this before, but we ended up fixing it in our fork (temporarily). I'm happy to remove our local commits if an upstream solution is found. Here's what we put in place: https://github.com/konveyor/velero-plugin-for-aws/pull/2 https://github.com/konveyor/velero-plugin-for-gcp/pull/2 https://github.com/konveyor/velero-plugin-for-microsoft-azure/pull/2

sseago avatar Mar 02 '20 16:03 sseago

Nice @sseago. Unfortunately we don't want to maintain our own fork, so we want to keep the code outside of Velero until Velero supports this.

@skriss do you know how we could get the Volume Snapshot ID to check in GKE without hooking into Velero directly?

If we run:

  backup, err := b.vc.VeleroV1().Backups(namespace).Get(backupName, metav1.GetOptions{}) // metav1 = k8s.io/apimachinery/pkg/apis/meta/v1

I think that only gives us the BackupSpec, which I don't think contains the volume snapshot ID in GKE. Can you point me in the right direction with the API to get the ID so we can monitor it ourselves before running a restore?
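For reference, a rough sketch of the direction I'm considering in the meantime: reading the volume snapshot list that Velero uploads next to the backup in the object store. I'm assuming the object lives at backups/<backup>/<backup>-volumesnapshots.json.gz (plus any prefix configured on the storage location) and that the JSON field is named providerSnapshotID; both come from a quick look at the Velero source, so treat them as assumptions rather than a documented API. Bucket and backup names are placeholders.

package main

import (
	"compress/gzip"
	"context"
	"encoding/json"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
)

// Minimal view of Velero's volume snapshot records; only the fields we need.
type snapshot struct {
	Spec struct {
		PersistentVolumeName string `json:"persistentVolumeName"`
	} `json:"spec"`
	Status struct {
		ProviderSnapshotID string `json:"providerSnapshotID"`
		Phase              string `json:"phase"`
	} `json:"status"`
}

// snapshotIDs downloads and decodes <backup>-volumesnapshots.json.gz from the
// backup's folder in the GCS bucket and returns the provider snapshot IDs.
func snapshotIDs(ctx context.Context, bucket, backup string) ([]string, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	defer client.Close()

	object := fmt.Sprintf("backups/%s/%s-volumesnapshots.json.gz", backup, backup)
	r, err := client.Bucket(bucket).Object(object).NewReader(ctx)
	if err != nil {
		return nil, err
	}
	defer r.Close()

	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	var snaps []snapshot
	if err := json.NewDecoder(gz).Decode(&snaps); err != nil {
		return nil, err
	}

	ids := make([]string, 0, len(snaps))
	for _, s := range snaps {
		ids = append(ids, s.Status.ProviderSnapshotID)
	}
	return ids, nil
}

func main() {
	ctx := context.Background()
	ids, err := snapshotIDs(ctx, "my-velero-bucket", "mybackup")
	if err != nil {
		log.Fatal(err)
	}
	// These IDs could then be polled against the GCE snapshots API for READY status.
	fmt.Println(ids)
}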

arianitu avatar May 05 '20 16:05 arianitu

Should be covered by #3533.

eleanor-millman avatar May 11 '21 20:05 eleanor-millman

Any update on this now that 1.7.0 has been released?

son-la avatar Sep 30 '21 15:09 son-la

@son-la Just released yesterday! https://github.com/vmware-tanzu/velero/releases/tag/v1.7.0

eleanor-millman avatar Oct 01 '21 14:10 eleanor-millman

This issue depends on the upload progress monitoring work, which is targeting v1.8. Updating the milestone to v1.8.

reasonerjt avatar Nov 03 '21 01:11 reasonerjt

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 16 '22 01:01 stale[bot]

Is this to be scheduled for 1.10?

svrc avatar Jul 28 '22 14:07 svrc

Hello, not sure if it's related, but I noticed something similar with GCP and the snapshot quotas. By default the quota is 25k snapshots. If you hit this quota, Velero will continue to back up and snapshot disks, but nothing will happen on the GCP side. Even worse, all the snapshot IDs that are present in the velero describe command are wrong and don't exist on GCP.

Pea-owkin avatar Jun 29 '23 14:06 Pea-owkin

@Pea-owkin, #6438 sounds different from this issue. There, the snapshot operation should fail outright, while in this issue we are talking about the snapshot operation taking time to complete (based on my quick read).

draghuram avatar Jun 29 '23 15:06 draghuram