
VolumeSnapshot fails with: resource name may not be empty

Open asoltesz opened this issue 5 years ago • 12 comments

Version: 0.9.0-rc.6

I'm trying to back up a standalone PVC with a simple BackupConfiguration, following the guide:

apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
    name: default
spec:
    schedule: "*/2 * * * *"
    driver: VolumeSnapshotter
    target:
        ref:
            apiVersion: v1
            kind: PersistentVolumeClaim
            name: pgadmin
        snapshotClassName: csi-rbdplugin-snapclass
    retentionPolicy:
        name: default
        keepDaily: 7
        keepWeekly: 4
        keepMonthly: 6
        prune: true

The backup fails with these operator logs:

I0527 21:44:05.350737       1 jobs.go:68] Sync/Add/Update for Job stash-backup-default-1590615840

I0527 21:44:05.393717       1 jobs.go:68] Sync/Add/Update for Job stash-backup-default-1590615840

I0527 21:44:06.744865       1 backup_session.go:104] Sync/Add/Update for BackupSession default-1590615846

I0527 21:44:06.773364       1 job.go:36] Creating Job pgadmin/stash-vs-pvc-pgadmin-1590615846.

W0527 21:44:06.809239       1 backup_session.go:544] failed to ensure backup job. Reason:  resource name may not be empty

W0527 21:44:06.895426       1 backup_session.go:99] BackupSession pgadmin/default-1590615726 does not exist anymore

E0527 21:44:06.895543       1 worker.go:92] Failed to process key pgadmin/default-1590615846. Reason: resource name may not be empty

I0527 21:44:06.895560       1 worker.go:96] Error syncing key pgadmin/default-1590615846: resource name may not be empty

I0527 21:44:06.895604       1 backup_session.go:104] Sync/Add/Update for BackupSession default-1590615846

I0527 21:44:06.895669       1 backup_session.go:112] Skipping processing BackupSession pgadmin/default-1590615846. Reason: phase is "Failed".

I0527 21:44:06.900750       1 backup_session.go:104] Sync/Add/Update for BackupSession default-1590615846

I0527 21:44:06.900793       1 backup_session.go:112] Skipping processing BackupSession pgadmin/default-1590615846. Reason: phase is "Failed".

I0527 21:44:07.585139       1 jobs.go:68] Sync/Add/Update for Job stash-backup-default-1590615840

I0527 21:44:07.585226       1 jobs.go:71] Deleting succeeded job stash-backup-default-1590615840

I0527 21:44:07.600267       1 jobs.go:82] Deleted stash job: stash-backup-default-1590615840

W0527 21:44:07.601029       1 jobs.go:64] Job pgadmin/stash-backup-default-1590615840 does not exist anymore

The Rook RBD snapshot class is present:

kubectl  get volumesnapshotclasses
NAME                      AGE
csi-rbdplugin-snapclass   6h12m

asoltesz avatar May 27 '20 21:05 asoltesz

Any idea when this will be investigated / fixed?

This seems like a serious issue for an rc.6 release.

asoltesz avatar May 29 '20 15:05 asoltesz

@asoltesz Can you please try the latest build from master?

helm install stash-operator appscode/stash \
  --version v0.9.0-rc.6 \
  --namespace kube-system \
  --set operator.registry=appscodeci \
  --set operator.tag=v0.9.0-rc.6-30-gae2d74fa_linux_amd64

hossainemruz avatar May 31 '20 03:05 hossainemruz

@hossainemruz I have tried with the version you recommended.

There are no error messages now in the operator log:

I0601 20:26:08.225724       1 jobs.go:69] Sync/Add/Update for Job stash-backup-default-1591043160

I0601 20:26:08.225751       1 jobs.go:72] Deleting succeeded job stash-backup-default-1591043160

I0601 20:26:08.254625       1 jobs.go:83] Deleted stash job: stash-backup-default-1591043160

W0601 20:26:08.254662       1 jobs.go:65] Job pgadmin/stash-backup-default-1591043160 does not exist anymore

I0601 20:26:14.404786       1 pvc.go:56] Sync/Add/Update for PersistentVolumeClaim pgadmin/pgadmin

I0601 20:26:21.567866       1 pvc.go:56] Sync/Add/Update for PersistentVolumeClaim pgadmin/pgadmin

However, the snapshot has not been created: "kubectl get volumesnapshot --all-namespaces" returns nothing.

The description of the BackupConfiguration

kubectl describe backupconfiguration default
Name:         default
Namespace:    pgadmin
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"stash.appscode.com/v1beta1","kind":"BackupConfiguration","metadata":{"annotations":{},"name":"default","namespace":"pgadmin...
API Version:  stash.appscode.com/v1beta1
Kind:         BackupConfiguration
Metadata:
  Creation Timestamp:  2020-06-01T20:11:58Z
  Finalizers:
    stash.appscode.com
  Generation:        2
  Resource Version:  16695
  Self Link:         /apis/stash.appscode.com/v1beta1/namespaces/pgadmin/backupconfigurations/default
  UID:               dddee1c9-d7d4-45d0-8906-8dbf481e54a2
Spec:
  Driver:  VolumeSnapshotter
  Paused:  false
  Retention Policy:
    Keep Daily:    7
    Keep Monthly:  6
    Keep Weekly:   4
    Name:          default
    Prune:         true
  Schedule:        */2 * * * *
  Target:
    Ref:
      API Version:        v1
      Kind:               PersistentVolumeClaim
      Name:               pgadmin
    Snapshot Class Name:  csi-rbdplugin-snapclass
Status:
  Conditions:
    Last Transition Time:  2020-06-01T20:11:58Z
    Message:               Backup target v1 persistentvolumeclaim/pgadmin found.
    Reason:                TargetAvailable
    Status:                True
    Type:                  BackupTargetFound
    Last Transition Time:  2020-06-01T20:11:58Z
    Message:               Successfully created backup triggering CronJob.
    Reason:                CronJobCreationSucceeded
    Status:                True
    Type:                  CronJobCreated
  Observed Generation:     2
Events:                    <none>

Note: I use Kubernetes 1.15.11 with the "snapshot.storage.k8s.io/v1alpha1" API.

There is now a v1beta1 API also. Can this be a problem? Which one does Stash target?
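One quick way to check which snapshot API versions the apiserver actually serves (a diagnostic I am sketching here, not something from the Stash docs):

kubectl api-versions | grep snapshot.storage.k8s.io

On this 1.15 cluster with the alpha CRDs installed, I would expect that to print only snapshot.storage.k8s.io/v1alpha1.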

I can create a VolumeSnapshot manually like this:

apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
    name: pgadmin-backup
    namespace: pgadmin
spec:
    snapshotClassName: csi-rbdplugin-snapclass
    source:
        name: pgadmin
        kind: PersistentVolumeClaim

The resulting VolumeSnapshot and its VolumeSnapshotContent are visible with kubectl.
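For reference, I list them with something like:

kubectl get volumesnapshot -n pgadmin
kubectl get volumesnapshotcontent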

asoltesz avatar Jun 01 '20 20:06 asoltesz

Can you describe any BackupSession?

hossainemruz avatar Jun 01 '20 20:06 hossainemruz

I believe this belonged to the first execution today:

kubectl describe backupsession default-1591042691
Name:         default-1591042691
Namespace:    pgadmin
Labels:       app.kubernetes.io/component=stash-backup
              app.kubernetes.io/managed-by=stash.appscode.com
              stash.appscode.com/invoker-name=default
              stash.appscode.com/invoker-type=BackupConfiguration
Annotations:  <none>
API Version:  stash.appscode.com/v1beta1
Kind:         BackupSession
Metadata:
  Creation Timestamp:  2020-06-01T20:18:11Z
  Generation:          1
  Owner References:
    API Version:           stash.appscode.com/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  BackupConfiguration
    Name:                  default
    UID:                   dddee1c9-d7d4-45d0-8906-8dbf481e54a2
  Resource Version:        16804
  Self Link:               /apis/stash.appscode.com/v1beta1/namespaces/pgadmin/backupsessions/default-1591042691
  UID:                     d6659208-dc83-442f-8f24-5cc7eb0a6e80
Spec:
  Invoker:
    API Group:  stash.appscode.com
    Kind:       BackupConfiguration
    Name:       default
Status:
  Phase:  Running
  Targets:
    Phase:  Running
    Ref:
      Kind:       PersistentVolumeClaim
      Name:       pgadmin
    Total Hosts:  1
Events:
  Type    Reason                 Age   From                      Message
  ----    ------                 ----  ----                      -------
  Normal  BackupSession Running  28m   BackupSession Controller  Backup job has been created succesfully/sidecar is watching the BackupSession.

asoltesz avatar Jun 01 '20 20:06 asoltesz

Any log from the backup job?

hossainemruz avatar Jun 01 '20 20:06 hossainemruz

Here they are:

kc logs stash-vs-pvc-pgadmin-1591044603-xtp97
I0601 20:50:04.574623       1 log.go:181] FLAG: --alsologtostderr="false"
I0601 20:50:04.574772       1 log.go:181] FLAG: --backupsession="default-1591044603"
I0601 20:50:04.574781       1 log.go:181] FLAG: --bypass-validating-webhook-xray="false"
I0601 20:50:04.574789       1 log.go:181] FLAG: --enable-analytics="true"
I0601 20:50:04.574797       1 log.go:181] FLAG: --help="false"
I0601 20:50:04.574805       1 log.go:181] FLAG: --kubeconfig=""
I0601 20:50:04.574811       1 log.go:181] FLAG: --log-flush-frequency="5s"
I0601 20:50:04.574820       1 log.go:181] FLAG: --log_backtrace_at=":0"
I0601 20:50:04.574835       1 log.go:181] FLAG: --log_dir=""
I0601 20:50:04.574842       1 log.go:181] FLAG: --logtostderr="true"
I0601 20:50:04.575139       1 log.go:181] FLAG: --master=""
I0601 20:50:04.575176       1 log.go:181] FLAG: --metrics-enabled="true"
I0601 20:50:04.575189       1 log.go:181] FLAG: --pushgateway-url="http://rolling-quetzal-stash.kube-system.svc:56789"
I0601 20:50:04.575197       1 log.go:181] FLAG: --service-name="stash-operator"
I0601 20:50:04.575206       1 log.go:181] FLAG: --stderrthreshold="0"
I0601 20:50:04.575214       1 log.go:181] FLAG: --target-kind="PersistentVolumeClaim"
I0601 20:50:04.575222       1 log.go:181] FLAG: --target-name="pgadmin"
I0601 20:50:04.575230       1 log.go:181] FLAG: --use-kubeapiserver-fqdn-for-aks="true"
I0601 20:50:04.575238       1 log.go:181] FLAG: --v="3"
I0601 20:50:04.575247       1 log.go:181] FLAG: --vmodule=""
W0601 20:50:04.663968       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
Error: the server could not find the requested resource (post volumesnapshots.snapshot.storage.k8s.io)
Usage:
  stash create-vs [flags]

Flags:
      --backupsession string     Name of the respective BackupSession object
  -h, --help                     help for create-vs
      --kubeconfig string        Path to kubeconfig file with authorization information (the master location is set by the master flag).
      --master string            The address of the Kubernetes API server (overrides any value in kubeconfig)
      --metrics-enabled          Specify whether to export Prometheus metrics (default true)
      --pushgateway-url string   Pushgateway URL where the metrics will be pushed
      --target-kind string       Kind of the Target
      --target-name string       Name of the Target

Global Flags:
      --alsologtostderr                  log to standard error as well as files
      --bypass-validating-webhook-xray   if true, bypasses validating webhook xray checks
      --enable-analytics                 Send analytical events to Google Analytics (default true)
      --log-flush-frequency duration     Maximum number of seconds between log flushes (default 5s)
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --logtostderr                      log to standard error instead of files (default true)
      --service-name string              Stash service name. (default "stash-operator")
      --stderrthreshold severity         logs at or above this threshold go to stderr
      --use-kubeapiserver-fqdn-for-aks   if true, uses kube-apiserver FQDN for AKS cluster to workaround https://github.com/Azure/AKS/issues/522 (default true)
  -v, --v Level                          log level for V logs (default 0)
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging

F0601 20:50:04.706442       1 main.go:41] Error in Stash Main: the server could not find the requested resource (post volumesnapshots.snapshot.storage.k8s.io)

asoltesz avatar Jun 01 '20 20:06 asoltesz

Note: I use Kubernetes 1.15.11 with the "snapshot.storage.k8s.io/v1alpha1" API. There is now a v1beta1 API also. Can this be a problem? Which one does Stash target?

It seems that is the case. Stash uses snapshot.storage.k8s.io/v1beta1. Can you try the v1beta1 API?
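Note that the field names changed between the two versions: in v1beta1, snapshotClassName became volumeSnapshotClassName, and the source references the PVC via persistentVolumeClaimName. The equivalent of your manual snapshot would look something like:

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
    name: pgadmin-backup
    namespace: pgadmin
spec:
    volumeSnapshotClassName: csi-rbdplugin-snapclass
    source:
        persistentVolumeClaimName: pgadmin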

hossainemruz avatar Jun 01 '20 20:06 hossainemruz

Not easily.

I would have to migrate my stuff up to Kubernetes 1.17, because the v1beta1 API is only available from that version on.

Unfortunately, Kubernetes 1.16+ removed a lot of older API versions on which some of my stuff is based, so the migration is non-trivial.

In any case, a big fat warning seems warranted in the Stash guides that the volume snapshotting feature currently only works on Kubernetes 1.17+.

However, this severely limits the usability of Stash (big installations will certainly not migrate quickly to such new Kubernetes versions), so it might be worth thinking about supporting the older API version too.

asoltesz avatar Jun 01 '20 21:06 asoltesz

In any case, a big fat warning seems warranted in the Stash guides that the volume snapshotting feature currently only works on Kubernetes 1.17+.

Agreed.

However, this severely limits the usability of Stash (big installations will certainly not migrate quickly to such new Kubernetes versions), so it might be worth thinking about supporting the older API version too.

It won't be easy for us to support both API versions; there are some breaking changes between them. We would rather go with the v1beta1 API.

hossainemruz avatar Jun 01 '20 21:06 hossainemruz

It won't be easy for us to support both API versions; there are some breaking changes between them. We would rather go with the v1beta1 API.

I perfectly understand the issue; it would require a complexity increase in Stash.

If it is OK that major functionality of Stash only works with Kubernetes 1.17+ (a fairly recent version), then that is perfectly acceptable.

However, to be fair, this should be communicated clearly on the main documentation pages of Stash, so that people don't waste their time on software that is not applicable to them.

Maybe there could also be a compact "Features" page in the documentation (like on the website, but without the fluff) that lists the major features and, for each, the minimum and maximum Kubernetes versions it works with.

I am pretty certain that non-trivial Kubernetes installations will always have a hard time upgrading to newer versions, given the backwards incompatibilities continuously introduced in Kubernetes releases (like the dropped v1alpha1 snapshot API we have just bumped into) and the wide, colorful ecosystem around it. I already have a hard time collecting software that is capable of working together in a single cluster (granted, I am not a Kubernetes veteran yet).

asoltesz avatar Jun 06 '20 10:06 asoltesz

Maybe there could also be a compact "Features" page in the documentation (like on the website, but without the fluff) that lists the major features and, for each, the minimum and maximum Kubernetes versions it works with.

I completely agree. We already planned something like this: documentation acknowledging the limitations of the current release. However, we couldn't make it happen (we couldn't find the time when we planned it, and then forgot about it later).

Btw, we have extensive E2E tests that check against Kubernetes 1.11.x to 1.18.x. Unfortunately, we don't have any E2E tests for VolumeSnapshot, which is why this problem went unnoticed. All our tests run in GitHub Actions, and since VolumeSnapshot requires a cloud-provider-specific CSI driver, it is not practical to run such tests there. However, there is a hostPath CSI driver that should work; we just didn't get time to explore it.

hossainemruz avatar Jun 06 '20 11:06 hossainemruz