Crash with SIGSEGV while finalizing backup of a PVC with CSI on AWS EKS

Open Va1 opened this issue 1 year ago • 0 comments

What steps did you take and what happened: Velero 1.9.0 is deployed on AWS EKS 1.22 via the official Helm chart v2.31.0. Plugins: AWS v1.5.0, CSI v0.3.0.

During a backup, right after the CSI snapshots are created (both the VolumeSnapshot and the VolumeSnapshotContent are in the proper statuses, and the EBS snapshot displays as ready in the AWS console) and the backup is about to wrap up, Velero crashes with SIGSEGV. The backup stays in a Failed status.

Retried multiple times and it always ends this way.

What did you expect to happen: Backup succeeds and is restorable.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

Cannot provide this at the moment.

But here are the logs printed just before the crash:

2022/08/11 16:23:56  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
2022/08/11 16:24:01  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
2022/08/11 16:24:06  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
2022/08/11 16:24:11  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
time="2022-08-11T16:24:12Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130"
time="2022-08-11T16:24:12Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
time="2022-08-11T16:24:12Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130"
time="2022-08-11T16:24:12Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
2022/08/11 16:24:16  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
2022/08/11 16:24:21  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
I0811 16:24:23.683210       1 request.go:665] Waited for 1.046988495s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/apiextensions.k8s.io/v1?timeout=32s
2022/08/11 16:24:26  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19bfdcd]

goroutine 5971 [running]:
github.com/vmware-tanzu/velero/pkg/controller.(*backupController).deleteVolumeSnapshot.func1(0xc00045f040)
        /go/src/github.com/vmware-tanzu/velero/pkg/controller/backup_controller.go:931 +0xad
created by github.com/vmware-tanzu/velero/pkg/controller.(*backupController).deleteVolumeSnapshot
        /go/src/github.com/vmware-tanzu/velero/pkg/controller/backup_controller.go:927 +0xf7
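
The trace alone does not say which pointer was nil inside the goroutine started by deleteVolumeSnapshot. One possibility, and this is an assumption rather than a confirmed diagnosis, is that a VolumeSnapshot whose status the CSI driver had not populated yet was dereferenced; the "Waiting for CSI driver to reconcile volumesnapshot" lines above show one snapshot still pending at the time of the panic. In the snapshot.storage.k8s.io/v1 Go types, both the status and the bound content name are optional pointers. A minimal sketch of a guard for that case, using trimmed-down hypothetical stand-in types and a made-up helper name (this is not Velero's code):

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the snapshot API types.
// Only the fields relevant to the nil check are modeled.
type VolumeSnapshotStatus struct {
	BoundVolumeSnapshotContentName *string
}

type VolumeSnapshot struct {
	Name   string
	Status *VolumeSnapshotStatus // nil until the status is populated
}

// boundContentName (illustrative helper) returns the bound
// VolumeSnapshotContent name, or ok=false when the snapshot, its status,
// or the bound name is still nil. Checking every pointer on the path
// avoids a nil pointer dereference.
func boundContentName(vs *VolumeSnapshot) (name string, ok bool) {
	if vs == nil || vs.Status == nil || vs.Status.BoundVolumeSnapshotContentName == nil {
		return "", false
	}
	return *vs.Status.BoundVolumeSnapshotContentName, true
}

func main() {
	// A snapshot that has not been reconciled yet: Status is nil.
	pending := &VolumeSnapshot{Name: "velero-questdb-questdb-0-kkltb"}

	// A reconciled snapshot with a bound content name.
	content := "snapcontent-56d87f8f-5a15-4c36-9930-35359c2c23c1"
	ready := &VolumeSnapshot{
		Name:   "velero-questdb-questdb-0-x84zb",
		Status: &VolumeSnapshotStatus{BoundVolumeSnapshotContentName: &content},
	}

	for _, vs := range []*VolumeSnapshot{pending, ready} {
		if name, ok := boundContentName(vs); ok {
			fmt.Printf("%s is bound to %s\n", vs.Name, name)
		} else {
			fmt.Printf("%s has no bound content yet, skipping\n", vs.Name)
		}
	}
}
```

Without the `vs.Status == nil` check, the pending snapshot in this sketch would panic with the same "invalid memory address or nil pointer dereference" message.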

One of the backups in question, in YAML format:

apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade,post-rollback
    helm.sh/hook-delete-policy: before-hook-creation
    velero.io/source-cluster-k8s-gitversion: v1.22.10-eks-84b4fe6
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: 22+
  creationTimestamp: "2022-08-11T23:00:39Z"
  generation: 5
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    helm.sh/chart: velero-2.31.0
    velero.io/schedule-name: velero-questdb-pvc
    velero.io/storage-location: default
  name: velero-questdb-pvc-20220811230039
  namespace: velero
  resourceVersion: "29774925"
  uid: 6358f885-1184-45a6-922b-9b87b33054c1
spec:
  defaultVolumesToRestic: false
  hooks: {}
  includeClusterResources: true
  includedNamespaces:
  - ohlc
  includedResources:
  - pvc
  - pv
  labelSelector:
    matchLabels:
      app.kubernetes.io/instance: questdb
      app.kubernetes.io/name: questdb
  metadata: {}
  snapshotVolumes: true
  storageLocation: default
  ttl: 168h0m0s
  volumeSnapshotLocations:
  - default
status:
  completionTimestamp: "2022-08-11T23:00:49Z"
  expiration: "2022-08-18T23:00:39Z"
  failureReason: get a backup with status "InProgress" during the server starting,
    mark it as "Failed"
  formatVersion: 1.1.0
  phase: Failed
  progress:
    itemsBackedUp: 2
    totalItems: 2
  startTimestamp: "2022-08-11T23:00:39Z"
  version: 1

A describe of a VolumeSnapshot created by a backup (one of):

Name:         velero-questdb-questdb-0-x84zb
Namespace:    ohlc
Labels:       velero.io/backup-name=velero-questdb-pvc-20220811230039
Annotations:  <none>
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshot
Metadata:
  Creation Timestamp:  2022-08-11T23:00:39Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  Generate Name:  velero-questdb-questdb-0-
  Generation:     1
  Managed Fields:
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection":
          v:"snapshot.storage.kubernetes.io/volumesnapshot-bound-protection":
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-08-11T23:00:39Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
        f:labels:
          .:
          f:velero.io/backup-name:
      f:spec:
        .:
        f:source:
          .:
          f:persistentVolumeClaimName:
        f:volumeSnapshotClassName:
    Manager:      velero-plugin-for-csi
    Operation:    Update
    Time:         2022-08-11T23:00:39Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:boundVolumeSnapshotContentName:
        f:creationTime:
        f:readyToUse:
        f:restoreSize:
    Manager:         Go-http-client
    Operation:       Update
    Subresource:     status
    Time:            2022-08-11T23:00:40Z
  Resource Version:  29774856
  UID:               56d87f8f-5a15-4c36-9930-35359c2c23c1
Spec:
  Source:
    Persistent Volume Claim Name:  questdb-questdb-0
  Volume Snapshot Class Name:      questdb-vsc
Status:
  Bound Volume Snapshot Content Name:  snapcontent-56d87f8f-5a15-4c36-9930-35359c2c23c1
  Creation Time:                       2022-08-11T23:00:40Z
  Ready To Use:                        true
  Restore Size:                        50Gi
Events:                                <none>

A describe of a VolumeSnapshotContent created by a backup (one of):

Name:         snapcontent-56d87f8f-5a15-4c36-9930-35359c2c23c1
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshotContent
Metadata:
  Creation Timestamp:  2022-08-11T23:00:39Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
  Generation:  1
  Managed Fields:
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection":
      f:spec:
        .:
        f:deletionPolicy:
        f:driver:
        f:source:
          .:
          f:volumeHandle:
        f:volumeSnapshotClassName:
        f:volumeSnapshotRef:
          .:
          f:apiVersion:
          f:kind:
          f:name:
          f:namespace:
          f:resourceVersion:
          f:uid:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-08-11T23:00:40Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:creationTime:
        f:readyToUse:
        f:restoreSize:
        f:snapshotHandle:
    Manager:         Go-http-client
    Operation:       Update
    Subresource:     status
    Time:            2022-08-11T23:00:40Z
  Resource Version:  29774845
  UID:               dd15120a-fa73-4a9f-b3d7-28102e169489
Spec:
  Deletion Policy:  Delete
  Driver:           ebs.csi.aws.com
  Source:
    Volume Handle:             vol-069935c75bcc9a2db
  Volume Snapshot Class Name:  questdb-vsc
  Volume Snapshot Ref:
    API Version:       snapshot.storage.k8s.io/v1
    Kind:              VolumeSnapshot
    Name:              velero-questdb-questdb-0-x84zb
    Namespace:         ohlc
    Resource Version:  29774811
    UID:               56d87f8f-5a15-4c36-9930-35359c2c23c1
Status:
  Creation Time:    1660258840065000000
  Ready To Use:     true
  Restore Size:     53687091200
  Snapshot Handle:  snap-08a0e7632dac36f3f
Events:             <none>
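
The two describes report the same snapshot in different units: the VolumeSnapshot shows an RFC 3339 creation time and a restore size of 50Gi, while the VolumeSnapshotContent shows the creation time as Unix nanoseconds and the restore size in bytes. A small Go check that the values agree (the helper names are made up for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// nanosToRFC3339 renders a Unix-nanosecond timestamp as RFC 3339 in UTC.
func nanosToRFC3339(ns int64) string {
	return time.Unix(0, ns).UTC().Format(time.RFC3339)
}

// gibToBytes converts gibibytes (the "Gi" suffix) to bytes.
func gibToBytes(gib int64) int64 {
	return gib << 30 // 1 Gi = 2^30 bytes
}

func main() {
	// Creation Time from the VolumeSnapshotContent status.
	fmt.Println(nanosToRFC3339(1660258840065000000)) // 2022-08-11T23:00:40Z

	// Restore Size from the VolumeSnapshot status, 50Gi, in bytes.
	fmt.Println(gibToBytes(50)) // 53687091200
}
```

Both results match the other object's status, so the snapshot objects are consistent with each other.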

Chart values overrides:

configuration:
  features: EnableCSI
  provider: aws
  backupStorageLocation:
    name: default
    provider: aws
    bucket: ***-velero-backup-storage
    config:
      region: eu-central-1
  volumeSnapshotLocation:
    name: default
    provider: aws
    config:
      region: eu-central-1

credentials:
  useSecret: false

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.5.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
  - name: velero-plugin-for-csi
    image: velero/velero-plugin-for-csi:v0.3.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

schedules:
  questdb-pvc:
    disabled: false
    schedule: "0 23 * * 1,2,3,4,5"
    csiSnapshotTimeout: 60m
    template:
      ttl: "168h"
      includedNamespaces:
        - ohlc
      includedResources:
        - pvc
        - pv
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: questdb
          app.kubernetes.io/instance: questdb
      includeClusterResources: true
      snapshotVolumes: true
      storageLocation: default
      volumeSnapshotLocations:
        - default

serviceAccount:
  server:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::***:role/***-velero

Anything else you would like to add:

Environment:

  • Velero version: 1.9.0
  • velero-plugin-for-aws version: 1.5.0
  • velero-plugin-for-csi version: 0.3.0
  • Velero features: EnableCSI
  • Helm chart version: 2.31.0
  • Kubernetes version: v1.22.10-eks-84b4fe6
  • Cloud provider or hardware configuration: AWS EKS

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • :+1: for "I would like to see this bug fixed as soon as possible"
  • :-1: for "There are more important bugs to focus on right now"

Va1, Aug 12 '22 09:08