`freenas-api-iscsi`: Concurrently taken snapshots become stuck
I'm using freenas-api-iscsi in conjunction with VolSync to take regular backups of my data. VolSync takes a snapshot of the PV, mounts it, and copies the data offsite. While this generally works, scheduling multiple backups at the same time results in some of them becoming stuck.
Right now I have five different backups scheduled at 2am. Every morning, three complete successfully and two are stuck in a pending state. Apparently, the driver couldn't create some of the snapshots and timed out after 90 seconds:
$ kubectl -n sonarr get volumesnapshots -o yaml
apiVersion: v1
items:
- apiVersion: snapshot.storage.k8s.io/v1
  kind: VolumeSnapshot
  metadata:
    creationTimestamp: "2023-10-21T02:00:00Z"
    finalizers:
    - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
    generation: 1
    labels:
      app.kubernetes.io/created-by: volsync
      volsync.backube/cleanup: 390ed6ce-1aa3-4b2d-8788-f7b1a998c62a
    name: volsync-sonarr-backup-src
    namespace: sonarr
    ownerReferences:
    - apiVersion: volsync.backube/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicationSource
      name: sonarr-backup
      uid: 390ed6ce-1aa3-4b2d-8788-f7b1a998c62a
    resourceVersion: "37473618"
    uid: 196a9cf2-6bc4-4713-a823-38eb6363202e
  spec:
    source:
      persistentVolumeClaimName: data-sonarr-0
      volumeSnapshotClassName: freenas-api-iscsi-csi
  status:
    boundVolumeSnapshotContentName: snapcontent-196a9cf2-6bc4-4713-a823-38eb6363202e
    error:
      message: 'Failed to check and update snapshot content: failed to take snapshot
        of the volume pvc-2f49a909-d88e-417e-a6a7-94a7e778f49c: "rpc error: code =
        Aborted desc = operation locked due to in progress operation(s): [\"create_snapshot_name_snapshot-196a9cf2-6bc4-4713-a823-38eb6363202e\",\"volume_id_pvc-2f49a909-d88e-417e-a6a7-94a7e778f49c\"]"'
      time: "2023-10-21T02:01:30Z"
    readyToUse: false
kind: List
metadata:
  resourceVersion: ""
TrueNAS shows the operation as 'in progress' in its jobs view, but the actual snapshot has not been created.
Restarting the democratic-csi controller causes it to pick up the failed snapshots and successfully create them (sometimes multiple restarts are necessary because one times out again). TrueNAS keeps displaying the jobs in the UI until I reboot the server or manually restart the middleware.
I can work around the issue by scheduling the backups five minutes apart from each other.
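For reference, the staggering is just a matter of offsetting the cron triggers on the ReplicationSources. A rough sketch of two of them, assuming restic-based sources (the repository secret names are placeholders; the real manifests carry more restic options):
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: sonarr-backup
  namespace: sonarr
spec:
  sourcePVC: data-sonarr-0
  trigger:
    schedule: "0 2 * * *"              # 02:00
  restic:
    copyMethod: Snapshot
    volumeSnapshotClassName: freenas-api-iscsi-csi
    repository: sonarr-restic-secret   # placeholder secret name
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: radarr-backup
  namespace: radarr
spec:
  sourcePVC: data-radarr-0
  trigger:
    schedule: "5 2 * * *"              # 02:05, five minutes later
  restic:
    copyMethod: Snapshot
    volumeSnapshotClassName: freenas-api-iscsi-csi
    repository: radarr-restic-secret   # placeholder secret name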
The HelmRelease (I'm using Talos):
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: democratic-csi-freenas-api-iscsi
spec:
  interval: 10m
  chart:
    spec:
      chart: democratic-csi
      version: 0.14.1
      sourceRef:
        kind: HelmRepository
        name: democratic-csi-charts
        namespace: flux-system
      interval: 15m
  values:
    node:
      hostPID: true
      driver:
        extraEnv:
        - name: ISCSIADM_HOST_STRATEGY
          value: nsenter
        - name: ISCSIADM_HOST_PATH
          value: /usr/local/sbin/iscsiadm
        iscsiDirHostPath: /usr/local/etc/iscsi
        iscsiDirHostPathType: ""
    csiDriver:
      name: org.democratic-csi.freenas-api-iscsi
    storageClasses:
    - name: freenas-api-iscsi-csi
      defaultClass: true
      reclaimPolicy: Delete
      volumeBindingMode: Immediate
      allowVolumeExpansion: true
      parameters:
        fsType: ext4
      mountOptions: []
      secrets:
        node-stage-secret:
          node-db.node.session.auth.authmethod: CHAP
          node-db.node.session.auth.username: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_USERNAME}
          node-db.node.session.auth.password: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_PASSWORD}
          node-db.node.session.auth.username_in: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_USERNAME_IN}
          node-db.node.session.auth.password_in: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_PASSWORD_IN}
    volumeSnapshotClasses:
    - name: freenas-api-iscsi-csi
      parameters:
        detachedSnapshots: true
    driver:
      config:
        driver: freenas-api-iscsi
        httpConnection:
          protocols: https
          host: storage.internal.paulfriedrich.me
          port: 443
          apiKey: ${TRUENAS_DEMOCRATIC_CSI_API_KEY}
          allowInsecure: false
        zfs:
          datasetProperties:
            "org.freenas:description": "{{ parameters.[csi.storage.k8s.io/pvc/namespace] }}/{{ parameters.[csi.storage.k8s.io/pvc/name] }}"
          datasetParentName: local/k8s/volumes
          detachedSnapshotsDatasetParentName: local/k8s/snapshots
          zvolEnableReservation: false
        iscsi:
          targetPortal: "10.0.60.3:3260"
          nameTemplate: "{{ parameters.[csi.storage.k8s.io/pvc/namespace] }}-{{ parameters.[csi.storage.k8s.io/pvc/name] }}"
          namePrefix: "csi-"
          targetGroups:
          - targetGroupAuthGroup: 1
            targetGroupAuthType: CHAP Mutual
            targetGroupInitiatorGroup: 1
            targetGroupPortalGroup: 2
          extentInsecureTpc: true
          extentXenCompat: false
          extentDisablePhysicalBlocksize: true
          extentBlocksize: 512
          extentRpm: SSD
          extentAvailThreshold: 80
The installed snapshot-controller is the democratic-csi one:
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: democratic-csi-snapshot-controller
spec:
  interval: 10m
  chart:
    spec:
      chart: snapshot-controller
      version: 0.2.4
      sourceRef:
        kind: HelmRepository
        name: democratic-csi-charts
        namespace: flux-system
      interval: 15m
TrueNAS is running version TrueNAS-SCALE-22.12.3.3.
So there are two different failures?
- the CSI driver is failing to submit the jobs
- TN is failing to actually execute the jobs
Can you send the controller logs showing the CSI driver failing to submit?
I don't have the logs from tonight because I restarted the controller, but I ran the backups again at 12:40 (UTC).
This is the one that's stuck this time:
➜ kubectl get --all-namespaces volumesnapshots
NAMESPACE   NAME                        READYTOUSE   SOURCEPVC       SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS           SNAPSHOTCONTENT                                    CREATIONTIME   AGE
radarr      volsync-radarr-backup-src   false        data-radarr-0                                         freenas-api-iscsi-csi   snapcontent-2c4f71e3-b0b3-47f9-bad9-89cce9d3459f                  25m
The logs of the containers: csi-driver, csi-proxy, external-snapshotter, external-provisioner, external-attacher, external-resizer.
Hey, I have a very similar issue with the freenas-api-nfs driver. I usually have 25 snapshots taken at the same time, but only 4 succeed and the rest get stuck. The snapshot errors bubble up to the PVCs that use those snapshots, and the PVCs show these events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 16m freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d failed to provision volume with StorageClass "truenas-nfs-main": rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning ProvisioningFailed 13m (x6 over 15m) freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d failed to provision volume with StorageClass "truenas-nfs-main": rpc error: code = Aborted desc = operation locked due to in progress operation(s): ["create_volume_name_pvc-345a1b4d-b0bd-4441-bd04-3f17cfd531cd"]
Normal ExternalProvisioning 3m5s (x62 over 18m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'freenas-api-nfs.cluster.local' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Normal Provisioning 53s (x12 over 18m) freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d External provisioner is provisioning volume for claim "default/volsync-jellyseerr-src"
Warning ProvisioningFailed 53s (x5 over 12m) freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d failed to provision volume with StorageClass "truenas-nfs-main": rpc error: code = AlreadyExists desc = volume has already been created with a different size, existing size: 1073741824, required_bytes: 9214464, limit_bytes: 0
From these messages, it seems the driver attempts the create, gets locked out, and by the time it retries, the volume already exists but with a different size.
Usually when snapshots are happening, my TrueNAS server's CPU is at 100%. I found that if I delete 4 of the PVCs that depend on these snapshots at a time, the PVCs proceed and my VolSync snapshots finish. However, using this approach I end up with dangling snapshots and NFS export entries that never get cleaned up.
Also, I found that TrueNAS has a setting that limits concurrent replication tasks (I wonder if it is hitting that limit).
Is it possible to add a bounded queue of provisioning requests to the driver, which would limit snapshot concurrency? It would be nice if the size of the queue were configurable too. That way, new requests wouldn't be sent to the TrueNAS server before previous requests have cleared the queue.
Hello, I also seem to have this issue. I have 12 VolSync ReplicationSources that trigger a backup at the same time, and their source PVCs get stuck in Pending, similarly to the reports above.
I noticed that after limiting Replication Tasks to 1 in the TrueNAS advanced settings, most of the PVCs eventually succeed (within 20 minutes), except exactly 2 (not always the same 2) that remain stuck in Pending indefinitely (I assume; I waited for a few hours). I can also see 2 active jobs (replication.run_onetime) in the TrueNAS dashboard at this point.
At that point, if I delete the 2 stuck PVCs, VolSync schedules new ones and those get provisioned. The 2 active jobs remain until TrueNAS is rebooted, however.
I can confirm this behavior on both 23.10 and now on 24.04.0 after upgrading this morning.
I have been having the same issue. I have checked the logs on my cluster, as well as on TrueNAS, and cannot find anything useful. I also don't have an actual solution to the problem.
However, I have done some dev work with TrueNAS SCALE before, so I wrote this script and set it up on a cron job to run 20 minutes after the replications usually start.
It does the following:
- Grabs all jobs that are RUNNING, have the method replication.run_onetime, and where progress.description is empty (for whatever reason the description is not populated on these jobs)
- Checks whether those jobs have been running for longer than 10 minutes
- Aborts those jobs
- Removes the destination dataset (not recursive) (also not sure if this is needed)
This seems to work well; the aborted jobs run again a minute or two after the script runs and then succeed. Feel free to use it, alter it, whatever.
from datetime import datetime, timedelta, timezone
from middlewared.client import Client
import subprocess


def main():
    # Initialize Middleware Client
    middleware = Client()
    try:
        # Get current time as a timezone-aware datetime in UTC
        current_time = datetime.now(timezone.utc)
        ten_minutes_ago = current_time - timedelta(minutes=10)
        filters = [
            ['state', '=', 'RUNNING'],
            ['method', '=', 'replication.run_onetime'],
            ['progress.description', '=', '']
        ]
        jobs = middleware.call('core.get_jobs', filters)
        for job in jobs:
            start_time = job['time_started']
            # Check if the job started more than 10 minutes ago
            if start_time <= ten_minutes_ago:
                job_id = job['id']
                start_time_formatted = start_time.strftime("%d %B %Y %H:%M:%S")
                running_time = current_time - start_time
                running_time_minutes = running_time.total_seconds() // 60
                print(f'{job_id}')
                print(f'- Started: {start_time_formatted}')
                print(f'- Runtime: {running_time_minutes}m')
                print(f'- Aborting: {job_id}')
                # Abort the job
                abort_result = middleware.call('core.job_abort', job_id)
                if abort_result is None:
                    print("-- Success")
                else:
                    print("-- Failed")
                # Parse and format target dataset path
                target_dataset = job['arguments'][0]['target_dataset']
                target_dataset = target_dataset.rsplit('/snapshot-', 1)[0]
                print(f'- Destroying ZFS dataset "{target_dataset}"')
                # Quote dataset path, in case it contains spaces
                destroy_command = f'zfs destroy "{target_dataset}"'
                # Destroy the ZFS dataset
                destroy_result = subprocess.run(destroy_command, capture_output=True, shell=True)
                if destroy_result.returncode == 0:
                    print("-- Success")
                else:
                    print("-- Failed")
                print("")
    finally:
        # Close the middleware client connection
        middleware.close()


if __name__ == "__main__":
    main()
Can confirm the same issue with TrueNAS SCALE Dragonfish-24.04.2 and Talos 1.7.5 with Kubernetes v1.30.2. The script from @Heavybullets8 seems to remove those stuck tasks and get them retried.
For what it is worth, I switched VolSync's ReplicationSource restic.copyMethod to Clone and the issue is gone. It is much faster, and unless you have a specific need for snapshots, Clone seems to work flawlessly.
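A minimal sketch of that change on one of the sources discussed here, assuming a restic-based ReplicationSource (the repository secret name is a placeholder):
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: sonarr-backup
  namespace: sonarr
spec:
  sourcePVC: data-sonarr-0
  trigger:
    schedule: "0 2 * * *"
  restic:
    copyMethod: Clone                  # clone the source PVC instead of snapshotting it
    repository: sonarr-restic-secret   # placeholder secret name
With copyMethod: Clone, VolSync provisions a temporary PVC cloned directly from the source PVC, so no Kubernetes VolumeSnapshot objects are created at all.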
Interesting, are you using Clone with freenas-api-iscsi or freenas-api-nfs? I'd like to give it a try, but I had issues with cloning iSCSI volumes before.
Sounds like putting a queue/limit in the driver wouldn't be much help, as TN itself already has the ability to do that?
Can anyone here switch to an ssh-based driver variant and observe the behavior? The TN API/middleware may be generally misbehaving.