`freenas-api-iscsi`: Concurrently taken snapshots become stuck
I'm using freenas-api-iscsi in conjunction with VolSync to take regular backups of my data. VolSync takes a snapshot of the PV, mounts it, and copies the data offsite. While this generally works, scheduling multiple backups at the same time results in some of them becoming stuck.
Right now I have five different backups scheduled at 2am. Every morning, three complete successfully and two are stuck in a pending state. Apparently, the driver couldn't create some of the snapshots and timed out after 90 seconds:
$ kubectl -n sonarr get volumesnapshots -o yaml
apiVersion: v1
items:
- apiVersion: snapshot.storage.k8s.io/v1
  kind: VolumeSnapshot
  metadata:
    creationTimestamp: "2023-10-21T02:00:00Z"
    finalizers:
    - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
    generation: 1
    labels:
      app.kubernetes.io/created-by: volsync
      volsync.backube/cleanup: 390ed6ce-1aa3-4b2d-8788-f7b1a998c62a
    name: volsync-sonarr-backup-src
    namespace: sonarr
    ownerReferences:
    - apiVersion: volsync.backube/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicationSource
      name: sonarr-backup
      uid: 390ed6ce-1aa3-4b2d-8788-f7b1a998c62a
    resourceVersion: "37473618"
    uid: 196a9cf2-6bc4-4713-a823-38eb6363202e
  spec:
    source:
      persistentVolumeClaimName: data-sonarr-0
      volumeSnapshotClassName: freenas-api-iscsi-csi
  status:
    boundVolumeSnapshotContentName: snapcontent-196a9cf2-6bc4-4713-a823-38eb6363202e
    error:
      message: 'Failed to check and update snapshot content: failed to take snapshot
        of the volume pvc-2f49a909-d88e-417e-a6a7-94a7e778f49c: "rpc error: code =
        Aborted desc = operation locked due to in progress operation(s): [\"create_snapshot_name_snapshot-196a9cf2-6bc4-4713-a823-38eb6363202e\",\"volume_id_pvc-2f49a909-d88e-417e-a6a7-94a7e778f49c\"]"'
      time: "2023-10-21T02:01:30Z"
    readyToUse: false
kind: List
metadata:
  resourceVersion: ""
TrueNAS shows the operation as 'in progress' in its jobs view, but the actual snapshot has not been created.
Restarting the democratic-csi controller causes it to pick up the failed snapshots and successfully create them (sometimes multiple restarts are necessary because one times out again). TrueNAS keeps displaying the jobs in the UI until I reboot the server or manually restart the middleware.
I can work around the issue by scheduling the backups five minutes apart from each other.
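For reference, the staggering is just a matter of offsetting the cron triggers on the ReplicationSources. A rough sketch of two of them, assuming restic-based sources (the repository secret names are placeholders; the real manifests carry more restic options):
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: sonarr-backup
  namespace: sonarr
spec:
  sourcePVC: data-sonarr-0
  trigger:
    schedule: "0 2 * * *"              # 02:00
  restic:
    copyMethod: Snapshot
    volumeSnapshotClassName: freenas-api-iscsi-csi
    repository: sonarr-restic-secret   # placeholder secret name
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: radarr-backup
  namespace: radarr
spec:
  sourcePVC: data-radarr-0
  trigger:
    schedule: "5 2 * * *"              # 02:05, five minutes later
  restic:
    copyMethod: Snapshot
    volumeSnapshotClassName: freenas-api-iscsi-csi
    repository: radarr-restic-secret   # placeholder secret name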
The HelmRelease (I'm using Talos):
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: democratic-csi-freenas-api-iscsi
spec:
  interval: 10m
  chart:
    spec:
      chart: democratic-csi
      version: 0.14.1
      sourceRef:
        kind: HelmRepository
        name: democratic-csi-charts
        namespace: flux-system
      interval: 15m
  values:
    node:
      hostPID: true
      driver:
        extraEnv:
        - name: ISCSIADM_HOST_STRATEGY
          value: nsenter
        - name: ISCSIADM_HOST_PATH
          value: /usr/local/sbin/iscsiadm
        iscsiDirHostPath: /usr/local/etc/iscsi
        iscsiDirHostPathType: ""
    csiDriver:
      name: org.democratic-csi.freenas-api-iscsi
    storageClasses:
    - name: freenas-api-iscsi-csi
      defaultClass: true
      reclaimPolicy: Delete
      volumeBindingMode: Immediate
      allowVolumeExpansion: true
      parameters:
        fsType: ext4
      mountOptions: []
      secrets:
        node-stage-secret:
          node-db.node.session.auth.authmethod: CHAP
          node-db.node.session.auth.username: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_USERNAME}
          node-db.node.session.auth.password: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_PASSWORD}
          node-db.node.session.auth.username_in: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_USERNAME_IN}
          node-db.node.session.auth.password_in: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_PASSWORD_IN}
    volumeSnapshotClasses:
    - name: freenas-api-iscsi-csi
      parameters:
        detachedSnapshots: true
    driver:
      config:
        driver: freenas-api-iscsi
        httpConnection:
          protocols: https
          host: storage.internal.paulfriedrich.me
          port: 443
          apiKey: ${TRUENAS_DEMOCRATIC_CSI_API_KEY}
          allowInsecure: false
        zfs:
          datasetProperties:
            "org.freenas:description": "{{ parameters.[csi.storage.k8s.io/pvc/namespace] }}/{{ parameters.[csi.storage.k8s.io/pvc/name] }}"
          datasetParentName: local/k8s/volumes
          detachedSnapshotsDatasetParentName: local/k8s/snapshots
          zvolEnableReservation: false
        iscsi:
          targetPortal: "10.0.60.3:3260"
          nameTemplate: "{{ parameters.[csi.storage.k8s.io/pvc/namespace] }}-{{ parameters.[csi.storage.k8s.io/pvc/name] }}"
          namePrefix: "csi-"
          targetGroups:
          - targetGroupAuthGroup: 1
            targetGroupAuthType: CHAP Mutual
            targetGroupInitiatorGroup: 1
            targetGroupPortalGroup: 2
          extentInsecureTpc: true
          extentXenCompat: false
          extentDisablePhysicalBlocksize: true
          extentBlocksize: 512
          extentRpm: SSD
          extentAvailThreshold: 80
The installed snapshot-controller is the democratic-csi one:
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: democratic-csi-snapshot-controller
spec:
  interval: 10m
  chart:
    spec:
      chart: snapshot-controller
      version: 0.2.4
      sourceRef:
        kind: HelmRepository
        name: democratic-csi-charts
        namespace: flux-system
      interval: 15m
TrueNAS is running version TrueNAS-SCALE-22.12.3.3.
So there are two different failures?
- the CSI driver is failing to submit the jobs
- TN is failing to actually execute the jobs
Can you send the controller logs showing the CSI driver failing to submit?
I don't have the logs from tonight because I restarted the controller, but I ran the backups again at 12:40 (UTC).
This is the one that's stuck this time:
➜ kubectl get --all-namespaces volumesnapshots
NAMESPACE   NAME                        READYTOUSE   SOURCEPVC       SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS           SNAPSHOTCONTENT                                    CREATIONTIME   AGE
radarr      volsync-radarr-backup-src   false        data-radarr-0                                         freenas-api-iscsi-csi   snapcontent-2c4f71e3-b0b3-47f9-bad9-89cce9d3459f                  25m
The logs of the containers: csi-driver, csi-proxy, external-snapshotter, external-provisioner, external-attacher, external-resizer.
Hey, I have a very similar issue with the freenas-api-nfs driver. I usually have 25 snapshots taken at the same time, but only 4 succeed and the rest get stuck. The snapshot errors bubble up to the PVCs that use those snapshots, and the PVCs show these events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 16m freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d failed to provision volume with StorageClass "truenas-nfs-main": rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning ProvisioningFailed 13m (x6 over 15m) freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d failed to provision volume with StorageClass "truenas-nfs-main": rpc error: code = Aborted desc = operation locked due to in progress operation(s): ["create_volume_name_pvc-345a1b4d-b0bd-4441-bd04-3f17cfd531cd"]
Normal ExternalProvisioning 3m5s (x62 over 18m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'freenas-api-nfs.cluster.local' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Normal Provisioning 53s (x12 over 18m) freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d External provisioner is provisioning volume for claim "default/volsync-jellyseerr-src"
Warning ProvisioningFailed 53s (x5 over 12m) freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d failed to provision volume with StorageClass "truenas-nfs-main": rpc error: code = AlreadyExists desc = volume has already been created with a different size, existing size: 1073741824, required_bytes: 9214464, limit_bytes: 0
From these messages, it seems the driver attempts the create, gets locked out, and by the time it retries, the volume already exists but with a different size.
Usually when snapshots are happening, my TrueNAS server's CPU is at 100%. I found that if I delete 4 of the PVCs that depend on these snapshots at a time, the PVCs proceed and my VolSync snapshots finish. However, using this approach I end up with dangling snapshots and NFS export entries that never get cleaned up.
Also, I found that TrueNAS has a setting that limits concurrent replication tasks (I wonder if it is hitting that limit).
Is it possible to add a bounded queue of provisioning requests to the driver, which would limit snapshot concurrency? It would be nice if the size of the queue were configurable too. That way, new requests wouldn't be sent to the TrueNAS server before previous requests have cleared the queue.
Hello, I also seem to have this issue. I have 12 VolSync ReplicationSources that trigger a backup at the same time, and their source PVCs get stuck in Pending, similarly to the reports above.
I noticed that after limiting Replication Tasks to 1 in the TrueNAS advanced settings, most of the PVCs eventually succeed (within 20 minutes), except exactly 2 (not always the same 2) that remain stuck in Pending indefinitely (I assume; I waited for a few hours). I can also see 2 active jobs (replication.run_onetime) in the TrueNAS dashboard at this point.
At that point, if I delete the 2 stuck PVCs, VolSync schedules new ones and those get provisioned. The 2 active jobs remain until TrueNAS is rebooted, however.
I can confirm this behavior on both 23.10 and now on 24.04.0 after upgrading this morning.
I have been having the same issue. I have checked the logs on my cluster, as well as on TrueNAS, and cannot find anything useful. I also don't have an actual solution to the problem.
However, I have done some dev work with TrueNAS SCALE before, so I wrote this script and set it up on a cron job to run 20 minutes after the replications usually start.
It does the following:
- Grabs all jobs that are RUNNING, have the method replication.run_onetime, and where progress.description is empty (for whatever reason the description is not populated on these jobs)
- Checks whether those jobs have been running for longer than 10 minutes
- Aborts those jobs
- Removes the destination dataset (not recursive) (also not sure if this is needed)
This seems to work well; the aborted jobs run again a minute or two after the script runs and then succeed. Feel free to use it, alter it, whatever.
from datetime import datetime, timedelta, timezone
from middlewared.client import Client
import subprocess


def main():
    # Initialize Middleware Client
    middleware = Client()
    try:
        # Get current time as a timezone-aware datetime in UTC
        current_time = datetime.now(timezone.utc)
        ten_minutes_ago = current_time - timedelta(minutes=10)
        filters = [
            ['state', '=', 'RUNNING'],
            ['method', '=', 'replication.run_onetime'],
            ['progress.description', '=', '']
        ]
        jobs = middleware.call('core.get_jobs', filters)
        for job in jobs:
            start_time = job['time_started']
            # Check if the job started more than 10 minutes ago
            if start_time <= ten_minutes_ago:
                job_id = job['id']
                start_time_formatted = start_time.strftime("%d %B %Y %H:%M:%S")
                running_time = current_time - start_time
                running_time_minutes = running_time.total_seconds() // 60
                print(f'{job_id}')
                print(f'- Started: {start_time_formatted}')
                print(f'- Runtime: {running_time_minutes}m')
                print(f'- Aborting: {job_id}')
                # Abort the job
                abort_result = middleware.call('core.job_abort', job_id)
                if abort_result is None:
                    print("-- Success")
                else:
                    print("-- Failed")
                # Parse and format target dataset path
                target_dataset = job['arguments'][0]['target_dataset']
                target_dataset = target_dataset.rsplit('/snapshot-', 1)[0]
                print(f'- Destroying ZFS dataset "{target_dataset}"')
                # Quote dataset path, in case it contains spaces
                destroy_command = f'zfs destroy "{target_dataset}"'
                # Destroy the ZFS dataset
                destroy_result = subprocess.run(destroy_command, capture_output=True, shell=True)
                if destroy_result.returncode == 0:
                    print("-- Success")
                else:
                    print("-- Failed")
                print("")
    finally:
        # Close the middleware client connection
        middleware.close()


if __name__ == "__main__":
    main()
Can confirm the same issue with TrueNAS SCALE Dragonfish-24.04.2 and Talos 1.7.5 with Kubernetes v1.30.2. The script from @Heavybullets8 seems to remove those stuck tasks and get them retried.
For what it is worth, I switched VolSync's ReplicationSource restic.copyMethod to Clone and the issue is gone. It is much faster, and unless you have a specific need for snapshots, Clone seems to work flawlessly.
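A minimal sketch of that change on one of the sources discussed here, assuming a restic-based ReplicationSource (the repository secret name is a placeholder):
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: sonarr-backup
  namespace: sonarr
spec:
  sourcePVC: data-sonarr-0
  trigger:
    schedule: "0 2 * * *"
  restic:
    copyMethod: Clone                  # clone the source PVC instead of snapshotting it
    repository: sonarr-restic-secret   # placeholder secret name
With copyMethod: Clone, VolSync provisions a temporary PVC cloned directly from the source PVC, so no Kubernetes VolumeSnapshot objects are created at all.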
Interesting, are you using Clone with freenas-api-iscsi or freenas-api-nfs? I'd like to give it a try, but I had issues with cloning iSCSI volumes before.
Sounds like putting a queue/limit in the driver wouldn't be much help, as TN itself already has the ability to do that?
Can anyone here switch to an ssh-based driver variant and observe the behavior? The TN API/middleware may be generally misbehaving.