Attach/Detach Operations Blocked During Volume Creation/Deletion
Describe the bug
We encountered an issue with the integration of a Lenovo storage system into Kubernetes while creating and deleting a large number of volumes via FCP. When many volumes are requested at once, they are provisioned sequentially, and once some of them are ready to be attached to their pods, the csi-attacher throws a "context deadline exceeded" error. The error persists until all volumes have been created. A similar situation occurs during mass volume deletion, but with the detach operation.
We suspect that a lock taken during volume creation/deletion is blocking other operations.
Environment
- Trident version: 25.02
- Kubernetes version: 1.30.4
- Protocol: FCP
To Reproduce
- Deploy a StatefulSet in the cluster with the following characteristics (see the manifest sketch after this list):
  - The StorageClass in volumeClaimTemplates is served by the Trident provisioner.
  - podManagementPolicy: Parallel
  - replicas: 100
- Observe the volume creation process.
- Note the "context deadline exceeded" errors in the csi-attacher logs.
- Repeat the process for volume deletion to observe similar behavior with the detach operation.
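A minimal StatefulSet sketch matching the characteristics above; the StorageClass name, image, and volume size are placeholders and will differ in your environment:

```yaml
# Minimal repro sketch. StorageClass name, image, and volume size are
# placeholders; substitute values from your own environment.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: fcp-scale-test
spec:
  serviceName: fcp-scale-test
  replicas: 100                    # mass volume creation
  podManagementPolicy: Parallel    # request all pods/PVCs at once
  selector:
    matchLabels:
      app: fcp-scale-test
  template:
    metadata:
      labels:
        app: fcp-scale-test
    spec:
      containers:
        - name: app
          image: registry.k8s.io/pause:3.9
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: trident-fcp   # assumed Trident-backed StorageClass
        resources:
          requests:
            storage: 1Gi
```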
Expected behavior
Volumes should be created and attached (or deleted and detached) without errors, even when operations are performed in parallel.
Actual Behavior
The csi-attacher throws a "context deadline exceeded" error during attach/detach operations while volumes are being created/deleted in parallel.
Potential Cause
A lock might be set during volume creation/deletion, blocking other operations until they time out.
Suggested Solution
Investigate the possibility of a lock being set during volume operations and explore ways to handle parallel operations more efficiently.
Hello, @P0lskay. You are correct: for its 8+ year history, the Trident controller has used a global mutex to serialize all workflows. While this was a simple way to avoid concurrency issues, Trident is now deployed at ever greater scale, and the global lock has become a bottleneck. We are actively working on a solution, and early results appear promising, but it is a complicated problem and we must be careful to test it thoroughly, all of which will take multiple releases. There will likely be opportunities to try the solution yourself before it becomes the default behavior, and early feedback will be valuable.
Hello @clintonk ! I am pleased to hear that you are already developing a solution to this problem. I think this will be very relevant for large k8s clusters. I will monitor the resolution of this issue. Thanks!
Hi @P0lskay, in 25.06 we rolled out concurrent Trident controller operations as an experimental enhancement. Feel free to try it out in your test environment and give us your feedback.
> NOTE: Not for use in production environments.
>
> [Tech Preview] Enabled concurrent Trident controller operations via the --enable-concurrency feature flag. This allows controller operations to run in parallel, improving performance for busy or large environments.
>
> NOTE: This feature is experimental and currently supports limited parallel workflows with the ONTAP-SAN driver (iSCSI and FCP protocols).
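For reference, a hypothetical sketch of passing the flag to the controller by editing its Deployment. The --enable-concurrency flag comes from the release note above, but the Deployment and container names here are assumptions, and the installer may provide a supported way to set it instead; check the 25.06 documentation:

```yaml
# Hypothetical sketch only: the Deployment/container names below are
# assumptions, and the installer may offer a supported way to set this flag.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trident-controller   # assumed controller Deployment name
  namespace: trident
spec:
  template:
    spec:
      containers:
        - name: trident-main           # assumed controller container name
          args:
            - --enable-concurrency     # Tech Preview flag from the 25.06 release notes
            # ...keep the existing args generated by the installer
```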