NFS auto export policy didn't include all workers, so a restarted pod that landed on a new worker couldn't access the NFS export
Describe the bug
A pod that was using a read-write-many NFS PV (auto export policy) got rescheduled onto a new worker node which wasn't included in the export policy, leaving the pod unable to mount the NFS export.
Sequence of events:
- OCP 4.16 with Trident v24.06
- Read-write-many NFS PVC created for pods
- After upgrade to Trident v25.02.1, user did a rolling reboot of workers
- No new workers were added (or replaced) in the cluster
- One of the pods using the volume was scheduled onto a node it hadn't run on before (e.g. with 6 nodes and 5 pods previously on nodes 1-5, one pod ended up on node 6)
Environment
Provide accurate information about the environment to help us reproduce the issue.
- Trident version: v25.02.1
- Trident installation flags used: OpenShift operator
- Kubernetes orchestrator: OpenShift v4.16
- NetApp backend types: ONTAP NAS
To Reproduce
Steps to reproduce the behavior:
- Create NFS PVC with auto export policy using Trident v24.06.1
- Upgrade Trident to v25.02.1
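For illustration, a minimal sketch of the kind of storage class and RWX PVC involved; the names and size are placeholders rather than values from this report, and the auto export policy itself is enabled on the Trident backend (autoExportPolicy: true), not in the PVC:

# Hypothetical example only: storage class name, PVC name and size are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-nas
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-rwx-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: ontap-nas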
Expected behavior
- Export policy is updated to include the new worker node
Additional context
- We can't get the logs (dark site).
- There was a recent similar issue with export policies, but it was fixed in 25.02. I'm not sure whether the fix kicks in immediately after the Trident upgrade or only after a rolling reboot (in which case seeing this once after the reboot would be expected, but it shouldn't recur)?
Hello,
we have the same issue on OCP 4.18 and Trident 25.2.1 installed via OperatorHub (after updating from a Helm-based installation). The autoExportPolicy feature is enabled. We have a static setup, so the filter (autoExportCIDRs) contains the node IPs (/32) instead of a whole network. When I check the export policy, only one or two IPs are listed in the rule.
Question: Should the export policy rule contain all node IPs, or does Trident update the entries dynamically based on which nodes connect to the PVs?
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  creationTimestamp: "2025-01-31T09:41:50Z"
  finalizers:
  - trident.netapp.io
  generation: 1
  name: svmp
  namespace: trident
  resourceVersion: "136272684"
  uid: 0a2a63b9-5f9f-4248-85cf-387d1d321538
spec:
  autoExportCIDRs:
  - 172.150.129.1/32
  - 172.150.129.2/32
  - 172.150.129.3/32
  - 172.150.129.4/32
  autoExportPolicy: true
  backendName: svmp
  credentials:
    name: XXX
  dataLIF: svmprod
  managementLIF: svmp
  storageDriverName: ontap-nas
  svm: svmp
  version: 1
@scaleoutsean if any of the pods ended up on a new node and were in an access-denied state before the upgrade to 25.02.1 but after 25.02.0 (when per-volume export policies were introduced for ONTAP NAS), you'd have to trigger a volume unpublish by deleting/moving the pod to another node, then republish once the 25.02.1 upgrade is successful. Is there a possibility the rebooted worker node (node 6 in your example) had a change in IP address? If all the volumes were attached before the upgrade, Trident is supposed to continue using the backend-based export policy until it is safe to migrate to the volume-based export policy.

@gr33npr to answer your question, the export policy rule should contain only the IPs of the nodes that are mounting the volume, to provide more granular access control. Your autoExportCIDRs act as an additional filter for any IPs outside of that list, but the rules will still contain only the IPs necessary at mount time.
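To make the "move the pod to another node" step concrete, here is a minimal sketch; the workload name (my-app) and node name (worker-5) are made-up placeholders, and the affinity stanza would be added to the workload's pod template (e.g. via oc edit) and removed again afterwards:

# Hypothetical fragment: temporarily pin the workload to a different worker so the
# volume is unpublished from the old node and republished on the target node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app          # placeholder workload name
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - worker-5   # placeholder node name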
According to the K8s admin there were no changes on the workers (I thought of that too, that maybe a failed node had been replaced or similar, and he said no). The only worker-side change was a rolling reboot after the Trident upgrade to v25.02.1.
(Am I 100% sure? No. Maybe they did something else between the time the older version of Trident was installed and the time they did that rolling reboot, and forgot about it.)
Original issue has been resolved by upgrading Trident. Closing due to inactivity.