
NFS auto export policy didn't include all workers, so a restarted pod that landed on a new worker couldn't access the NFS export

scaleoutsean opened this issue 6 months ago · 3 comments

Describe the bug
A pod that was using a read-write-many (RWX) NFS PV with an auto export policy got rescheduled onto a new worker node that wasn't included in the export policy, leaving the pod unable to mount the NFS volume. (A quick check for this mismatch is sketched after the sequence below.)

Sequence of events:

  • OCP 4.16 with Trident v24.06
  • Read-write-many NFS PVC created for pods
  • After upgrade to Trident v25.02.1, user did a rolling reboot of workers
  • No new workers were added (or replaced) in the cluster
  • One of the pods using the volume was rescheduled onto a node it hadn't run on before (e.g. with 6 nodes and 5 pods previously spread across nodes 1-5, one pod ended up on node 6)
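
For anyone hitting the same symptom, a quick way to confirm the mismatch (not part of the original report; the node, namespace, SVM, volume, and policy names below are placeholders) is to compare the nodes Trident has registered and the pod's placement against the export policy rules on the SVM:

# Kubernetes side: which nodes does Trident know about, and where did the pod land?
kubectl get tridentnodes -n trident
kubectl get pod <pod-name> -n <app-namespace> -o wide

# ONTAP side: which export policy does the volume use, and which client IPs are allowed?
volume show -vserver <svm-name> -volume <trident-volume-name> -fields policy
vserver export-policy rule show -vserver <svm-name> -policyname <policy-name>

If the node the pod moved to shows up in tridentnodes but not in the export policy rules, that matches the behavior described above.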

Environment

  • Trident version: v25.02.1
  • Trident installation flags used: OpenShift operator
  • Kubernetes orchestrator: OpenShift v4.16
  • NetApp backend types: ONTAP NAS

To Reproduce
Steps to reproduce the behavior:

  • Create NFS PVC with auto export policy using Trident v24.06.1 (see the sketch after this list)
  • Upgrade Trident to v25.02.1
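
For context, a minimal sketch of the kind of objects involved in step 1 (names, sizes, and the storage class parameters are illustrative, not taken from the reporter's cluster; autoExportPolicy itself is set on the TridentBackendConfig, not here):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-nas-rwx              # illustrative name
provisioner: csi.trident.netapp.io
parameters:
  backendType: "ontap-nas"         # selects the ontap-nas backend that has autoExportPolicy enabled
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data                # illustrative name
spec:
  accessModes:
    - ReadWriteMany                # RWX, as in the report
  resources:
    requests:
      storage: 10Gi
  storageClassName: ontap-nas-rwx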

Expected behavior

  • Export policy is updated to include the worker node the pod was rescheduled to, so the NFS mount succeeds

Additional context

  • We can't get the logs (dark site).
  • There was a recent, similar issue with export policies, but it was fixed in 25.02. I'm not sure whether that fix takes effect immediately after the Trident upgrade or only after a rolling reboot (in which case seeing the problem once after the reboot, but not again, would be expected).

scaleoutsean · May 28 '25 11:05

Hello,

we have the same issue on OCP 4.18 and Trident 25.02.1, installed via OperatorHub (after migrating from a Helm-based installation). The autoExportPolicy feature is enabled. We have a static setup, so the filter (autoExportCIDRs) contains the node IPs as /32 entries instead of a whole network. When I check the export policy, only one or two IPs are listed in the rule.

Question: should the export policy rule contain all node IPs, or does Trident update the entries dynamically based on which nodes actually mount the PVs?

apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  creationTimestamp: "2025-01-31T09:41:50Z"
  finalizers:
  - trident.netapp.io
  generation: 1
  name: svmp
  namespace: trident
  resourceVersion: "136272684"
  uid: 0a2a63b9-5f9f-4248-85cf-387d1d321538
spec:
  autoExportCIDRs:
  - 172.150.129.1/32
  - 172.150.129.2/32
  - 172.150.129.3/32
  - 172.150.129.4/32
  autoExportPolicy: true
  backendName: svmp
  credentials:
    name: XXX
  dataLIF: svmprod
  managementLIF: svmp
  storageDriverName: ontap-nas
  svm: svmp
  version: 1

gr33npr · Jun 04 '25 08:06

@scaleoutsean if any of the pods ended up on a new node and were in an access-denied state after the upgrade to 25.02.0 (when per-volume export policies were introduced for ONTAP NAS) but before the upgrade to 25.02.1, you'd have to trigger a volume unpublish by deleting/moving the pod to another node, and republish once the 25.02.1 upgrade is successful. Is there a possibility the rebooted worker node (node 6 in your example) had a change in IP address? If all the volumes were attached before the upgrade, Trident is supposed to continue using the backend-based export policy until it is safe to migrate to the volume-based export policy.

@gr33npr to answer your question, the export policy rule should contain only the IPs of the nodes that are mounting the volume, to provide more granular access control. Your autoExportCIDRs act as an additional filter against any IPs outside that list, but the rules will still contain only the IPs needed at mount time.
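
In case it helps others, a minimal sketch of the unpublish/republish workaround described above (pod and node names are placeholders; assumes the pod is managed by a Deployment or similar controller that recreates it):

# Optionally keep the pod off the problematic node while testing
kubectl cordon <node-6-name>

# Delete the pod so its controller reschedules it; this triggers a volume unpublish/republish
kubectl delete pod <pod-name> -n <app-namespace>

# Once the volume mounts successfully elsewhere, make node 6 schedulable again
kubectl uncordon <node-6-name>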

torirevilla · Jun 05 '25 15:06

According to the K8s admin there were no changes on the workers (I thought of that too, that maybe a failed node was replaced or something similar, and he said no). The only worker-side change was a rolling reboot after the Trident upgrade to v25.02.1.

(Am I 100% sure? No, maybe they did something else between the time the older version of Trident was installed and the time they did that rolling reboot, and forgot about it.)

scaleoutsean · Jun 05 '25 15:06

Original issue has been resolved by upgrading Trident. Closing due to inactivity.

torirevilla · Aug 29 '25 14:08