Nodes/Pods lost access to iSCSI-based PVCs during/after upgrade to 25.02.1
Describe the bug
Recently we performed a Trident upgrade from 24.10.1 to 25.02.1.
After the upgrade was done by the operator and the trident-node Pods were replaced, we faced severe consequences: Pods lost access to most of their volumes (but not all of them). The same thing happened when we upgraded from 24.10.0 to 24.10.1.
Before the upgrade the clusters were checked: all running Pods had access to all PVC volumes, both paths under 'multipath -ll' were healthy for all dm-xxx devices, and no orphaned/ghost block devices (/dev/sdXX) were present on the nodes.
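For reference, this is roughly the kind of pre-upgrade check we scripted (a minimal sketch assuming a standard Linux dm-multipath setup; the parsing of 'multipath -ll' and the sysfs heuristics are our own, not Trident code):

```python
#!/usr/bin/env python3
"""Rough pre-upgrade health check for iSCSI multipath devices.

Sketch only: assumes a standard Linux dm-multipath setup. The parsing of
'multipath -ll' and the sysfs heuristics are our own, not Trident code.
"""
import glob
import os
import subprocess


def suspect_multipath_paths():
    """Return 'multipath -ll' path lines that are not 'active ready'."""
    out = subprocess.run(["multipath", "-ll"],
                         capture_output=True, text=True).stdout
    # Heuristic: path lines reference an sdX device; anything not reporting
    # 'active ready' (e.g. 'failed faulty') is worth a look before upgrading.
    return [l.strip() for l in out.splitlines()
            if " sd" in l and "active ready" not in l]


def orphaned_iscsi_devices():
    """Return iSCSI-attached sdX devices not held by any device-mapper map."""
    orphans = []
    for dev in glob.glob("/sys/block/sd*"):
        # iSCSI devices resolve through an iSCSI session directory in sysfs.
        if "/session" not in os.path.realpath(dev):
            continue
        # An empty 'holders' directory means no dm map owns the device.
        if not os.listdir(os.path.join(dev, "holders")):
            orphans.append(os.path.basename(dev))
    return orphans


if __name__ == "__main__":
    for line in suspect_multipath_paths():
        print("suspect path:", line)
    for dev in orphaned_iscsi_devices():
        print("possible orphaned iSCSI device: /dev/" + dev)
```

Any iSCSI-attached sdX device not held by a dm-multipath map is what we mean by an 'orphaned/ghost' device above.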
What we observed was that after the upgrade the trident-node Pods were not able to start fully:
The main reason was that the driver-registrar container was not able to start fully:
It seems that, for an unknown reason, trident-main was not able to 'determine published paths for volumes':
It then tried to remove them from multipath and unmount their filesystems (active volumes of active Pods):
In the logs it complains that it was not able to flush the multipath maps and unmount the devices; technically it errors with 'map or partition in use', however the Pods lost access to the block devices anyway ...
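For context, 'map or partition in use' is what multipath reports when the device-mapper map still has a non-zero open count (for example, a filesystem on top of it is still mounted). A quick way to see which maps are still held open, as a hedged sketch using standard dmsetup column output (the parsing is ours):

```python
#!/usr/bin/env python3
"""List device-mapper maps that are still open.

Sketch only: uses the standard 'name' and 'open' column fields of
'dmsetup info --columns'; a non-zero open count is what makes
'multipath -f <map>' fail with 'map or partition in use'.
"""
import subprocess


def open_dm_maps():
    out = subprocess.run(
        ["dmsetup", "info", "--columns", "--noheadings",
         "--separator", ":", "-o", "name,open"],
        capture_output=True, text=True,
    ).stdout
    held = {}
    for line in out.splitlines():
        # Each line is "<map name>:<open count>".
        name, _, count = line.partition(":")
        if count.strip().isdigit() and int(count) > 0:
            held[name.strip()] = int(count)
    return held


if __name__ == "__main__":
    for name, count in open_dm_maps().items():
        print(f"{name}: open count {count} (flush would be refused)")
```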
Error from the console of one of the nodes:
A useful piece of information is that on the trident-controller side there were no 'Unpublishing volume from the node' messages, so the volumes were not removed at the igroup/NetApp level. Rather, it looks like a bug in the trident-node logic: it somehow decided that some volumes were not in use (???) and had to be unmounted/removed from the OS.
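For completeness, this is roughly how we checked for the absence of that message on the controller side (a sketch; the namespace, pod label selector and container name are assumptions about a default operator-based install, so adjust them for your deployment):

```python
#!/usr/bin/env python3
"""Grep the trident-controller logs for node-unpublish messages.

Sketch only: NAMESPACE, SELECTOR and CONTAINER are assumptions about a
default operator-based install; adjust them for your deployment.
"""
import subprocess

NAMESPACE = "trident"                                # assumed install namespace
SELECTOR = "app=controller.csi.trident.netapp.io"    # assumed controller pod label
CONTAINER = "trident-main"                           # assumed controller container name


def unpublish_log_lines(since="24h"):
    out = subprocess.run(
        ["kubectl", "logs", "-n", NAMESPACE, "-l", SELECTOR,
         "-c", CONTAINER, "--since", since],
        capture_output=True, text=True,
    ).stdout
    # The message quoted above; its absence suggests the node-side cleanup
    # was not driven by a controller-side unpublish.
    return [l for l in out.splitlines()
            if "Unpublishing volume from the node" in l]


if __name__ == "__main__":
    lines = unpublish_log_lines()
    if not lines:
        print("no 'Unpublishing volume from the node' messages found")
    for line in lines:
        print(line)
```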
Environment
- Trident version: upgrade from 24.10.1 to 25.02.1 (via operator)
- Container runtime: docker://26.1.0
- Kubernetes version: v1.30.6
- Kubernetes orchestrator: rancher 2.10.3
- OS: Flatcar Container Linux by Kinvolk 4081.2.0 (Oklo)
Hi @ptrkmkslv, we will need to collect additional logs to investigate this issue further. Can you please open a NetApp Support Case?
Hi @ptrkmkslv, has this issue been resolved?