[BUG]: Add NVMeTCP connection parameter ctrl-loss-tmo=-1 to implement PowerStore best practices

Open dancohen21 opened this issue 5 months ago • 2 comments

Bug Description

The Dell Linux host connectivity guide (https://elabnavigator.dell.com/vault/pdf/Linux.pdf?key=1725374107988) makes the following recommendation on page 214:

By default, the Linux controller enters a reconnect state when it loses connection with the target. The default timeout for reconnecting is 10 minutes. However, a PowerStore node reboot may take more than 10 minutes. It is recommended to set ctrl-loss-tmo = -1 to keep the controller constantly reconnecting.

Per the SUSE documentation [https://documentation.suse.com/sles/15-SP5/html/SLES-all/cha-nvmeof.html]: "In case of a path loss, the NVMe subsystem tries to reconnect for a time period, defined by the ctrl-loss-tmo option of the nvme connect command."

I'm concerned that ctrl-loss-tmo = -1 will be required for the NVMeTCP connection to reconnect to PowerStore nodes during a PowerStore NDU (non-disruptive code upgrade). During an NDU the PowerStore nodes reboot one at a time, and a node may well be unavailable for longer than the default path timeout.

My novice reading of the code is that the nvmeTCPConnect function in gonvme_tcp_fc.go does not include this parameter:

    if duplicateConnect {
        exe = nvme.buildNVMeCommand([]string{NVMeCommand, "connect", "-t", "tcp", "-n", target.TargetNqn, "-a", target.Portal, "-s", NVMePort, "-D"})
    } else {
        exe = nvme.buildNVMeCommand([]string{NVMeCommand, "connect", "-t", "tcp", "-n", target.TargetNqn, "-a", target.Portal, "-s", NVMePort})
    }
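To illustrate the kind of change being requested, here is a minimal sketch of how the connect arguments could be assembled with the timeout included. This is not the gonvme implementation: buildConnectArgs and its parameters are hypothetical names made up for this example, and it assumes that -l is the nvme-cli short form of --ctrl-loss-tmo on the nvme connect command.

```go
package main

import "strconv"

// buildConnectArgs is a hypothetical helper (not part of gonvme) that
// assembles the arguments for "nvme connect" over TCP and appends the
// controller loss timeout. A ctrlLossTmo of -1 asks the host to keep
// retrying the connection indefinitely.
func buildConnectArgs(nqn, portal, port string, duplicateConnect bool, ctrlLossTmo int) []string {
	args := []string{
		"connect",
		"-t", "tcp",
		"-n", nqn,
		"-a", portal,
		"-s", port,
		// Assumption: "-l" is the short form of "--ctrl-loss-tmo".
		"-l", strconv.Itoa(ctrlLossTmo),
	}
	if duplicateConnect {
		// Same flag the existing code passes for duplicate connections.
		args = append(args, "-D")
	}
	return args
}
```

Calling buildConnectArgs(target.TargetNqn, target.Portal, NVMePort, duplicateConnect, -1) would give an argument list that could be passed to the existing command builder after NVMeCommand. Whether the value should be hard-coded or made configurable is a decision for the maintainers.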

If a change is needed, I also request that the currently supported CSI-powerstore driver builds be updated so that, for example, an OpenShift 4.14 environment using CSM-Operator 1.5.1 and CSI driver 2.10.1 can get this enhancement.

Logs

No logs available; see Dell SR 197072815.

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

Perform a PowerStore code upgrade (NDU), for example from 3.6.0.0 to 3.6.1.2, with an OpenShift cluster attached and using PVs.

Expected Behavior

Hosts should be able to survive paths to storage going away and coming back during all normal data center operations.

CSM Driver(s)

csi-powerstore 2.10.1

Installation Type

csm-operator 1.5.1

Container Storage Modules Enabled

No response

Container Orchestrator

OpenShift 4.14

Operating System

OpenShift Linux - RHCOS based on RHEL 9.2

dancohen21, Sep 10 '24 17:09