error decrypting iscsi username
Describe the bug
When mounting our PVC we see the following error: error decrypting iscsi username. Strangely, this issue only happens on specific nodes in the cluster, not on all nodes. And sometimes a different pod on the same node already has a PVC mounted successfully.
Environment: Dev / Technical Test / Acceptance
- Trident version: happened on v21.01.2 and v21.04.0
- Trident installation flags used: debug: false and autosupport: false
- Container runtime: OpenShift 4.7.7
- Kubernetes version: 1.20
- Kubernetes orchestrator: OpenShift 4.7.7
- Kubernetes enabled feature gates: none
- OS: RHEL CoreOS
- NetApp backend types: ONTAP
- Other:
To Reproduce Steps to reproduce the behavior: mount an iSCSI PVC.
Expected behavior Mount should always work without any issues.
Additional context This happened on two clusters that both use the same SVM.
@Numblesix did you have useCHAP set to True? Can you open up a support ticket with NetApp so we can get the Trident logs and look into this further?
To open a case with NetApp, please go to https://mysupport.netapp.com/site/.
- At the bottom left, click on 'Contact Support'.
- Find the appropriate number for your region to call in, or log in.
- Note: Trident is a product supported by NetApp, based on a supported NetApp storage serial number (SN).
- Open the case on the NetApp storage SN and provide a description of the problem.
- Be sure to mention that the product is Trident on Kubernetes, and provide the details. Mention this GitHub issue.
- The case will be directed to Trident support engineers for a response.
@balaramesh yes, useCHAP was set to true in the backend config :)
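For context, the CHAP knobs live in the backend definition for the ontap-san driver. A minimal sketch of such a backend (all LIFs, names, and secrets below are placeholders, not the reporter's actual values):

```bash
# Hypothetical ontap-san backend with bidirectional CHAP enabled; every value
# below is a placeholder for illustration only.
cat <<'EOF' > backend-san-chap.json
{
  "version": 1,
  "storageDriverName": "ontap-san",
  "managementLIF": "10.0.0.1",
  "dataLIF": "10.0.0.2",
  "svm": "svm_iscsi",
  "username": "vsadmin",
  "password": "secret",
  "useCHAP": true,
  "chapUsername": "my-initiator-user",
  "chapInitiatorSecret": "my-initiator-secret",
  "chapTargetUsername": "my-target-user",
  "chapTargetInitiatorSecret": "my-target-secret"
}
EOF

# Register the backend with Trident (assumed to run in the 'trident' namespace).
tridentctl create backend -f backend-san-chap.json -n trident
```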
I'll try to get a reproducer, then I'll open a case :) For now the issue solved itself by rebooting the node T_T
hi @Numblesix If you venture out on reproducing this, can you note down the entries as follows (a shell sketch of both steps comes after the example entries)?
1. Once the issue is encountered while the container is stuck in ContainerCreating with this error message, check for CHAP credentials under the /var/lib/iscsi/nodes/<pvc_name>/<storage LIF IP:3260>/default file. In the failing case the file looks like this:
node.session.nr_sessions = 1
node.session.auth.authmethod = None
node.session.auth.chap_algs = MD5
node.session.timeo.replacement_timeout = 120
2. Rename this file first, preferably, and retry. If it works, check the entries of this file and see if the CHAP information is populated. Example:
node.session.auth.authmethod = CHAP
node.session.auth.username = OS-Sandbox
node.session.auth.password = abcdefg123!@#
node.session.auth.chap_algs = MD5
node.session.timeo.replacement_timeout = 120
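A shell sketch of those two steps (the record path is illustrative; the directory under /var/lib/iscsi/nodes/ is keyed by the target IQN and portal on the affected node):

```bash
# Step 1: inspect the iscsid node record for the affected target while the
# pod is stuck in ContainerCreating. Placeholders mark node-specific values.
REC="/var/lib/iscsi/nodes/<target IQN>/<storage LIF IP>,3260/default"
grep 'node.session.auth' "$REC"
# Failing case: authmethod = None and no username/password entries.

# Step 2: move the stale record aside and retry the mount. On success, the
# freshly written record should contain the populated CHAP entries.
mv "$REC" "$REC.bak"
grep 'node.session.auth' "$REC"
# Expected after a successful retry: authmethod = CHAP plus credential lines.
```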
If you are able to successfully reproduce it, do collect the following along with the Trident logs and reach out to NetApp support as Bala suggested.
- Sosreports (through a Trident debug pod / toolbox container)
- Kubelet / OC logs. If you plan to enable debug mode in Trident, do collect a set of logs prior to this, as enabling it will recreate the containers (see the log-collection sketch below).
Good luck with the reproduction and we will help you once a support case is created.
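For reference, one way to grab the kubelet logs per node on OpenShift 4.x, plus the current Trident logs before debug mode is toggled (the node name is a placeholder):

```bash
# Kubelet journal from a specific worker node (OpenShift 4.x).
oc adm node-logs <node-name> -u kubelet > kubelet-node.log

# Current Trident logs, collected before enabling debug mode so the
# pre-restart state is preserved.
tridentctl logs -n trident > trident.log
```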
So I couldn't reproduce it myself, but the issue happened again, strangely on the same node.
I checked the file and the CHAP creds are set correctly.
Moving and renaming the file did not result in new files being created :(
The logs also don't show anything strange with debug disabled.
I've scheduled the container onto a different host and it worked right away.
I also noticed that when I restart the pod on another node, I always get this message in the events:
Multi-Attach error for volume "pvc-38c3d63c-472c-4f15-94e7-90a8cae49acf" Volume is already exclusively attached to one node and can't be attached to another
Which would mean that the PVC was in fact still attached to the "original" node.
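One way to confirm where Kubernetes believes the volume is still attached is the VolumeAttachment API; a sketch using the PV name from the event above:

```bash
# Show all attachments: the NODE and ATTACHED columns reveal which node
# still holds each persistent volume.
oc get volumeattachments

# Narrow down to the PV from the Multi-Attach event.
oc get volumeattachments | grep pvc-38c3d63c-472c-4f15-94e7-90a8cae49acf
```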
Today I also found this in the logs: MountVolume.MountDevice failed for volume "pvc-62d2061e-7a19-4ba0-99ad-a470d638499d" : rpc error: code = Internal desc = iSCSI login failed using CHAP
It seems my issue was this: https://netapp-trident.readthedocs.io/en/stable-v21.04/kubernetes/operations/tasks/worker.html#iscsi
I had more than just MD5 in my iscsid.conf; even though my backend is an ONTAP system, it still had issues.
Maybe those "issues" should be part of the Release Notes @gnarl ?
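For anyone hitting the same thing, a hedged sketch of applying that doc's recommendation on a single node; on RHEL CoreOS the persistent route would normally be a MachineConfig rather than editing the file in place:

```bash
# Limit the CHAP algorithms the initiator offers to MD5, per the linked
# Trident page, then restart iscsid to pick up the change.
sudo sed -i 's/^node.session.auth.chap_algs = .*/node.session.auth.chap_algs = MD5/' /etc/iscsi/iscsid.conf
sudo systemctl restart iscsid
```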
@Numblesix,
That statement to add MD5 for an Element OS (a.k.a. SolidFire) backend was made due to a change in Element OS. It seems odd that it is now causing an issue with OCP + ONTAP. Can you provide the config that you had prior to modifying it? We always want to reproduce the issue if possible.
Sure :), I applied the change now to 3 of 4 clusters and will see if everything works better now. Config before:
iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
node.startup = automatic
node.leading_login = No
node.session.auth.chap_algs = SHA3-256,SHA256,SHA1,MD5
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.reopen_max = 0
node.session.iscsi.FastAbort = Yes
node.session.scan = auto
Config after:
iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
node.startup = automatic
node.leading_login = No
node.session.auth.chap_algs = MD5
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.reopen_max = 0
node.session.iscsi.FastAbort = Yes
node.session.scan = auto
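After rolling that out, one way to verify what a node actually stored and negotiated (target IQN and portal are placeholders):

```bash
# Stored node record for one target, filtered to the auth-related settings.
iscsiadm -m node -T <target IQN> -p <storage LIF IP>:3260 -o show | grep -E 'auth|chap'

# Live session details, including the CHAP state of each session.
iscsiadm -m session -P 3
```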
Maybe also interesting: my NetApp is running NetApp Release 9.6P1.
So usually those 'error decrypting iscsi username' errors always appeared while doing an OpenShift update.
Today no issue, so I really think that ONTAP users should also rely only on MD5 for the time being :)
@Numblesix, thanks for adding the detail that this was only seen during an OCP update.
@gnarl not only then, I have to add. But it always happened while doing an update, because every pod gets recreated.