
error decrypting iscsi username

Open Numblesix opened this issue 4 years ago • 14 comments

Describe the bug When mounting our PVC we see the following error: error decrypting iscsi username. Strangely, this issue only happens on specific nodes in the cluster, not on all of them, and sometimes a different pod on the same node already has a PVC mounted.

Environment Dev/Technical Test/Acceptance

  • Trident version: happened on v21.01.2 and v21.04.0
  • Trident installation flags used: debug: false and autosupport: false
  • Container runtime: OpenShift 4.7.7
  • Kubernetes version: 1.20
  • Kubernetes orchestrator: OpenShift 4.7.7
  • Kubernetes enabled feature gates: none
  • OS: RHEL CoreOS
  • NetApp backend types: ONTAP
  • Other:

To Reproduce Steps to reproduce the behavior: Mount iscsi pvc

Expected behavior Mount should always work without any issues.

Additional context Happened on two clusters that both use the same SVM

Numblesix avatar May 05 '21 11:05 Numblesix

@Numblesix did you have useCHAP set to True? Can you open up a support ticket with NetApp so we can get the Trident logs and look into this further?

To open a case with NetApp, please go to https://mysupport.netapp.com/site/.

  1. Bottom left, Click on 'Contact Support'
  2. Find the appropriate number from your region to call in, or login.
  3. Note: Trident is a supported product by NetApp based on a supported Netapp storage SN.
  4. Open the case on the NetApp storage SN, and provide the description of the problem.
  5. Be sure to mention the product is Trident on Kubernetes, and provide the details. Mention this GitHub.
  6. The case will be directed to Trident support engineers for response.

balaramesh avatar May 05 '21 13:05 balaramesh

@balaramesh yes, useCHAP was set to true in the backend config :)

I'll try to get a reproducer, then I'll open a case :) For now the issue resolved itself after rebooting the node T_T

Numblesix avatar May 05 '21 13:05 Numblesix

Hi @Numblesix, if you do try to reproduce this, can you note down the following?

  1. Once the issue is encountered (pod stuck in ContainerCreating with this error message), check for CHAP credentials in the /var/lib/iscsi/nodes/<pvc_name>/<storage LIF IP:3260>/default file.

node.session.nr_sessions = 1
node.session.auth.authmethod = None
node.session.auth.chap_algs = MD5
node.session.timeo.replacement_timeout = 120

  2. Preferably rename this file first and retry. If it works, check the entries of this file and see whether the CHAP information is populated. Example:

node.session.auth.authmethod = CHAP
node.session.auth.username = OS-Sandbox
node.session.auth.password = abcdefg123!@#
node.session.auth.chap_algs = MD5
node.session.timeo.replacement_timeout = 120

If you are able to reproduce the issue, please collect the following along with the Trident logs and reach out to NetApp support as Bala suggested.

  1. Sosreports (through the Trident debug pod, Toolbox container)
  2. Kubelet/OC logs

If you plan to enable debug mode in Trident, collect a set of logs before doing so, as enabling it will recreate the containers.

Good luck with the reproduction and we will help you once a support case is created.

pnambees avatar May 05 '21 14:05 pnambees
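The check described above (is CHAP actually populated in the node record?) can be scripted. The following is a minimal Python sketch, not part of Trident or open-iscsi: the function names `parse_iscsi_record` and `chap_status` are hypothetical, and it assumes the record uses the plain `key = value` layout shown in the examples above.

```python
def parse_iscsi_record(text):
    """Parse 'key = value' lines from an iSCSI node record into a dict.

    Assumes the plain layout shown in the thread; blank lines and lines
    starting with '#' are skipped.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' or '#'.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def chap_status(settings):
    """Summarise whether CHAP is enabled and credentials are populated."""
    has_user = bool(settings.get("node.session.auth.username"))
    has_password = bool(settings.get("node.session.auth.password"))
    return {
        "authmethod": settings.get("node.session.auth.authmethod", "None"),
        "credentials_present": has_user and has_password,
    }
```

To use it, read the `default` file from the path mentioned above and pass its contents to `parse_iscsi_record`. A record with `authmethod = None` and no username or password would suggest the CHAP information was never written on that node.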

So I couldn't reproduce it myself, but the issue happened again, strangely on the same node.

I checked the file and the CHAP credentials are set correctly.

Moving and renaming the file didn't result in new files being created :(

The logs also don't show anything strange with debug disabled.

I've scheduled the container onto a different host and it worked right away.

Numblesix avatar May 11 '21 10:05 Numblesix

I also noticed that when I restart the pod on another node I always get this message in the events:

Multi-Attach error for volume "pvc-38c3d63c-472c-4f15-94e7-90a8cae49acf" Volume is already exclusively attached to one node and can't be attached to another

Which would mean that the PVC was in fact mounted on the "original" node

Numblesix avatar May 14 '21 10:05 Numblesix

Today I also found this error in the logs: MountVolume.MountDevice failed for volume "pvc-62d2061e-7a19-4ba0-99ad-a470d638499d" : rpc error: code = Internal desc = iSCSI login failed using CHAP

Numblesix avatar May 18 '21 08:05 Numblesix

Seems my issue was: https://netapp-trident.readthedocs.io/en/stable-v21.04/kubernetes/operations/tasks/worker.html#iscsi

I had more than MD5 in my iscsid.conf. Even though my backend is an ONTAP system, it still had issues.

Maybe those "issues" should be part of the release notes, @gnarl?

Numblesix avatar May 18 '21 13:05 Numblesix

@Numblesix,

That statement to add MD5 for an ElementOS(a/k/a SolidFire) backend was done due to a change in ElementOS. It seems odd that it is now causing an issue with OCP + ONTAP. Can you provide the config that you had prior to modifying it? We always want to reproduce the issue if possible.

gnarl avatar May 18 '21 13:05 gnarl

Sure :) I have now applied the change to 3 of 4 clusters and will see whether everything works better.

Config before:

iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
node.startup = automatic
node.leading_login = No
node.session.auth.chap_algs = SHA3-256,SHA256,SHA1,MD5
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.reopen_max = 0
node.session.iscsi.FastAbort = Yes
node.session.scan = auto

Config after:

iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
node.startup = automatic
node.leading_login = No
node.session.auth.chap_algs = MD5
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.reopen_max = 0
node.session.iscsi.FastAbort = Yes
node.session.scan = auto

Numblesix avatar May 18 '21 13:05 Numblesix
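The two configs above differ only in the `node.session.auth.chap_algs` line. To check other nodes for the same condition, something like the following Python sketch could flag an iscsid.conf that lists more than MD5. The function names `chap_algs` and `unexpected_algs` are hypothetical, and treating MD5 as the only allowed value is an assumption taken from the workaround in this thread, not a general recommendation.

```python
def chap_algs(conf_text):
    """Return the CHAP algorithms listed in iscsid.conf-style text, in order."""
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "node.session.auth.chap_algs":
            return [alg.strip() for alg in value.split(",") if alg.strip()]
    return []  # setting not present in the given text


def unexpected_algs(conf_text, allowed=("MD5",)):
    """List configured CHAP algorithms that are not in `allowed`.

    The default `allowed` reflects the workaround from this thread
    (an assumption, adjust as needed).
    """
    return [alg for alg in chap_algs(conf_text) if alg not in allowed]
```

Applied to the "before" config, `unexpected_algs` returns the three SHA variants; applied to the "after" config, it returns an empty list.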

Maybe also interesting: my NetApp is running NetApp Release 9.6P1

Numblesix avatar May 18 '21 14:05 Numblesix

So usually those "error decrypting iscsi username" errors appeared while doing an OpenShift update.

Today there was no issue, so I really think that ONTAP users should also rely only on MD5 for the time being :)

Numblesix avatar May 20 '21 09:05 Numblesix

@Numblesix, thanks for adding the detail that this was only seen during an OCP update.

gnarl avatar May 20 '21 13:05 gnarl

@gnarl not only then, I would have to add. But it always happened while doing an update, because every pod gets recreated.

Numblesix avatar May 20 '21 13:05 Numblesix