ceph-iscsi-cli icon indicating copy to clipboard operation
ceph-iscsi-cli copied to clipboard

Kernel hang with RHEL/Centos 7.5 when ABORTs are sent.

Open Jazz9 opened this issue 6 years ago • 18 comments

Hello,

Centos 7.5 VMware 6.0 U3 Ceph 12.2.5 with all SSD bluestore pool. ( only 20 OSD in SSD pool ) Active/standby iscsi

I have started using ceph-iscsi-cli in a small - non important production environment and last night both iscsi nodes crashed. A reboot fixed them ( they wouldn't reboot without a hard reset. The OS seemed fine but a hung task is my guess )

I am not 100% sure where to look for the logs that relate to the issue. I will put more monitoring on the system.

May 25 15:54:51 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 45 May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: 2018-05-25 15:55:11.794 856 [ERROR] tcmu_notify_conn_lost:187 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Handler connection lost (lock state 0) May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: tcmu_notify_conn_lost:187 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Handler connection lost (lock state 0) May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 45 May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 45 May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: 2018-05-25 15:55:11.798 856 [INFO] tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_2. May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_2. May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:15 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:15 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:18 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:18 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:21 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:21 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:24 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:24 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:27 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 15:55:27 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 15:55:29 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw

May 25 16:27:47 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 16:27:47 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 16:27:49 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw May 25 16:27:49 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: INFO: task fn-radosclient:1555 blocked for more than 120 seconds. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: fn-radosclient D ffff9df8b2218fd0 0 1555 1 0x00000080 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: Call Trace: May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule+0x29/0x70 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule_timeout+0x239/0x2c0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? __send_signal+0x18e/0x490 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] wait_for_completion+0xfd/0x140 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_state+0x20/0x20 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_cause_connection_reinstatement+0x9e/0x100 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_free_session+0x109/0x180 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_release_sessions_for_tpg+0x123/0x1e0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_tpg_disable_portal_group+0xcf/0x1e0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] lio_target_tpg_enable_store+0x6e/0xf0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] configfs_write_file+0x107/0x140 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] vfs_write+0xc0/0x1f0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] SyS_write+0x7f/0xf0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] system_call_fastpath+0x1c/0x21 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: INFO: task iscsi_ttx:1528 blocked for more than 120 seconds. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: iscsi_ttx D ffff9df7b194dee0 0 1528 2 0x00000084 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: Call Trace: May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule+0x29/0x70 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule_timeout+0x239/0x2c0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? list_del+0xd/0x30 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] wait_for_completion+0xfd/0x140 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_state+0x20/0x20 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] transport_generic_free_cmd+0xa2/0x150 [target_core_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_free_cmd+0x82/0x140 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_close_connection+0x4c6/0x8a0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_take_action_for_connection_exit+0x8b/0x120 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsi_target_tx_thread+0x1f2/0x240 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_atomic_t+0x30/0x30 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? iscsit_thread_get_cpumask+0xf0/0xf0 [iscsi_target_mod] May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] kthread+0xd1/0xe0 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? insert_kthread_work+0x40/0x40 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ret_from_fork_nospec_begin+0x21/0x21 May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? insert_kthread_work+0x40/0x40

I think these may be unrelated but interesting.

May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.539 852 [WARN] tcmu_block_device:395 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_block_device:395 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.555 852 [WARN] tcmu_rbd_lock:735 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Acquired exclusive lock. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_rbd_lock:735 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Acquired exclusive lock. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_unblock_device:418 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action. May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.563 852 [WARN] tcmu_unblock_device:418 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.

May 25 17:06:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:06:22.619 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:06:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:16:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:16:22.546 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:16:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:30:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:30:57.074 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:30:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:36:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:36:22.638 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:36:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:46:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:46:22.545 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 17:46:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 18:00:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 18:00:57.089 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported May 25 18:00:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported

Jazz9 avatar May 26 '18 04:05 Jazz9

A little bit more info... Running - tcmu-runner-1.3.0-89.g7d0023e.el7.x86_64

When the lockup occurs you can't run systemctl restart tcmu-runner as it never returns. I believe that is what is preventing reboot also.

Jazz9 avatar May 26 '18 05:05 Jazz9

Also noted that I am running the Centos 7.5 standard install of targetcli. Its version is slightly lower then the requirements on the ceph master docs. Do you think this is the issue and do you know where a pre compiled newer version may be found?

targetcli.noarch 2.1.fb46-4.el7_5

Jazz9 avatar May 26 '18 05:05 Jazz9

You are using the RHEL/Centos 7.5 kernel right?

It looks like you are hitting a bug in that kernel. It is missing this patch:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/target?id=527268df31e57cf2b6d417198717c6d6afdb1e3e

We are working on getting a fix released. It is merged in our internal repos and being QA'd right now. It should be released in about a month.

I will update this issue with the updated kernel version.

mikechristie avatar May 30 '18 16:05 mikechristie

Ok, that sounds good. If you want me to test it - I can. It seems that both iscsi nodes crash every few days.

Jazz9 avatar May 31 '18 01:05 Jazz9

and yes - Centos 7.5 with latest updates.

Jazz9 avatar May 31 '18 01:05 Jazz9

Did this get merged?

Jazz9 avatar Jun 22 '18 02:06 Jazz9

It is in kernel-3.10.0-862.5 which should be released soon.

I do not know how long it takes to go from a RHEL z stream to Centos kernel though.

mikechristie avatar Jun 22 '18 04:06 mikechristie

Seems the latest Centos patches helped. No more crashes.

Jazz9 avatar Jul 10 '18 04:07 Jazz9

Hello,

We have the same error with a 4.18.9-1 kernel with a single gateway (without multipathing) and ceph 13.2.0 . Has this error been identified? Is there a patch?

Sep 25 07:19:00 src-ceph-gwiscsi1 tcmu-runner: 2018-09-25 07:19:00.359 1204 [DEBUG] alua_implicit_transition:568 rbd/rbd.VMW_CEPH_DS08: lock state 1 Sep 25 07:19:09 src-ceph-gwiscsi1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 71303168 Sep 25 07:20:01 src-ceph-gwiscsi1 systemd: Started Session 190 of user root. Sep 25 07:20:01 src-ceph-gwiscsi1 systemd: Starting Session 190 of user root. Sep 25 07:20:27 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:20:27 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:20:43 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:20:43 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:20:58 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:20:58 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:21:14 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:21:14 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:21:29 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:21:29 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:21:44 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:21:44 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: INFO: task kworker/u256:2:95199 blocked for more than 120 seconds. Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: Not tainted 4.18.9-1.el7.elrepo.x86_64 #1 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: kworker/u256:2 D 0 95199 2 0x80000080 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: Workqueue: tmr-user target_tmr_work [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: Call Trace: Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: __schedule+0x2ab/0x880 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? kvasprintf+0x7f/0xd0 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: schedule+0x36/0x80 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: schedule_timeout+0x161/0x300 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? __next_timer_interrupt+0xe0/0xe0 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: wait_for_completion_timeout+0x130/0x1a0 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? wake_up_q+0x80/0x80 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: __transport_wait_for_tasks+0xdd/0x140 [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: transport_wait_for_tasks+0x53/0x90 [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: core_tmr_abort_task+0xfb/0x180 [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: target_tmr_work+0x128/0x140 [target_core_mod] Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: process_one_work+0x179/0x390 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: worker_thread+0x4f/0x3e0 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: kthread+0x105/0x140 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? max_active_store+0x80/0x80 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? kthread_bind+0x20/0x20 Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ret_from_fork+0x35/0x40 Sep 25 07:22:00 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:22:00 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:22:15 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:22:15 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:22:30 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:22:30 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:22:46 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:22:46 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:01 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:23:01 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:16 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:23:16 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:32 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:23:32 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:47 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Sep 25 07:23:47 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Sep 25 07:23:53 src-ceph-gwiscsi1 kernel: INFO: task kworker/0:0:111368 blocked for more than 120 seconds.

VRAseb avatar Sep 25 '18 19:09 VRAseb

@VRAseb

You are hitting a similar issue in that a command is taking a long time which leads to the initiator timing it out and sending an abort.

You not hitting the same issue that was being discussed in this GH issue though. The other user was hitting a issue that only occurred in the RHEL kernel because it was missing a patch. Because of this even when the command finally unjammed it self and completed the kernel hit a bug and would never unjam.

In your case from the logs above it looks like the command just never completed. You would want to look in the ceph logs to check if something happened to the cluster or if tcmu-runner crashed so it could not complete the command.

mikechristie avatar Sep 26 '18 16:09 mikechristie

Hello, the incident we encounter blocks the service rbd-target-gw. We can not restart the rbd-target-gw service. We can not restart OS, the kernel is blocked. We have no alerts in CEPH journals (HEALTH_OK)

We tried with the tcmu-runner 1.4.0. We have the same incident.

Can you help us?

VRAseb avatar Sep 28 '18 12:09 VRAseb

The rbd-target-gw service and kernel are blocked on the stuck command.

Is tcmu-runner running (just do a 'ps -u root | grep runner')?

mikechristie avatar Oct 01 '18 23:10 mikechristie

Hello, tcmu-runner does not run after crash.

VRAseb avatar Oct 03 '18 11:10 VRAseb

tcmu-runner does not run after crash.

Do you mean when you see the kernel message about the hung_task_timeout_sec, tcmu-runner is not running still? Has it crashed? If so, could you get me the core dump if there is one or get the /var/log/messages around the time tcmu-runner crashes so we can see if there is a stack dump from that? Could you also get me the /var/log/tcmu-runner.log for around the same time?

mikechristie avatar Oct 04 '18 17:10 mikechristie

Hello,

messages : Oct 4 05:45:01 src-ceph-gwiscsi1 systemd: Started Session 283 of user root. Oct 4 05:45:01 src-ceph-gwiscsi1 systemd: Starting Session 283 of user root. Oct 4 05:49:25 src-ceph-gwiscsi1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 97406060 Oct 4 05:49:35 src-ceph-gwiscsi1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 97406062 Oct 4 05:50:01 src-ceph-gwiscsi1 systemd: Started Session 284 of user root. Oct 4 05:50:01 src-ceph-gwiscsi1 systemd: Starting Session 284 of user root. Oct 4 05:51:02 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:51:02 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:51:02 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:51:17 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:51:17 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:51:17 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:51:33 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:51:33 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:51:33 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:51:48 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:51:48 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:51:48 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:52:03 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:52:03 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:52:03 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:52:19 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:52:19 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:52:19 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:52:34 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:52:34 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:52:34 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: INFO: task iscsi_trx:2513 blocked for more than 120 seconds. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Not tainted 4.18.9-1.el7.elrepo.x86_64 #1 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_trx D 0 2513 2 0x80000080 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Call Trace: Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: __schedule+0x2ab/0x880 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? kvasprintf+0x7f/0xd0 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule+0x36/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule_timeout+0x161/0x300 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? __next_timer_interrupt+0xe0/0xe0 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: wait_for_completion_timeout+0x130/0x1a0 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? wake_up_q+0x80/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: __transport_wait_for_tasks+0xdd/0x140 [target_core_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: target_wait_free_cmd+0x56/0x90 [target_core_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: transport_generic_free_cmd+0x103/0x140 [target_core_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_free_cmd+0x43/0x90 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_close_connection+0x586/0x8f0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_rx_thread+0xe2/0xf0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: kthread+0x105/0x140 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? iscsi_target_tx_thread+0x1f0/0x1f0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? kthread_bind+0x20/0x20 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ret_from_fork+0x35/0x40 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: INFO: task kworker/0:0:164935 blocked for more than 120 seconds. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Not tainted 4.18.9-1.el7.elrepo.x86_64 #1 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: kworker/0:0 D 0 164935 2 0x80000080 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Workqueue: events iscsi_target_do_login_rx [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Call Trace: Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: __schedule+0x2ab/0x880 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule+0x36/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule_timeout+0x1dc/0x300 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: wait_for_completion+0x121/0x180 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? wake_up_q+0x80/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_stop_session+0x1b1/0x1c0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_check_for_session_reinstatement+0x1ed/0x280 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_check_for_existing_instances+0x30/0x40 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_do_login+0x31d/0x5f0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_do_login_rx+0x26f/0x340 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? iscsi_target_check_login_request+0x1a0/0x1a0 [iscsi_target_mod] Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: process_one_work+0x179/0x390 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: worker_thread+0x4f/0x3e0 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: kthread+0x105/0x140 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? max_active_store+0x80/0x80 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? kthread_bind+0x20/0x20 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ret_from_fork+0x35/0x40 Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48. Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.

The tcmu log stops at the time of the crash.

VRAseb avatar Oct 05 '18 13:10 VRAseb

I think we are talking about different things. The log snippet you posted above is not a crash. It is a warning indicating that a command is stuck.

I am asking if when you see this is tcmu-runner running? If you run

systemctl status tcmu-runner

does it show it is running?

If it is then the command is could be stuck in the ceph cluster somewhere. You then should check the /var/log/ceph logs on the OSDs to see if there are any errors in there.

Also what tools have you used to set this up? Did you use the ceph-iscsi tools or did you run targetlci manually?

mikechristie avatar Oct 06 '18 18:10 mikechristie

Oh yeah, if you are setting it up yourself with targetcli set the osd_op_timeout to something like 15 seconds. This will force most commands to fail if they have not completed within that timeout. We can then verify that they are getting sent to the OSD and getting stuck there.

mikechristie avatar Oct 06 '18 18:10 mikechristie

Hi @mikechristie, I investigated this case and this is what I can tell :

  • we didn't see any problem on ceph cluster (as yet)
  • tcmu is not running when, we look in the VM
  • this was setup with ceph-iscsi tools
  • ESX is connected to the gateway and is sending ABORT_TASK after something like 5s and later start sending Login requests
  • System require a reset to restart

From what I see, I agree with your first idea that tcmu crashed and never complete the command.

Could you give instruction to get the core dump of tcmu when the problem happen ? What can we provide to help ?

manens avatar Oct 09 '18 22:10 manens