ceph-iscsi icon indicating copy to clipboard operation
ceph-iscsi copied to clipboard

rbd-target-api stuck, tcmu-runner crash / iSCSI Login negotiation failed

Open discostur opened this issue 1 year ago • 0 comments

Hi,

from time to time i see that my docker tcmu-runner container is exiting and restarting. After that, i get lots of

Sep 25 12:39:52 ceph-osd02 kernel: Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
Sep 25 12:39:52 ceph-osd02 kernel: iSCSI Login negotiation failed.
Sep 25 12:39:52 ceph-osd02 kernel: Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
Sep 25 12:39:52 ceph-osd02 kernel: iSCSI Login negotiation failed.

errors on the failed tcmu-runner node. So the tcmu-runner container has restarted, is running but clients are not able to login via iscsi. The other 3x tcmu-runner nodes are running like expected. On the failed node i'm not able to restart rbd-target-api because it seems to be stuck. The only thing i can do is to reboot the whole node.

I checked all logs and it seems to occur from a slow OSD (deep scrub was running at the moment the trouble begun):

Sep 25 12:39:42 ceph-mon01 journal: cluster 2022-09-25T12:39:41.500225+0200 osd.39 (osd.39) 4190 : cluster [WRN] 4 slow requests (by type [ 'delayed' : 4 ] most affected pool [ 'ovirt' : 4 ])
Sep 25 12:39:42 ceph-mon01 docker: cluster 2022-09-25T12:39:41.250374+0200 mgr.ceph-mon01 (mgr.40357165) 8331237 : cluster [DBG] pgmap v5468170: 2370 pgs: 1 active+clean+scrubbing, 2 active+clean+laggy, 4 active+clean+scrubbing+deep, 2363 active+clean; 14 TiB data, 45 TiB used, 110 TiB / 155 TiB avail; 2.6 MiB/s rd, 10 MiB/s wr, 883 op/s
Sep 25 12:39:42 ceph-mon01 docker: cluster 2022-09-25T12:39:41.500225+0200 osd.39 (osd.39) 4190 : cluster [WRN] 4 slow requests (by type [ 'delayed' : 4 ] most affected pool [ 'ovirt' : 4 ])

tcmu-runner logs from slow IO on osd.39 till reboot:

2022-09-25 03:46:36.558 83 [CRIT] create_file_output:485: log file path now is '/var/log/tcmu-runner/tcmu-runner.log'
2022-09-25 12:39:40.136 83 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 12:39:40.137 83 [ERROR] __tcmu_notify_conn_lost:214 rbd/ovirt.lun1: Handler connection lost (lock state 5)
2022-09-25 12:39:48.872 83 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 12:39:48.883 83 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 12:39:48.884 83 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 12:39:48.885 83 [INFO] tgt_port_grp_recovery_work_fn:246: Disabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 12:40:04.916 83 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun1: appended blocklist entry: {192.168.90.32:0/3857887088}
2022-09-25 12:40:05.048 83 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun1: removing blocklist entry: {192.168.90.32:0/3857887088}
2022-09-25 12:40:05.049 83 [INFO] tgt_port_grp_recovery_work_fn:272: Enabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 12:40:06.892 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 5
2022-09-25 12:40:06.893 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 1
2022-09-25 12:40:06.911 83 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 12:40:06.912 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.912 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.913 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.913 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.913 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.923 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.941 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.107 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.107 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.108 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.108 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.108 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.518 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.518 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.518 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 13:56:26.504 83 [CRIT] main:1425: Exiting...
2022-09-25 13:56:26.612 83 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun3: appended blocklist entry: {192.168.90.32:0/3847743111}
2022-09-25 13:56:26.635 83 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun2: appended blocklist entry: {192.168.90.32:0/1152438333}
2022-09-25 13:59:45.551 82 [CRIT] main:1302: Starting...
2022-09-25 13:59:45.593 82 [INFO] load_our_module:576: Inserted module 'target_core_user'
2022-09-25 13:59:45.594 82 [INFO] tcmur_register_handler:92: Handler fbo is registered
2022-09-25 13:59:45.596 82 [INFO] tcmur_register_handler:92: Handler zbc is registered
2022-09-25 13:59:45.616 82 [INFO] tcmur_register_handler:92: Handler glfs is registered
2022-09-25 13:59:45.616 82 [INFO] tcmur_register_handler:92: Handler qcow is registered
2022-09-25 13:59:45.676 82 [INFO] tcmur_register_handler:92: Handler rbd is registered
2022-09-25 14:00:53.917 82 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 14:00:54.643 82 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun1: Acquired exclusive lock.
2022-09-25 14:00:54.643 82 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun1: Write lock acquisition successful
2022-09-25 14:00:57.362 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 5
2022-09-25 14:00:57.362 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 1
2022-09-25 14:00:57.362 82 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 14:00:57.362 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 14:00:57.362 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 14:00:57.367 82 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun1: appended blocklist entry: {192.168.90.32:0/3722889083}
2022-09-25 14:00:57.402 82 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun1: removing blocklist entry: {192.168.90.32:0/3722889083}
2022-09-25 14:00:58.716 82 [INFO] tcmu_acquire_dev_lock:482 rbd/ovirt.lun1: Lock acquisition unsuccessful
2022-09-25 14:00:58.719 82 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 14:00:59.737 82 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun1: Acquired exclusive lock.
2022-09-25 14:00:59.737 82 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun1: Write lock acquisition successful
2022-09-25 14:01:00.784 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 5
2022-09-25 14:01:00.784 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 1
2022-09-25 14:01:00.784 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 1
2022-09-25 14:01:00.790 82 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 14:01:00.795 82 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun1: appended blocklist entry: {192.168.90.32:0/128875194}
2022-09-25 14:01:00.843 82 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun1: removing blocklist entry: {192.168.90.32:0/128875194}
2022-09-25 14:01:02.844 82 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun1: Acquired exclusive lock.
2022-09-25 14:01:02.844 82 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun1: Write lock acquisition successful
2022-09-25 14:03:01.333 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:03:01.333 82 [ERROR] __tcmu_notify_conn_lost:214 rbd/ovirt.lun1: Handler connection lost (lock state 5)
2022-09-25 14:03:02.687 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:03:05.365 82 [INFO] tgt_port_grp_recovery_work_fn:246: Disabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 14:03:05.387 82 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun1: appended blocklist entry: {192.168.90.32:0/543220842}
2022-09-25 14:03:05.448 82 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun1: removing blocklist entry: {192.168.90.32:0/543220842}
2022-09-25 14:03:05.449 82 [INFO] tgt_port_grp_recovery_work_fn:272: Enabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 14:04:56.278 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:04:56.278 82 [ERROR] __tcmu_notify_conn_lost:214 rbd/ovirt.lun1: Handler connection lost (lock state 5)
2022-09-25 14:05:07.307 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:05:11.951 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:05:18.861 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:05:18.862 82 [INFO] tgt_port_grp_recovery_work_fn:246: Disabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 14:05:19.046 82 [INFO] tgt_port_grp_recovery_work_fn:272: Enabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.

For me it looks like tcmu-runner isn't able to send IO requests to osd.39 and then it fails. After that, target-api is stuck and iscsi clients try to login but fail. Is there anything i can do against such situations?

It is not only that i have to reboot the exporter node, on ovirt / client site i have really slow IO and VMs are pausing.

I found two other issues which seem to be the same direction (maybe other issue but tcmu-runner fails and target-api is stuck):

https://github.com/ceph/ceph-iscsi-cli/issues/102 https://github.com/ceph/ceph-iscsi/issues/169

Versions:

  • ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
  • centos 7.9
  • kernel 3.10.0-1160.66.1.el7.x86_64
  • docker deployment (latest-pacific)
  • tcmu-runner 1.5.4

OSD Config:

[osd]
osd_client_watch_timeout = 15
osd_heartbeat_grace = 20
osd_heartbeat_interval = 5

Target-CLI:

  • osd_op_timeout=30

Thanks

discostur avatar Sep 26 '22 12:09 discostur