ceph-nvmeof
Failover failed in a 4-gateway configuration with ceph orch daemon stop
Failing over 1 gateway with the ceph orch daemon stop command failed in a 4-gateway configuration: fio gets stuck, and after a while the nvme disks disappear from the client node.
Details: state before failover
[root@ceph-mytest-578wbg-node8 ~]# ceph nvme-gw show nvmeof ''
{
"pool": "nvmeof",
"group": "",
"num gws": 4,
"Anagrp list": "[ 4 1 2 3 ]"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node4.xpqheq",
"anagrp-id": 4,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: ACTIVE , 1: STANDBY , 2: STANDBY , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node5.mrlsib",
"anagrp-id": 1,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: ACTIVE , 2: STANDBY , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node6.gyebay",
"anagrp-id": 2,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: STANDBY , 2: ACTIVE , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw",
"anagrp-id": 3,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: STANDBY , 2: STANDBY , 3: ACTIVE ,"
}
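For reference, the ACTIVE/STANDBY layout above can be cross-checked mechanically. Below is a minimal sketch (a hypothetical helper, not part of the ceph CLI) that parses the stream of concatenated JSON objects emitted by `ceph nvme-gw show` and maps each ANA group to the gateway that reports it ACTIVE:

```python
import json
import re

def active_gateways(show_output: str) -> dict:
    """Map each ANA group id to the gw-id that reports it ACTIVE.

    `ceph nvme-gw show` prints several JSON objects back to back, so
    split them with a raw decoder, then scan each gateway's
    "ana states" string for ACTIVE entries.
    """
    decoder = json.JSONDecoder()
    objs, idx = [], 0
    text = show_output.strip()
    while idx < len(text):
        obj, end = decoder.raw_decode(text, idx)
        objs.append(obj)
        # skip whitespace between the concatenated objects
        while end < len(text) and text[end].isspace():
            end += 1
        idx = end
    owners = {}
    for obj in objs:
        if "gw-id" not in obj:
            continue  # the first object is the pool/group header
        for grp, state in re.findall(r"(\d+):\s*(\w+)", obj["ana states"]):
            if state == "ACTIVE":
                owners[int(grp)] = obj["gw-id"]
    return owners
```

In the healthy state above, each of the four groups maps to a distinct gateway; after a failover, one surviving gateway should own two groups.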
GW1
[root@ceph-mytest-578wbg-node4 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.210.37",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "inaccessible"
},
{
"ana_group": 2,
"ana_state": "inaccessible"
},
{
"ana_group": 3,
"ana_state": "inaccessible"
},
{
"ana_group": 4,
"ana_state": "optimized"
GW2
[root@ceph-mytest-578wbg-node5 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.208.71",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "optimized"
},
{
"ana_group": 2,
"ana_state": "inaccessible"
},
{
"ana_group": 3,
"ana_state": "inaccessible"
},
{
"ana_group": 4,
"ana_state": "inaccessible"
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
GW3
[root@ceph-mytest-578wbg-node6 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.208.171",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "inaccessible"
},
{
"ana_group": 2,
"ana_state": "optimized"
},
{
"ana_group": 3,
"ana_state": "inaccessible"
},
{
"ana_group": 4,
"ana_state": "inaccessible"
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
GW4
[root@ceph-mytest-578wbg-node7 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.211.128",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "inaccessible"
},
{
"ana_group": 2,
"ana_state": "inaccessible"
},
{
"ana_group": 3,
"ana_state": "optimized"
},
{
"ana_group": 4,
"ana_state": "inaccessible"
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
[root@ceph-mytest-578wbg-node8 ~]# nvme list-subsys /dev/nvme9n1
nvme-subsys9 - NQN=nqn.2016-06.io.spdk:cnode3
\
+- nvme10 tcp traddr=10.0.208.71,trsvcid=4420,src_addr=10.0.209.227 live inaccessible
+- nvme11 tcp traddr=10.0.208.171,trsvcid=4420,src_addr=10.0.209.227 live inaccessible
+- nvme12 tcp traddr=10.0.211.128,trsvcid=4420,src_addr=10.0.209.227 live optimized
+- nvme9 tcp traddr=10.0.210.37,trsvcid=4420,src_addr=10.0.209.227 live inaccessible
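The multipath view above should show exactly one optimized path per subsystem when ANA is healthy. A small sketch of that check, parsing the `nvme list-subsys` text output (hypothetical helper; the exact output format depends on the nvme-cli version):

```python
import re

def path_states(list_subsys_output: str) -> dict:
    """Parse `nvme list-subsys <dev>` output into {controller: ana_state}.

    Each path line ends with e.g. 'live optimized' or 'live inaccessible'.
    """
    states = {}
    for line in list_subsys_output.splitlines():
        m = re.search(r"\+- (\S+) tcp .* live (\w+)", line)
        if m:
            states[m.group(1)] = m.group(2)
    return states

def healthy(states: dict) -> bool:
    # with ANA, exactly one path per subsystem should be optimized
    return list(states.values()).count("optimized") == 1
```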
IOs are running on all disks exposed through NVMe from all 4 gateways. IOs start executing on GW4 for the corresponding subsystem and namespace.
[root@ceph-mytest-578wbg-node7 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1 --server-address 10.0.211.128 --server-port 5500 namespace get_io_stats -n nqn.2016-06.io.spdk:cnode3 --nsid 1
IO statistics for namespace 1 in nqn.2016-06.io.spdk:cnode3, bdev bdev_42bf0041-2132-449f-8575-2c33200524e3:
╒═════════════════════════╤══════════════════╕
│ Stat │ Value │
╞═════════════════════════╪══════════════════╡
│ Tick Rate │ 2190000000 │
├─────────────────────────┼──────────────────┤
│ Ticks │ 1267966440193871 │
├─────────────────────────┼──────────────────┤
│ Bytes Read │ 7873536 │
├─────────────────────────┼──────────────────┤
│ Num Read Ops │ 641 │
├─────────────────────────┼──────────────────┤
│ Bytes Written │ 10977280 │
├─────────────────────────┼──────────────────┤
│ Num Write Ops │ 443 │
├─────────────────────────┼──────────────────┤
│ Bytes Unmapped │ 536870916096 │
├─────────────────────────┼──────────────────┤
│ Num Unmap Ops │ 251 │
├─────────────────────────┼──────────────────┤
│ Read Latency Ticks │ 2453266074 │
├─────────────────────────┼──────────────────┤
│ Max Read Latency Ticks │ 18026596 │
├─────────────────────────┼──────────────────┤
│ Min Read Latency Ticks │ 45078 │
├─────────────────────────┼──────────────────┤
│ Write Latency Ticks │ 1539357786 │
├─────────────────────────┼──────────────────┤
│ Max Write Latency Ticks │ 9251540 │
├─────────────────────────┼──────────────────┤
│ Min Write Latency Ticks │ 146976 │
├─────────────────────────┼──────────────────┤
│ Unmap Latency Ticks │ 1744197656 │
├─────────────────────────┼──────────────────┤
│ Max Unmap Latency Ticks │ 13853052 │
├─────────────────────────┼──────────────────┤
│ Min Unmap Latency Ticks │ 227046 │
├─────────────────────────┼──────────────────┤
│ Copy Latency Ticks │ 0 │
├─────────────────────────┼──────────────────┤
│ Max Copy Latency Ticks │ 0 │
├─────────────────────────┼──────────────────┤
│ Min Copy Latency Ticks │ 0 │
├─────────────────────────┼──────────────────┤
│ IO Error │ [] │
╘═════════════════════════╧══════════════════╛
[root@ceph-mytest-578wbg-node8 ~]# rbd perf image iostat nvme_test
rbd: waiting for initial image stats
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_2 154/s 0/s 279 MiB/s 0 B/s 30.57 ms 0.00 ns
image_11 1/s 0/s 55 KiB/s 0 B/s 19.83 ms 0.00 ns
image_7 1/s 0/s 54 KiB/s 0 B/s 12.25 ms 0.00 ns
image_6 1/s 0/s 4.8 KiB/s 0 B/s 13.83 ms 0.00 ns
image_8 1/s 0/s 4 KiB/s 0 B/s 14.87 ms 0.00 ns
image_9 0/s 0/s 3.2 KiB/s 0 B/s 6.42 ms 0.00 ns
image_10 0/s 0/s 2.4 KiB/s 0 B/s 11.29 ms 0.00 ns
image_1 0/s 0/s 1.6 KiB/s 0 B/s 4.31 ms 0.00 ns
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_2 151/s 0/s 279 MiB/s 0 B/s 29.48 ms 0.00 ns
image_6 1/s 0/s 54 KiB/s 0 B/s 13.15 ms 0.00 ns
image_9 1/s 0/s 30 KiB/s 0 B/s 29.43 ms 0.00 ns
image_8 1/s 0/s 3.2 KiB/s 0 B/s 17.59 ms 0.00 ns
image_11 1/s 0/s 3.2 KiB/s 0 B/s 27.63 ms 0.00 ns
image_7 1/s 0/s 4.8 KiB/s 0 B/s 14.59 ms 0.00 ns
image_10 0/s 0/s 2.4 KiB/s 0 B/s 3.48 ms 0.00 ns
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_2 154/s 0/s 283 MiB/s 0 B/s 31.99 ms 0.00 ns
image_9 1/s 0/s 29 KiB/s 0 B/s 15.56 ms 0.00 ns
image_11 1/s 0/s 28 KiB/s 0 B/s 17.19 ms 0.00 ns
image_6 1/s 0/s 29 KiB/s 0 B/s 23.08 ms 0.00 ns
image_7 1/s 0/s 29 KiB/s 0 B/s 25.07 ms 0.00 ns
image_8 0/s 0/s 2.4 KiB/s 0 B/s 10.05 ms 0.00 ns
image_10 0/s 0/s 28 KiB/s 0 B/s 11.89 ms 0.00 ns
Failover failed. Performed failover of GW4 using the ceph orch daemon stop command:
[root@ceph-mytest-578wbg-node8 ~]# ceph orch daemon stop nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw
Scheduled to stop nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw on host 'ceph-mytest-578wbg-node7'
[root@ceph-mytest-578wbg-node8 ~]# ceph orch ps --daemon-type nvmeof
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
nvmeof.nvmeof.ceph-mytest-578wbg-node4.xpqheq ceph-mytest-578wbg-node4 *:5500,4420,8009 running (21h) 5m ago 2d 188M - b09894a2fc25 2cf13c8b0fcd
nvmeof.nvmeof.ceph-mytest-578wbg-node5.mrlsib ceph-mytest-578wbg-node5 *:5500,4420,8009 running (47h) 5m ago 2d 264M - b09894a2fc25 c8cceabe0e64
nvmeof.nvmeof.ceph-mytest-578wbg-node6.gyebay ceph-mytest-578wbg-node6 *:5500,4420,8009 running (23h) 5m ago 2d 173M - b09894a2fc25 0b6891ad1212
nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw ceph-mytest-578wbg-node7 *:5500,4420,8009 error 2s ago 2d - - <unknown> <unknown> <unknown>
ANA group 3 is now picked up by GW1:
[root@ceph-mytest-578wbg-node8 ~]# ceph nvme-gw show nvmeof ''
{
"pool": "nvmeof",
"group": "",
"num gws": 4,
"Anagrp list": "[ 4 1 2 3 ]"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node4.xpqheq",
"anagrp-id": 4,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: ACTIVE , 1: STANDBY , 2: STANDBY , 3: ACTIVE ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node5.mrlsib",
"anagrp-id": 1,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: ACTIVE , 2: STANDBY , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node6.gyebay",
"anagrp-id": 2,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: STANDBY , 2: ACTIVE , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw",
"anagrp-id": 3,
"last-gw_map-epoch-valid": 1,
"Availability": "UNAVAILABLE",
"ana states": " 4: STANDBY , 1: STANDBY , 2: STANDBY , 3: STANDBY ,"
}
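The expected takeover shown above (GW1 now ACTIVE for its own group 4 plus the stopped gateway's group 3) can be stated as a small invariant over the before/after ownership maps. A sketch with hypothetical names, not gateway code:

```python
def check_takeover(before: dict, after: dict, failed_group: int) -> bool:
    """Verify failover of one ANA group.

    `before` and `after` map ANA group id -> gw-id that is ACTIVE for it.
    Every group must still have an ACTIVE owner after failover, and the
    failed gateway's group must have moved to a different gateway.
    """
    if set(after) != set(before):
        return False  # some group lost its ACTIVE owner entirely
    return after[failed_group] != before[failed_group]
```

In this report the ownership map is correct after the stop (group 3 moved to GW1), so the cluster-side state machine behaved as expected; the failure is that IO never resumes on the new owner.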
GW1
[root@ceph-mytest-578wbg-node4 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.210.37",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "inaccessible"
},
{
"ana_group": 2,
"ana_state": "inaccessible"
},
{
"ana_group": 3,
"ana_state": "optimized"
},
{
"ana_group": 4,
"ana_state": "optimized"
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
However, IOs get stuck for more than 3 hours and are not served by GW1:
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_10 0/s 1/s 0 B/s 54 KiB/s 0.00 ns 163.79 us
image_11 0/s 1/s 0 B/s 54 KiB/s 0.00 ns 191.56 us
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_8 0/s 1/s 0 B/s 54 KiB/s 0.00 ns 175.21 us
image_7 0/s 0/s 0 B/s 819 B/s 0.00 ns 201.62 us
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_7 0/s 1/s 0 B/s 54 KiB/s 0.00 ns 51.91 us
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
Also, eventually both the disks originally served by gateway 1 (the /dev/nvme13n series) and the disks it picked up after gateway 4's failure (the /dev/nvme9n series) disappear from the client:
[root@ceph-mytest-578wbg-node8 ~]# nvme list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme5n3 /dev/ng5n3 2 Ceph bdev Controller 0x3 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme5n2 /dev/ng5n2 2 Ceph bdev Controller 0x2 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme5n1 /dev/ng5n1 2 Ceph bdev Controller 0x1 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme1n3 /dev/ng1n3 1 Ceph bdev Controller 0x3 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme1n2 /dev/ng1n2 1 Ceph bdev Controller 0x2 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme1n1 /dev/ng1n1 1 Ceph bdev Controller 0x1 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
@manasagowri please provide the following:
- Which build are you using? Which images for the GW and for Ceph?
- What is the initiator host OS?
- Did you reboot the host before running the test?
- Please provide the output of "dmesg -T" from the host.
- Please provide "ceph-mon*" logs from all of the nodes.
- Please provide the gw logs.
@caroav As discussed yesterday, I retried the test with the latest downstream build on RHEL 9.3 and the issue is not seen. I am able to fail over and fail back successfully. We can close this issue for now; if I see the bug again I will open a new one.