ceph-nvmeof
Failover failed in a 4-gateway configuration with ceph orch daemon stop
Failing over 1 gateway with the ceph orch daemon stop command failed in a 4-gateway configuration: fio gets stuck, and after a while the nvme disks disappear from the client node.
Details: state before failover
[root@ceph-mytest-578wbg-node8 ~]# ceph nvme-gw show nvmeof ''
{
"pool": "nvmeof",
"group": "",
"num gws": 4,
"Anagrp list": "[ 4 1 2 3 ]"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node4.xpqheq",
"anagrp-id": 4,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: ACTIVE , 1: STANDBY , 2: STANDBY , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node5.mrlsib",
"anagrp-id": 1,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: ACTIVE , 2: STANDBY , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node6.gyebay",
"anagrp-id": 2,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: STANDBY , 2: ACTIVE , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw",
"anagrp-id": 3,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: STANDBY , 2: STANDBY , 3: ACTIVE ,"
}
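For reference, the ACTIVE/STANDBY layout above can be cross-checked mechanically. Below is a minimal sketch (a hypothetical helper, not part of the ceph CLI) that parses the stream of concatenated JSON objects emitted by `ceph nvme-gw show` and maps each ANA group to the gateway that reports it ACTIVE:

```python
import json
import re

def active_gateways(show_output: str) -> dict:
    """Map each ANA group id to the gw-id that reports it ACTIVE.

    `ceph nvme-gw show` prints several JSON objects back to back, so
    split them with a raw decoder, then scan each gateway's
    "ana states" string for ACTIVE entries.
    """
    decoder = json.JSONDecoder()
    objs, idx = [], 0
    text = show_output.strip()
    while idx < len(text):
        obj, end = decoder.raw_decode(text, idx)
        objs.append(obj)
        # skip whitespace between the concatenated objects
        while end < len(text) and text[end].isspace():
            end += 1
        idx = end
    owners = {}
    for obj in objs:
        if "gw-id" not in obj:
            continue  # the first object is the pool/group header
        for grp, state in re.findall(r"(\d+):\s*(\w+)", obj["ana states"]):
            if state == "ACTIVE":
                owners[int(grp)] = obj["gw-id"]
    return owners
```

In the healthy state above, each of the four groups maps to a distinct gateway; after a failover, one surviving gateway should own two groups.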
GW1
[root@ceph-mytest-578wbg-node4 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.210.37",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "inaccessible"
},
{
"ana_group": 2,
"ana_state": "inaccessible"
},
{
"ana_group": 3,
"ana_state": "inaccessible"
},
{
"ana_group": 4,
"ana_state": "optimized"
GW2
[root@ceph-mytest-578wbg-node5 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.208.71",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "optimized"
},
{
"ana_group": 2,
"ana_state": "inaccessible"
},
{
"ana_group": 3,
"ana_state": "inaccessible"
},
{
"ana_group": 4,
"ana_state": "inaccessible"
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
GW3
[root@ceph-mytest-578wbg-node6 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.208.171",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "inaccessible"
},
{
"ana_group": 2,
"ana_state": "optimized"
},
{
"ana_group": 3,
"ana_state": "inaccessible"
},
{
"ana_group": 4,
"ana_state": "inaccessible"
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
GW4
[root@ceph-mytest-578wbg-node7 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.211.128",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "inaccessible"
},
{
"ana_group": 2,
"ana_state": "inaccessible"
},
{
"ana_group": 3,
"ana_state": "optimized"
},
{
"ana_group": 4,
"ana_state": "inaccessible"
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
[root@ceph-mytest-578wbg-node8 ~]# nvme list-subsys /dev/nvme9n1
nvme-subsys9 - NQN=nqn.2016-06.io.spdk:cnode3
\
+- nvme10 tcp traddr=10.0.208.71,trsvcid=4420,src_addr=10.0.209.227 live inaccessible
+- nvme11 tcp traddr=10.0.208.171,trsvcid=4420,src_addr=10.0.209.227 live inaccessible
+- nvme12 tcp traddr=10.0.211.128,trsvcid=4420,src_addr=10.0.209.227 live optimized
+- nvme9 tcp traddr=10.0.210.37,trsvcid=4420,src_addr=10.0.209.227 live inaccessible
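The multipath view above should show exactly one optimized path per subsystem when ANA is healthy. A small sketch of that check, parsing the `nvme list-subsys` text output (hypothetical helper; the exact output format depends on the nvme-cli version):

```python
import re

def path_states(list_subsys_output: str) -> dict:
    """Parse `nvme list-subsys <dev>` output into {controller: ana_state}.

    Each path line ends with e.g. 'live optimized' or 'live inaccessible'.
    """
    states = {}
    for line in list_subsys_output.splitlines():
        m = re.search(r"\+- (\S+) tcp .* live (\w+)", line)
        if m:
            states[m.group(1)] = m.group(2)
    return states

def healthy(states: dict) -> bool:
    # with ANA, exactly one path per subsystem should be optimized
    return list(states.values()).count("optimized") == 1
```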
IOs are running on all disks exposed through NVMe from all 4 gateways. IOs start executing on GW4 for the corresponding subsystem and namespace.
[root@ceph-mytest-578wbg-node7 ~]# podman run cp.stg.icr.io/cp/ibm-ceph/nvmeof-cli-rhel9:1.2.0-1 --server-address 10.0.211.128 --server-port 5500 namespace get_io_stats -n nqn.2016-06.io.spdk:cnode3 --nsid 1
IO statistics for namespace 1 in nqn.2016-06.io.spdk:cnode3, bdev bdev_42bf0041-2132-449f-8575-2c33200524e3:
╒═════════════════════════╤══════════════════╕
│ Stat │ Value │
╞═════════════════════════╪══════════════════╡
│ Tick Rate │ 2190000000 │
├─────────────────────────┼──────────────────┤
│ Ticks │ 1267966440193871 │
├─────────────────────────┼──────────────────┤
│ Bytes Read │ 7873536 │
├─────────────────────────┼──────────────────┤
│ Num Read Ops │ 641 │
├─────────────────────────┼──────────────────┤
│ Bytes Written │ 10977280 │
├─────────────────────────┼──────────────────┤
│ Num Write Ops │ 443 │
├─────────────────────────┼──────────────────┤
│ Bytes Unmapped │ 536870916096 │
├─────────────────────────┼──────────────────┤
│ Num Unmap Ops │ 251 │
├─────────────────────────┼──────────────────┤
│ Read Latency Ticks │ 2453266074 │
├─────────────────────────┼──────────────────┤
│ Max Read Latency Ticks │ 18026596 │
├─────────────────────────┼──────────────────┤
│ Min Read Latency Ticks │ 45078 │
├─────────────────────────┼──────────────────┤
│ Write Latency Ticks │ 1539357786 │
├─────────────────────────┼──────────────────┤
│ Max Write Latency Ticks │ 9251540 │
├─────────────────────────┼──────────────────┤
│ Min Write Latency Ticks │ 146976 │
├─────────────────────────┼──────────────────┤
│ Unmap Latency Ticks │ 1744197656 │
├─────────────────────────┼──────────────────┤
│ Max Unmap Latency Ticks │ 13853052 │
├─────────────────────────┼──────────────────┤
│ Min Unmap Latency Ticks │ 227046 │
├─────────────────────────┼──────────────────┤
│ Copy Latency Ticks │ 0 │
├─────────────────────────┼──────────────────┤
│ Max Copy Latency Ticks │ 0 │
├─────────────────────────┼──────────────────┤
│ Min Copy Latency Ticks │ 0 │
├─────────────────────────┼──────────────────┤
│ IO Error │ [] │
╘═════════════════════════╧══════════════════╛
[root@ceph-mytest-578wbg-node8 ~]# rbd perf image iostat nvme_test
rbd: waiting for initial image stats
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_2 154/s 0/s 279 MiB/s 0 B/s 30.57 ms 0.00 ns
image_11 1/s 0/s 55 KiB/s 0 B/s 19.83 ms 0.00 ns
image_7 1/s 0/s 54 KiB/s 0 B/s 12.25 ms 0.00 ns
image_6 1/s 0/s 4.8 KiB/s 0 B/s 13.83 ms 0.00 ns
image_8 1/s 0/s 4 KiB/s 0 B/s 14.87 ms 0.00 ns
image_9 0/s 0/s 3.2 KiB/s 0 B/s 6.42 ms 0.00 ns
image_10 0/s 0/s 2.4 KiB/s 0 B/s 11.29 ms 0.00 ns
image_1 0/s 0/s 1.6 KiB/s 0 B/s 4.31 ms 0.00 ns
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_2 151/s 0/s 279 MiB/s 0 B/s 29.48 ms 0.00 ns
image_6 1/s 0/s 54 KiB/s 0 B/s 13.15 ms 0.00 ns
image_9 1/s 0/s 30 KiB/s 0 B/s 29.43 ms 0.00 ns
image_8 1/s 0/s 3.2 KiB/s 0 B/s 17.59 ms 0.00 ns
image_11 1/s 0/s 3.2 KiB/s 0 B/s 27.63 ms 0.00 ns
image_7 1/s 0/s 4.8 KiB/s 0 B/s 14.59 ms 0.00 ns
image_10 0/s 0/s 2.4 KiB/s 0 B/s 3.48 ms 0.00 ns
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_2 154/s 0/s 283 MiB/s 0 B/s 31.99 ms 0.00 ns
image_9 1/s 0/s 29 KiB/s 0 B/s 15.56 ms 0.00 ns
image_11 1/s 0/s 28 KiB/s 0 B/s 17.19 ms 0.00 ns
image_6 1/s 0/s 29 KiB/s 0 B/s 23.08 ms 0.00 ns
image_7 1/s 0/s 29 KiB/s 0 B/s 25.07 ms 0.00 ns
image_8 0/s 0/s 2.4 KiB/s 0 B/s 10.05 ms 0.00 ns
image_10 0/s 0/s 28 KiB/s 0 B/s 11.89 ms 0.00 ns
Failover failed. Performed failover of GW4 using the ceph orch daemon stop command:
[root@ceph-mytest-578wbg-node8 ~]# ceph orch daemon stop nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw
Scheduled to stop nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw on host 'ceph-mytest-578wbg-node7'
[root@ceph-mytest-578wbg-node8 ~]# ceph orch ps --daemon-type nvmeof
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
nvmeof.nvmeof.ceph-mytest-578wbg-node4.xpqheq ceph-mytest-578wbg-node4 *:5500,4420,8009 running (21h) 5m ago 2d 188M - b09894a2fc25 2cf13c8b0fcd
nvmeof.nvmeof.ceph-mytest-578wbg-node5.mrlsib ceph-mytest-578wbg-node5 *:5500,4420,8009 running (47h) 5m ago 2d 264M - b09894a2fc25 c8cceabe0e64
nvmeof.nvmeof.ceph-mytest-578wbg-node6.gyebay ceph-mytest-578wbg-node6 *:5500,4420,8009 running (23h) 5m ago 2d 173M - b09894a2fc25 0b6891ad1212
nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw ceph-mytest-578wbg-node7 *:5500,4420,8009 error 2s ago 2d - - <unknown> <unknown> <unknown>
ANA group 3 is now picked up by GW1:
[root@ceph-mytest-578wbg-node8 ~]# ceph nvme-gw show nvmeof ''
{
"pool": "nvmeof",
"group": "",
"num gws": 4,
"Anagrp list": "[ 4 1 2 3 ]"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node4.xpqheq",
"anagrp-id": 4,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: ACTIVE , 1: STANDBY , 2: STANDBY , 3: ACTIVE ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node5.mrlsib",
"anagrp-id": 1,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: ACTIVE , 2: STANDBY , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node6.gyebay",
"anagrp-id": 2,
"last-gw_map-epoch-valid": 1,
"Availability": "AVAILABLE",
"ana states": " 4: STANDBY , 1: STANDBY , 2: ACTIVE , 3: STANDBY ,"
}
{
"gw-id": "client.nvmeof.nvmeof.ceph-mytest-578wbg-node7.kfsrzw",
"anagrp-id": 3,
"last-gw_map-epoch-valid": 1,
"Availability": "UNAVAILABLE",
"ana states": " 4: STANDBY , 1: STANDBY , 2: STANDBY , 3: STANDBY ,"
}
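The expected takeover shown above (GW1 now ACTIVE for its own group 4 plus the stopped gateway's group 3) can be stated as a small invariant over the before/after ownership maps. A sketch with hypothetical names, not gateway code:

```python
def check_takeover(before: dict, after: dict, failed_group: int) -> bool:
    """Verify failover of one ANA group.

    `before` and `after` map ANA group id -> gw-id that is ACTIVE for it.
    Every group must still have an ACTIVE owner after failover, and the
    failed gateway's group must have moved to a different gateway.
    """
    if set(after) != set(before):
        return False  # some group lost its ACTIVE owner entirely
    return after[failed_group] != before[failed_group]
```

In this report the ownership map is correct after the stop (group 3 moved to GW1), so the cluster-side state machine behaved as expected; the failure is that IO never resumes on the new owner.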
GW1
[root@ceph-mytest-578wbg-node4 app]# /usr/libexec/spdk/scripts/rpc.py nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode3 | head -n 24
[
{
"address": {
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.210.37",
"trsvcid": "4420"
},
"ana_states": [
{
"ana_group": 1,
"ana_state": "inaccessible"
},
{
"ana_group": 2,
"ana_state": "inaccessible"
},
{
"ana_group": 3,
"ana_state": "optimized"
},
{
"ana_group": 4,
"ana_state": "optimized"
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
However, IOs get stuck for more than 3 hours and are not served by GW1:
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_10 0/s 1/s 0 B/s 54 KiB/s 0.00 ns 163.79 us
image_11 0/s 1/s 0 B/s 54 KiB/s 0.00 ns 191.56 us
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_8 0/s 1/s 0 B/s 54 KiB/s 0.00 ns 175.21 us
image_7 0/s 0/s 0 B/s 819 B/s 0.00 ns 201.62 us
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
image_7 0/s 1/s 0 B/s 54 KiB/s 0.00 ns 51.91 us
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
Also, eventually both the disks originally served by gateway 1 (the /dev/nvme13n series) and the disks it picked up after gateway 4's failure (the /dev/nvme9n series) disappear from the client:
[root@ceph-mytest-578wbg-node8 ~]# nvme list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme5n3 /dev/ng5n3 2 Ceph bdev Controller 0x3 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme5n2 /dev/ng5n2 2 Ceph bdev Controller 0x2 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme5n1 /dev/ng5n1 2 Ceph bdev Controller 0x1 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme1n3 /dev/ng1n3 1 Ceph bdev Controller 0x3 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme1n2 /dev/ng1n2 1 Ceph bdev Controller 0x2 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme1n1 /dev/ng1n1 1 Ceph bdev Controller 0x1 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
@manasagowri please provide the following:
- Which build are you using? Which images for the GW and for Ceph?
- What is the initiator host OS?
- Did you reboot the host before running the test?
- Please provide the output of "dmesg -T" from the host.
- Please provide "ceph-mon*" logs from all of the nodes.
- Please provide the gw logs.
@caroav As discussed yesterday, I retried the test with the latest downstream build on RHEL 9.3 and the issue is not seen. I am able to fail over and fail back successfully. We can close this issue for now; if I see the bug again I will open a new one.