mayastor
mayastor copied to clipboard
1 node down in a 3 node system caused many volumes to be Degraded/Faulted without a Target node
Mayastor 2.5.1 on a 3 node system, take 1 node down and many volumes are in a rebuilding then faulting, and trying again.
mayastor-io-engine-xs9sk_mayastor_io-engine-dd23276f2aaa3971af97201157038ba42b9e57962f307d92e693116527acd457.log:2024-08-07T18:19:38.950624012-07:00 stdout F [2024-08-08T01:19:38.950553630+00:00 ERROR io_engine::rebuild::rebuild_descriptor:rebuild_descriptor.rs:85] 10.245.250.69:8420/nqn.2019-05.io.openebs:b8d8cb2e-0160-4015-931a-53c3b125447cn1: failed to get I/O handle: failed to get IO channel for 10.245.250.69:8420/nqn.2019-05.io.openebs:b8d8cb2e-0160-4015-931a-53c3b125447cn1
Mayastor pods are running:
NAME READY STATUS RESTARTS AGE
mayastor-agent-core-5859d9898c-hfjrd 2/2 Running 8 (134m ago) 4h7m
mayastor-agent-ha-node-hmnn9 1/1 Running 8 (136m ago) 39h
mayastor-agent-ha-node-zjm5c 1/1 Running 0 3h1m
mayastor-api-rest-bb7656dd7-26ckt 1/1 Running 0 177m
mayastor-api-rest-bb7656dd7-857cr 1/1 Running 1 (136m ago) 6h19m
mayastor-api-rest-bb7656dd7-rdl5m 1/1 Running 0 177m
mayastor-csi-controller-56bb85db5f-bmd9k 5/5 Running 5 (136m ago) 4h1m
mayastor-csi-node-gqknk 2/2 Running 16 (136m ago) 39h
mayastor-csi-node-p8l7n 2/2 Running 0 3h1m
mayastor-etcd-0 1/1 Running 1 (135m ago) 4h
mayastor-etcd-1 1/1 Running 3 (136m ago) 39h
mayastor-etcd-2 0/1 Pending 0 170m
mayastor-io-engine-tp6cm 2/2 Running 0 3h1m
mayastor-io-engine-xs9sk 2/2 Running 8 (48m ago) 39h
mayastor-localpv-provisioner-6fd649f5fb-bcn8g 1/1 Running 2 (136m ago) 171m
mayastor-nats-0 3/3 Running 3 (136m ago) 4h
mayastor-nats-1 3/3 Running 0 170m
mayastor-nats-2 3/3 Running 0 170m
mayastor-operator-diskpool-c6b78cbdb-bqrqx 1/1 Running 0 171m
IO-engine pod is logging:
[2024-08-08T01:16:25.047340423+00:00 ERROR io_engine::bdev::nvmx::handle:handle.rs:388] I/O completed with PI error
[2024-08-08T01:16:25.047343437+00:00 ERROR io_engine::bdev::nvmx::handle:handle.rs:388] I/O completed with PI error
[2024-08-08T01:16:25.585879277+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -125, trtype:TCP adrfam:IPv4 traddr:10.245.250.71 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986
[2024-08-08T01:16:25.585921213+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command
[2024-08-08T01:16:25.585926825+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x5582015e9850
[2024-08-08T01:16:25.585970057+00:00 ERROR mayastor::spdk:nvme_qpair.c:797] CQ transport error -6 (No such device or address) on qpair id 3
[2024-08-08T01:16:25.585976264+00:00 ERROR io_engine::bdev::nvmx::qpair:qpair.rs:473] I/O qpair async connection polling error: UnknownErrno: Unknown errno self=QPair { qpair: 0x5582015e9850, ctrlr: 0x558201617f80, ctrlr_name: "10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1", state: Connecting, waiters: 15 }
[2024-08-08T01:16:25.585990227+00:00 ERROR io_engine::bdev::nvmx::qpair:qpair.rs:483] I/O qpair async connection failed self=QPair { qpair: 0x5582015e9850, ctrlr: 0x558201617f80, ctrlr_name: "10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1", state: Connecting, waiters: 15 } err=OpenBdev { source: UnknownErrno }
[2024-08-08T01:16:25.586064498+00:00 ERROR io_engine::rebuild::rebuild_descriptor:rebuild_descriptor.rs:85] 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1: failed to get I/O handle: failed to open bdev
[2024-08-08T01:16:25.586085241+00:00 ERROR io_engine::rebuild::rebuild_descriptor:rebuild_descriptor.rs:85] 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1: failed to get I/O handle: failed to open bdev
[2024-08-08T01:16:25.586346683+00:00 ERROR io_engine::rebuild::rebuild_job_backend:rebuild_job_backend.rs:368] Rebuild job #705 (running) 'nvmf://10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986?uuid=16b7a8d2-e89e-4e31-9e8f-571a3726a986' -> 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e' on nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6': failed to rebuild segment id=14 block=8969216 with error: Failed to get a handle for bdev 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1
[2024-08-08T01:16:25.586468900+00:00 INFO io_engine::rebuild::rebuild_job_backend:rebuild_job_backend.rs:426] Rebuild job #705 (failed: done) 'nvmf://10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986?uuid=16b7a8d2-e89e-4e31-9e8f-571a3726a986' -> 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e' on nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6': changing state from Running to Failed; current stats: RebuildStats { blocks_total: 629135326, blocks_recovered: 8958976, blocks_transferred: 333824, blocks_remaining: 620176350, progress: 1, blocks_per_task: 128, block_size: 512, tasks_total: 16, tasks_active: 0, start_time: 2024-08-08T01:15:17.756714484Z, is_partial: false }
[2024-08-08T01:16:25.586486051+00:00 INFO io_engine::rebuild::rebuild_job_backend:rebuild_job_backend.rs:606] Rebuild job #705 (failed: done) 'nvmf://10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986?uuid=16b7a8d2-e89e-4e31-9e8f-571a3726a986' -> 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e' on nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6': backend dropped; final stats: RebuildStats { blocks_total: 629135326, blocks_recovered: 8958976, blocks_transferred: 333824, blocks_remaining: 620176350, progress: 1, blocks_per_task: 128, block_size: 512, tasks_total: 16, tasks_active: 0, start_time: 2024-08-08T01:15:17.756714484Z, is_partial: false }
[2024-08-08T01:16:25.586728047+00:00 ERROR io_engine::bdev::nexus::nexus_bdev_rebuild:nexus_bdev_rebuild.rs:472] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open out-of-sync; rebuilding]: rebuild job failed with error: Failed to get a handle for bdev 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1: failed to open bdev: UnknownErrno: Unknown errno
[2024-08-08T01:16:25.591830223+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1049] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open out-of-sync; rebuilding]: closing child...
[2024-08-08T01:16:25.591843753+00:00 ERROR io_engine::core::descriptor:descriptor.rs:53] Bdev '3d646500-3214-4b3b-9849-139b3648e56e' is not claimed by this module 'NEXUS_CAS_MODULE'
[2024-08-08T01:16:25.591848712+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1072] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open (destroying) out-of-sync; rebuilding]: destroying block device...
[2024-08-08T01:16:25.591874931+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1075] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open (destroying) out-of-sync; rebuilding]: block device destroyed, waiting for removal...
[2024-08-08T01:16:25.591958581+00:00 INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="956e0c1b-59da-46ea-aa74-4f04bc4e75f6" child_device="3d646500-3214-4b3b-9849-139b3648e56e"
[2024-08-08T01:16:25.591966411+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1103] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open (destroying) out-of-sync; rebuilding]: unplugging child...
[2024-08-08T01:16:25.591974582+00:00 INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:632] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels...
[2024-08-08T01:16:25.592098633+00:00 INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:655] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels completed with result: Ok
[2024-08-08T01:16:25.592145370+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1147] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [closed (destroying) out-of-sync]: child successfully unplugged
[2024-08-08T01:16:25.592204944+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1086] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [closed out-of-sync]: child closed successfully
[2024-08-08T01:16:25.592215300+00:00 INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:632] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: rebuild, reconfiguring I/O channels...
[2024-08-08T01:16:25.592330041+00:00 INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:655] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: rebuild, reconfiguring I/O channels completed with result: Ok
[2024-08-08T01:16:29.631726005+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:4380] [nqn.2019-05.io.openebs:a2685ffc-437b-4286-8aa7-ab6bbd64e686] Submitting Keep Alive failed
Attached are the support bundle from the two running nodes: mayastor-2024-08-08--00-35-44-UTC.tar.gz mayastor-2024-08-08--00-34-37-UTC.tar.gz
What is the value of repl in the storage class? Did the node come back online?
Seems you have hit this bug: https://github.com/openebs/mayastor/issues/1710
[2024-08-08T01:16:25.585879277+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -125, trtype:TCP adrfam:IPv4 traddr:10.245.250.71 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986 [2024-08-08T01:16:25.585921213+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command [2024-08-08T01:16:25.585926825+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x5582015e9850
We have a fix on https://github.com/openebs/mayastor/pull/1711
hmm no logs are included in the bundle @veenadong, which plugin version did you use to create the bundle?
hmm no logs are included in the bundle @veenadong, which plugin version did you use to create the bundle?
We are running 2.5.1 and the plugin version is:
Kubectl Plugin (kubectl-mayastor) revision e8f685e3c9e1 (v2.5.0+0)
I see the fix #1711 will be merged into 2.7? What is the ETA for this fix to be available for GA?
Plan is to release in a couple of weeks along with a control-plane previously missed functionality for updating pool labels after creation.