
Failed to attach volume

hsteude opened this issue 1 year ago

Describe the bug

We use Mayastor in a three-node cluster running MicroK8s. After a reboot, we noticed that one of our pods (a MinIO tenant) was unable to attach a PVC. Many other PVCs provided by Mayastor, however, are working as expected.

In the events of the MinIO pod, we observed warnings indicating failures in volume attachment and mounting due to no online replicas being available and timeouts waiting for the condition.

Events:
  Type     Reason              Age                    From                     Message
  ----     ------              ----                   ----                     -------
  Warning  FailedAttachVolume  2m40s (x690 over 23h)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-aa014c14-f7b5-4b6f-8dfc-cfdc7a812661" : rpc error: code = Internal desc = Operation failed: PreconditionFailed("error in response: status code '412 Precondition Failed', content: 'RestJsonError { details: \"\", message: \"SvcError :: NoOnlineReplicas: No online replicas are available for Volume 'aa014c14-f7b5-4b6f-8dfc-cfdc7a812661'\", kind: FailedPrecondition }'")
  Warning  FailedMount         29s (x26 over 6h5m)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[0], unattached volumes=[kube-api-access-7spmt 0 cfg-vol configuration]: timed out waiting for the condition

We also noted discrepancies in the kubectl-mayastor get pools output, where two of the three pools showed a status of 'Unknown' with zero capacity (we expected 'Online' and roughly 7TiB):

ID                     DISKS               MANAGED  NODE           STATUS   CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
pool-on-spacekisste02  aio:///dev/nvme0n1  true     spacekisste02  Online   7TiB      1.5TiB     5.4TiB     623GiB
pool-on-spacekisste01  /dev/nvme0n1        true     spacekisste01  Unknown  0 B       0 B        0 B        <none>
pool-on-spacekisste03  /dev/nvme0n1        true     spacekisste03  Unknown  0 B       0 B        0 B        <none>

Further investigation into the logs of the mayastor-io-engine pods on the affected nodes revealed errors related to failed pool imports due to an illegal byte sequence and NVMe connection issues, such as invalid data parameters and failed connection queues.

io-engine [2024-06-27T15:16:13.710438575+00:00  INFO io_engine::grpc::v1::pool:pool.rs:285] ImportPoolRequest { name: "pool-on-spacekisste01", uuid: None, disks: ["/dev/nvme0n1"], pooltype: Lvs }
io-engine [2024-06-27T15:16:13.718721274+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:306] error=EILSEQ: Illegal byte sequence, failed to import pool pool-on-spacekisste01
metrics-exporter-pool   2024-06-27T15:16:16.795647Z  INFO actix_web::middleware::logger: 139.11.222.212 "GET /metrics HTTP/1.1" 200 0 "-" "Prometheus/2.47.0" 0.000118
metrics-exporter-pool     at /build/cargo-vendor-dir/actix-web-4.3.1/src/middleware/logger.rs:421

Additionally, dmesg logs on these nodes displayed NVMe connection issues such as invalid data parameters and failed connection queues:

[23161.652647] nvme nvme3: Connect Invalid Data Parameter, subsysnqn "nqn.2019-05.io.openebs:595e6507-8be0-4a43-b2ce-1e141d894700"
[23161.654479] nvme nvme3: failed to connect queue: 0 ret=386

To Reproduce

Unfortunately, we have no idea how to reproduce this issue; it appeared spontaneously after a system reboot, and several previous reboots had completed without incident.

Expected behavior

We would have expected none of the issues described above.

**OS info:**

  • node OS: Ubuntu 20.04.6 LTS
  • k8s distro: MicroK8s v1.26.15 revision 6673
  • Kernel version: 5.4.0-187-generic
  • MayaStor image: docker.io/openebs/mayastor-io-engine:v2.4.0

Additional context

This is currently really blocking us, so we'd be glad for any advice or suggestions. Thanks!

hsteude avatar Jun 27 '24 15:06 hsteude

It seems like the pool is perhaps not present on the disks where we expect it. I noticed you created the pool with a plain device URI rather than a /dev/disk/by-id path. Could it be possible that the order of your disks changed after the reboot?
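For illustration, a DiskPool spec pinned to a by-id path might look like the sketch below. This is a hedged example, not taken from this cluster: the CRD apiVersion varies between releases, and the by-id name shown is a placeholder for whatever `ls -l /dev/disk/by-id/` reports for your device.

```yaml
apiVersion: "openebs.io/v1beta2"   # check the DiskPool CRD version shipped with your release
kind: DiskPool
metadata:
  name: pool-on-spacekisste01
  namespace: mayastor
spec:
  node: spacekisste01
  # A by-id alias stays stable across reboots even if /dev/nvme* renumbers.
  disks: ["/dev/disk/by-id/nvme-<model>_<serial>"]
```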

tiagolobocastro avatar Jul 03 '24 22:07 tiagolobocastro

Thanks a lot for your answer. Unfortunately, that wasn't the issue. But we'll definitely use /dev/disk/by-id paths next time.

hsteude avatar Jul 05 '24 07:07 hsteude

Unfortunately, that wasn't the issue

Alright, then something else may be at play here. Would you be able to upload a support bundle? Example: kubectl mayastor -n mayastor dump system

tiagolobocastro avatar Jul 08 '24 08:07 tiagolobocastro

Hi @tiagolobocastro, sure, I have attached the bundle to this comment. Thanks for taking a look! mayastor-support-bundle.tar.gz

hsteude avatar Jul 09 '24 08:07 hsteude

@hsteude would you be able to run a container on the node to run some metadata dump utility on the disk?

tiagolobocastro avatar Jul 09 '24 17:07 tiagolobocastro

@tiagolobocastro: Yes, I can run containers on the nodes. Could you please provide the specific steps or command you would like me to use? Unfortunately, we'll need to reset the cluster and reinstall everything tomorrow afternoon, so I would need to do this by tomorrow morning (Berlin time).

hsteude avatar Jul 09 '24 20:07 hsteude

Hey, hope this is still in time. Simply run this and capture the output, thank you:

docker run --rm --name mayastor-spdk -v /dev/hugepages:/dev/hugepages --privileged -it tiagolobocastro/mayastor-spdk-blobcli:v23.05 -b nvme -D

tiagolobocastro avatar Jul 10 '24 09:07 tiagolobocastro

@tiagolobocastro Thanks, is the source of that image public?

hsteude avatar Jul 10 '24 09:07 hsteude

It is now: https://github.com/openebs/mayastor/tree/blobcli-23.05. It just builds blobcli and puts it in a container with a config file. If you want to build it yourself:

❯ docker load <$(nix-build -A images.mayastor-spdk)
warning: 'https://mayastor-cp.cachix.org' does not appear to be a binary cache
building '/nix/store/x2m9hr9xy0jg6qsv750vfi0vp4jd8kdw-repo_org.drv'...
patching script interpreter paths in ./git-org-name.sh
./git-org-name.sh: interpreter directive changed from "#!/usr/bin/env bash" to "/nix/store/dsd5gz46hdbdk2rfdimqddhq6m8m8fqs-bash-5.1-p16/bin/bash"
building '/nix/store/wdsqyjws8vmgd1m029bkzkl8ha2vzhmr-io-engine-version.drv'...
this derivation will be built:
  /nix/store/0v5ysdn7qd65q8v4yaawik7ldc2rdj7c-docker-image-mayastor-spdk.tar.gz.drv
building '/nix/store/0v5ysdn7qd65q8v4yaawik7ldc2rdj7c-docker-image-mayastor-spdk.tar.gz.drv'...
Adding layer...
tar: Removing leading `/' from member names
Adding meta...
Cooking the image...
Finished.
Loaded image: openebs/mayastor-spdk:703fff02c8ff

Then rename/tag as necessary.

tiagolobocastro avatar Jul 10 '24 11:07 tiagolobocastro

Thanks a lot @tiagolobocastro! I tried it but ran into this:

~$ docker run --rm --name mayastor-spdk -v /dev/hugepages:/dev/hugepages --privileged -it tiagolobocastro/mayastor-spdk-blobcli:v23.05 -b nvme0n1 -D
[2024-07-10 12:21:36.036047] Starting SPDK v23.05 / DPDK 22.11.1 initialization...
[2024-07-10 12:21:36.036115] [ DPDK EAL parameters: blobcli --no-shconf -c 0x1 --huge-unlink --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --iova-mode=pa --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid1 ]
EAL: Error creating '/var/run/dpdk': No such file or directory
EAL: No DPDK runtime directory created
TELEMETRY: No legacy callbacks, legacy socket not created
[2024-07-10 12:21:36.157624] app.c: 767:spdk_app_start: *NOTICE*: Total cores available: 1
[2024-07-10 12:21:36.210992] reactor.c: 937:reactor_run: *NOTICE*: Reactor started on core 0
[2024-07-10 12:21:36.254003] accel_sw.c: 603:sw_accel_module_init: *NOTICE*: Accel framework software module initialized.
[2024-07-10 12:21:36.261860] bdev_aio.c: 110:bdev_aio_open: *ERROR*: open() failed (file:/dev/sda), errno 2: No such file or directory
[2024-07-10 12:21:36.261893] bdev_aio.c: 725:create_aio_bdev: *ERROR*: Unable to open file /dev/sda. fd: -1 errno: 2
[2024-07-10 12:21:36.265842] json_config.c: 194:rpc_client_poller: *ERROR*: error response:
{
  "code": -2,
  "message": "No such file or directory"
}
[2024-07-10 12:21:36.265974] app.c: 873:spdk_app_stop: *WARNING*: spdk_app_stop'd on non-zero
ERROR!

I don't know why it's looking for /dev/sda.

hsteude avatar Jul 10 '24 12:07 hsteude

Sorry, I had left an entry for sda in there from my testing. I've pushed both the branch and the image; could you please try again?

tiagolobocastro avatar Jul 10 '24 12:07 tiagolobocastro

That worked, this is the output on one of the broken nodes: mayastor-spdk-output-spaceksisste01.txt.zip

hsteude avatar Jul 10 '24 13:07 hsteude

Great thanks, could you also run:

docker run --rm --name mayastor-spdk -v /dev/hugepages:/dev/hugepages --privileged -it tiagolobocastro/mayastor-spdk-blobcli:v23.05 -b nvme0n1 -s bs

tiagolobocastro avatar Jul 10 '24 13:07 tiagolobocastro

@tiagolobocastro Thanks a lot for your help. I'm getting an error running that:

~$ docker run --rm --name mayastor-spdk -v /dev/hugepages:/dev/hugepages --privileged -it tiagolobocastro/mayastor-spdk-blobcli:v23.05 -b nvme0n1 -s bs
[2024-07-10 14:13:46.235060] Starting SPDK v23.05 / DPDK 22.11.1 initialization...
[2024-07-10 14:13:46.235137] [ DPDK EAL parameters: blobcli --no-shconf -c 0x1 --huge-unlink --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --iova-mode=pa --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid1 ]
EAL: Error creating '/var/run/dpdk': No such file or directory
EAL: No DPDK runtime directory created
TELEMETRY: No legacy callbacks, legacy socket not created
[2024-07-10 14:13:46.355574] app.c: 767:spdk_app_start: *NOTICE*: Total cores available: 1
[2024-07-10 14:13:46.399475] reactor.c: 937:reactor_run: *NOTICE*: Reactor started on core 0
[2024-07-10 14:13:46.446966] accel_sw.c: 603:sw_accel_module_init: *NOTICE*: Accel framework software module initialized.
[2024-07-10 14:13:46.575475] blobstore.c:4696:bs_recover: *NOTICE*: Performing recovery on blobstore
[2024-07-10 14:13:46.575630] blobstore.c:4643:bs_load_replay_md_cpl: *NOTICE*: Recover: blob 0x0
[2024-07-10 14:13:46.575720] blobstore.c:4643:bs_load_replay_md_cpl: *NOTICE*: Recover: blob 0x1
Error in load callback (err -84)
[2024-07-10 14:13:46.584808] app.c: 873:spdk_app_stop: *WARNING*: spdk_app_stop'd on non-zero
ERROR!

hsteude avatar Jul 10 '24 14:07 hsteude

On the node that works, the output looks slightly different:

~$ docker run --rm --name mayastor-spdk -v /dev/hugepages:/dev/hugepages --privileged -it tiagolobocastro/mayastor-spdk-blobcli:v23.05 -b nvme0n1 -s bs
Unable to find image 'tiagolobocastro/mayastor-spdk-blobcli:v23.05' locally
v23.05: Pulling from tiagolobocastro/mayastor-spdk-blobcli
1b4133f42d6a: Pull complete
Digest: sha256:b9db3205d095393d8b2b597cd5db9b03e5132e6e5bdfb0e0e97906a8f79d0897
Status: Downloaded newer image for tiagolobocastro/mayastor-spdk-blobcli:v23.05
[2024-07-10 14:45:27.875845] Starting SPDK v23.05 / DPDK 22.11.1 initialization...
[2024-07-10 14:45:27.875917] [ DPDK EAL parameters: blobcli --no-shconf -c 0x1 --huge-unlink --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --iova-mode=pa --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid1 ]
EAL: Error creating '/var/run/dpdk': No such file or directory
EAL: No DPDK runtime directory created
TELEMETRY: No legacy callbacks, legacy socket not created
[2024-07-10 14:45:27.998015] app.c: 767:spdk_app_start: *NOTICE*: Total cores available: 1
[2024-07-10 14:45:28.055415] reactor.c: 937:reactor_run: *NOTICE*: Reactor started on core 0
[2024-07-10 14:45:28.098014] accel_sw.c: 603:sw_accel_module_init: *NOTICE*: Accel framework software module initialized.
[2024-07-10 14:45:28.271874] blobstore.c:1659:blob_load_cpl: *ERROR*: Metadata page 2227 crc mismatch for blobid 0x100000001
Blobstore Public Info:
	Using bdev Product Name: AIO disk
	API Version: 3
	super blob ID: 0x100000000
	page size: 4096
	io unit size: 512
	cluster size: 4194304
	# free clusters: 1426423
	blobstore type:
00000000  4c 56 4f 4c 53 54 4f 52  45 00 00 00 00 00 00 00   LVOLSTORE.......

Blobstore Private Info:
	Metadata start (pages): 169
	Metadata length (pages): 183141

hsteude avatar Jul 10 '24 14:07 hsteude

hmm, this might require a lot more debugging of the data structures to find the issue.

If you could send us a dump of the start of the disk, say 200MiB, that would also help us diagnose. This is only metadata: by default you get 4MiB for the super and then 1 page (4kiB) for every cluster. On your 7T disk you can see you have 183141 metadata pages, which means the metadata extends to roughly 716MiB before your data starts.
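As a sanity check on those numbers, here is a back-of-the-envelope sketch assuming the 4 KiB page size and 4 MiB cluster size reported by blobcli on the healthy node:

```shell
# Offset at which user data begins: metadata start + metadata length, in 4 KiB pages.
echo "metadata ends near $(( (169 + 183141) * 4096 / 1048576 )) MiB"

# Free space implied by the free-cluster count (1426423 clusters of 4 MiB each),
# which should roughly match the 5.4 TiB AVAILABLE shown in the pool listing.
echo "free space about $(( 1426423 * 4 / 1024 )) GiB"
```

The free-space figure lining up with the pool listing is a useful hint that the blobstore on the healthy node is intact.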

tiagolobocastro avatar Jul 10 '24 15:07 tiagolobocastro

Alright, thanks a lot for your investigation. I'll create the dump and then proceed to nuke the drives, as we really need to get the deployment up and running again. Do you have any idea how I can easily verify that no sensitive data is included in the dump? BTW: we're really looking forward to seeing this being released: https://github.com/openebs/mayastor/issues/1064

hsteude avatar Jul 10 '24 16:07 hsteude

Do you have any idea how I can easily verify that no sensitive data is included in the dump?

Create the dump like so:

> sudo dd if=/dev/nvme0n1 of=dump.io bs=1M count=200
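For scale, a sketch assuming the 4 KiB metadata page size from the blobcli output: a 200 MiB dump covers the superblock plus this many metadata pages, though not the whole metadata region.

```shell
# Number of 4 KiB pages contained in a 200 MiB dump of the disk's start.
echo $(( 200 * 1048576 / 4096 ))
```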

You can run strings against the dump, example:

> strings dump.io
SPDKBLOB
LVOLSTORE
TUUP
UQ@UUAU
...................
@@Q@A
uuid54c14dab-8461-49a3-ac7d-dbf1287d47c2
namepool-nvme0n1
name08b67cce-b785-4c3c-921e-39a660167d39
uuid08b67cce-b785-4c3c-921e-39a660167d39
sharedtrue
.......................

BTW: we're really looking forward to seeing this being released: #1064

It's released :)

tiagolobocastro avatar Jul 10 '24 16:07 tiagolobocastro

Hey @hsteude did you ever collect the metadata dump?

tiagolobocastro avatar Jul 23 '24 11:07 tiagolobocastro

Please reopen if required, thank you.

tiagolobocastro avatar Aug 06 '24 16:08 tiagolobocastro

Hi @tiagolobocastro ,

We're facing the same problem: 2 out of 3 nodes can't reattach their disks to the pool.

Our disks were referenced by-id and there are no changes in their structure; they simply failed after a major network outage at the provider plus a reboot.

We've tried to build your mayastor-spdk image, but it fails due to a missing dependency:

https://github.com/openebs/mayastor/tree/blobcli-23.05

(screenshot of the build error attached)

We've tried modifying nixos/nix not to use latest, and tried versions 2.24.11 and 2.24.4, with exactly the same error.

Any help would be appreciated.

Best regards,

kpeiruza avatar Jan 13 '25 08:01 kpeiruza

@tiagolobocastro Do we need to reopen this?

Abhinandan-Purkait avatar Jan 19 '25 15:01 Abhinandan-Purkait

Hey there, we started facing the same issue a couple of days ago when our Kubernetes hosts restarted. On the surface, it seems the hosts restarted because OpenEBS caused the kernel to crash. Since then it is unable to create disk pools.

The openebs-io-engine pods throw this:

  2025-04-15T15:32:34.158261Z ERROR metrics_exporter_io_engine::cache::pool: Error getting pools data, invalidating pools cache, error: GrpcResponseError("status: Unavailable, message: \"error trying to connect: tcp connect error: Connection refused (os error 111)\", details: [], metadata: MetadataMap { headers: {} }")
    at metrics-exporter/src/bin/io_engine/cache/pool.rs:38

  2025-04-15T15:32:34.159194Z ERROR metrics_exporter_io_engine::cache::pool_stat: Error getting pools data, invalidating pools cache, error: GrpcResponseError("status: Unavailable, message: \"error trying to connect: tcp connect error: Connection refused (os error 111)\", details: [], metadata: MetadataMap { headers: {} }")
    at metrics-exporter/src/bin/io_engine/cache/pool_stat.rs:38

  2025-04-15T15:32:34.160005Z ERROR metrics_exporter_io_engine::cache::nexus_stat: Error getting nexus stats data, invalidating nexus stats cache, error: GrpcResponseError("status: Unavailable, message: \"error trying to connect: tcp connect error: Connection refused (os error 111)\", details: [], metadata: MetadataMap { headers: {} }")
    at metrics-exporter/src/bin/io_engine/cache/nexus_stat.rs:37

  2025-04-15T15:32:34.160671Z ERROR metrics_exporter_io_engine::cache::replica_stat: Error getting replica stats data, invalidating replica cache, error: GrpcResponseError("status: Unavailable, message: \"error trying to connect: tcp connect error: Connection refused (os error 111)\", details: [], metadata: MetadataMap { headers: {} }")
    at metrics-exporter/src/bin/io_engine/cache/replica_stat.rs:40

  2025-04-15T15:32:34.161062Z  INFO actix_web::middleware::logger: 10.200.0.18 "GET /metrics HTTP/1.1" 200 0 "-" "Prometheus/2.51.2" 0.003858
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

Current versions:

  • OS release: Ubuntu 22.04.5 LTS
  • Kernel version: 5.15.0-136-generic
  • OpenEBS version: 4.1.3

Our plan is to upgrade to the latest Ubuntu release and also update the kernel.

Planned versions:

  • OS release: Ubuntu 24.04.2 LTS
  • Kernel version: >=5.19
  • OpenEBS version: 4.1.3

If you have any idea on how this issue happens or what the solution could be to avoid this in the future, let me know.

smorci-alchemist avatar Apr 15 '25 15:04 smorci-alchemist

Could you please share logs of the io_engine container within the io_engine pod

tiagolobocastro avatar Apr 15 '25 15:04 tiagolobocastro

Here are the logs. Sorry for the color codes.

smorci-alchemist avatar Apr 16 '25 08:04 smorci-alchemist

Those are the logs of the exporter container. The io-engine pod has two containers: io-engine and metrics-exporter-io-engine. Please share the logs of the io-engine container, e.g.: kubectl logs io-engine-xxxxxx -c io-engine

tiagolobocastro avatar Apr 16 '25 08:04 tiagolobocastro

Right, sorry about that. Here you go:

[2025-04-16T08:21:20.327356753+00:00  INFO io_engine:io-engine.rs:253] Engine responsible for managing I/Os version 1.0.0, revision 02e79b6205f4 (v2.7.3+0)
[2025-04-16T08:21:20.327637869+00:00  INFO io_engine:io-engine.rs:232] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2025-04-16T08:21:20.327652901+00:00  INFO io_engine:io-engine.rs:233] free_pages 1GB: 0 nr_pages 1GB: 0
[2025-04-16T08:21:20.327932358+00:00  INFO io_engine:io-engine.rs:285] kernel io_uring support: yes
[2025-04-16T08:21:20.327941818+00:00  INFO io_engine:io-engine.rs:289] kernel nvme initiator multipath support: yes
[2025-04-16T08:21:20.327998278+00:00  INFO io_engine::core::env:env.rs:945] loading mayastor config YAML file /var/local/openebs/io-engine/config.yaml
[2025-04-16T08:21:20.328036793+00:00  INFO io_engine::subsys::config:mod.rs:189] Config file /var/local/openebs/io-engine/config.yaml is empty, reverting to default config
[2025-04-16T08:21:20.328059826+00:00  INFO io_engine::subsys::config::opts:opts.rs:169] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2025-04-16T08:21:20.328069884+00:00  INFO io_engine::subsys::config::opts:opts.rs:169] Overriding NVMF_TCP_MAX_QPAIRS_PER_CTRL value to '32'
[2025-04-16T08:21:20.328087066+00:00  INFO io_engine::subsys::config::opts:opts.rs:236] Overriding NVME_TIMEOUT value to '300s'
[2025-04-16T08:21:20.328096352+00:00  INFO io_engine::subsys::config::opts:opts.rs:236] Overriding NVME_TIMEOUT_ADMIN value to '30s'
[2025-04-16T08:21:20.328103404+00:00  INFO io_engine::subsys::config::opts:opts.rs:236] Overriding NVME_KATO value to '10s'
[2025-04-16T08:21:20.328133968+00:00  INFO io_engine::subsys::config:mod.rs:240] Applying Mayastor configuration settings
[2025-04-16T08:21:20.328144734+00:00  INFO io_engine::subsys::config::opts:opts.rs:363] NVMe Bdev options successfully applied
[2025-04-16T08:21:20.328152226+00:00  INFO io_engine::subsys::config::opts:opts.rs:500] Bdev options successfully applied
[2025-04-16T08:21:20.328161201+00:00  INFO io_engine::subsys::config::opts:opts.rs:637] Socket options successfully applied
[2025-04-16T08:21:20.328167961+00:00  INFO io_engine::subsys::config::opts:opts.rs:668] I/O buffer options successfully applied
[2025-04-16T08:21:20.328174644+00:00  INFO io_engine::subsys::config:mod.rs:246] Config {
    source: Some(
        "/var/local/openebs/io-engine/config.yaml",
    ),
    nvmf_tcp_tgt_conf: NvmfTgtConfig {
        name: "mayastor_target",
        max_namespaces: 2048,
        crdt: [
            30,
            0,
            0,
        ],
        opts: NvmfTcpTransportOpts {
            max_queue_depth: 32,
            max_qpairs_per_ctrl: 32,
            in_capsule_data_size: 4096,
            max_io_size: 131072,
            io_unit_size: 131072,
            max_aq_depth: 32,
            num_shared_buf: 2047,
            buf_cache_size: 64,
            dif_insert_or_strip: false,
            abort_timeout_sec: 1,
            acceptor_poll_rate: 10000,
            zcopy: true,
        },
        interface: None,
    },
    nvme_bdev_opts: NvmeBdevOpts {
        action_on_timeout: 4,
        timeout_us: 300000000,
        timeout_admin_us: 30000000,
        keep_alive_timeout_ms: 10000,
        transport_retry_count: 0,
        arbitration_burst: 0,
        low_priority_weight: 0,
        medium_priority_weight: 0,
        high_priority_weight: 0,
        nvme_adminq_poll_period_us: 1000,
        nvme_ioq_poll_period_us: 0,
        io_queue_requests: 0,
        delay_cmd_submit: true,
        bdev_retry_count: 0,
        transport_ack_timeout: 0,
        ctrlr_loss_timeout_sec: 0,
        reconnect_delay_sec: 0,
        fast_io_fail_timeout_sec: 0,
        disable_auto_failback: false,
        generate_uuids: true,
    },
    bdev_opts: BdevOpts {
        bdev_io_pool_size: 65535,
        bdev_io_cache_size: 512,
    },
    nexus_opts: NexusOpts {
        nvmf_enable: true,
        nvmf_discovery_enable: true,
        nvmf_nexus_port: 4421,
        nvmf_replica_port: 8420,
    },
    socket_opts: PosixSocketOpts {
        recv_buf_size: 2097152,
        send_buf_size: 2097152,
        enable_recv_pipe: true,
        enable_zero_copy_send: true,
        enable_quickack: true,
        enable_placement_id: 0,
        enable_zerocopy_send_server: true,
        enable_zerocopy_send_client: false,
        zerocopy_threshold: 0,
    },
    iobuf_opts: IoBufOpts {
        small_pool_count: 8192,
        large_pool_count: 2048,
        small_bufsize: 8192,
        large_bufsize: 135168,
    },
    eal_opts: EalOpts {
        reactor_mask: None,
        core_list: None,
        developer_delay: None,
    },
}
TELEMETRY: No legacy callbacks, legacy socket not created
[2025-04-16T08:21:20.447643985+00:00  INFO io_engine::core::mempool:mempool.rs:53] Memory pool 'bdev_io_ctx' with 65535 elements (24 bytes size) successfully created
[2025-04-16T08:21:20.454508780+00:00  INFO io_engine::core::mempool:mempool.rs:53] Memory pool 'nvme_ctrl_io_ctx' with 65535 elements (72 bytes size) successfully created
[2025-04-16T08:21:20.454531207+00:00  INFO io_engine::core::env:env.rs:1017] Total number of cores available: 2
[2025-04-16T08:21:20.513842111+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'init_thread' (0x56042349da10) on core #4
[2025-04-16T08:21:20.513914625+00:00  INFO io_engine::core::reactor:reactor.rs:165] Init thread ID 1
[2025-04-16T08:21:20.514132917+00:00  INFO io_engine::core::reactor:reactor.rs:309] Starting reactor polling loop core=5 tid=12
[2025-04-16T08:21:20.515149207+00:00  INFO io_engine::core::env:env.rs:1052] All cores locked and loaded!
[2025-04-16T08:21:20.650573846+00:00  INFO io_engine::bdev::nexus::nexus_module:nexus_module.rs:41] Initializing Nexus CAS Module
[2025-04-16T08:21:20.650760105+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_4' (0x56042349f580) on core #4
[2025-04-16T08:21:20.650844191+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_5' (0x56042349fbd0) on core #5
[2025-04-16T08:21:20.653499550+00:00  INFO io_engine::core::env:env.rs:904] Using 'MY_POD_IP' environment variable for IP address for NVMF target network interface
[2025-04-16T08:21:20.653613467+00:00  INFO io_engine::subsys::nvmf::target:target.rs:263] nvmf target listening on 10.200.0.17:(4421,8420)
[2025-04-16T08:21:20.653657295+00:00  INFO io_engine::subsys::nvmf::subsystem:subsystem.rs:842] Subsystem start in progress... self=NvmfSubsystem { id: 0, subtype: "Discovery", subnqn: "nqn.2014-08.org.nvmexpress.discovery", sn: "00000000000000000000", mn: "Mayastor NVMe controller", allow_any_host: 1, ana_reporting: 0, listeners: Some([Transport ID { trtype: 3, trstring: "TCP", traddr: "10.200.0.17", trsvcid: "8420" }]) }
[2025-04-16T08:21:20.653721950+00:00  INFO io_engine::subsys::nvmf::subsystem:subsystem.rs:893] Subsystem start completed: Ok self=NvmfSubsystem { id: 0, subtype: "Discovery", subnqn: "nqn.2014-08.org.nvmexpress.discovery", sn: "00000000000000000000", mn: "Mayastor NVMe controller", allow_any_host: 1, ana_reporting: 0, listeners: Some([Transport ID { trtype: 3, trstring: "TCP", traddr: "10.200.0.17", trsvcid: "8420" }]) }
[2025-04-16T08:21:20.653745732+00:00  INFO io_engine::subsys::nvmf::target:target.rs:361] nvmf target accepting new connections and is ready to roll..💃
[2025-04-16T08:21:20.751289020+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'iscsi_poll_group_4' (0x56042364b6b0) on core #4
[2025-04-16T08:21:20.751545991+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'iscsi_poll_group_5' (0x56042364ba00) on core #5
[2025-04-16T08:21:20.751603929+00:00  INFO io_engine::core::env:env.rs:929] RPC server listening at: /var/tmp/mayastor.sock
[2025-04-16T08:21:20.751989428+00:00  WARN io_engine:io-engine.rs:90] Nexus reset is disabled
[2025-04-16T08:21:20.751997806+00:00  INFO io_engine:io-engine.rs:109] Async QPair connection feature ('spdk-async-qpair-connect') is enabled
[2025-04-16T08:21:20.752001792+00:00  INFO io_engine:io-engine.rs:110] Fault injection feature ('fault-injection') is disabled
[2025-04-16T08:21:20.773455708+00:00  INFO io_engine::persistent_store:persistent_store.rs:149] Connected to etcd on endpoint openebs-etcd:2379
[2025-04-16T08:21:20.773541862+00:00  INFO io_engine::persistent_store:persistent_store.rs:125] Persistent store operation timeout: 10s, number of retries: 300
[2025-04-16T08:21:20.773624335+00:00  INFO io_engine::grpc::server:server.rs:79] [V1] gRPC server configured at address 10.200.0.17:10124
[2025-04-16T08:21:20.773927548+00:00  INFO io_engine::subsys::registration::registration_grpc:registration_grpc.rs:211] Registering '"rpc-platform-prod-use1-1"' with grpc server 10.200.0.17:10124 ...
[2025-04-16T08:21:20.788066827+00:00  INFO io_engine::grpc::v1::pool:pool.rs:571] ImportPoolRequest { name: "diskpool-rpc-platform-prod-use1-1", uuid: None, disks: ["/dev/mayastor_vg/mayastor_lv"], pooltype: Lvs }

smorci-alchemist avatar Apr 16 '25 08:04 smorci-alchemist

hmm, the pool import seems stuck. Your logical volume under mayastor isn't suspended, is it?

sudo -E dmsetup info -C -o info_all /dev/mayastor_vg/mayastor_lv

tiagolobocastro avatar Apr 16 '25 09:04 tiagolobocastro

@tiagolobocastro I ran the command on our Kubernetes hosts and the logical volumes all seem to be active. The outputs are almost identical:

BlkDevName       Stat Tables Suspended Read-only DevNo Maj Min Open Targ Event
dm-0             L--w Live   Active    Writeable 253:0 253   0    1    1      0

smorci-alchemist avatar Apr 16 '25 14:04 smorci-alchemist

How odd. Are the logs still the same, or are there any further messages?

tiagolobocastro avatar Apr 16 '25 14:04 tiagolobocastro