
NDM looping constantly causing high cpu usage with `Error: unreachable state`

magnetised opened this issue 3 years ago • 2 comments

What steps did you take and what happened: I've just installed OpenEBS as part of k0s on an AWS EC2 instance with two disks: the host disk and a separate EBS data volume. Everything seems to be working fine, but one of the NDM pods sits at a constant ~20% CPU usage. Looking at the logs, it appears to be stuck in a loop querying the host/node disks.
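For reference, per-pod and per-container CPU usage can be confirmed with kubectl top, assuming metrics-server is installed (the pod name is the NDM pod listed further down):

kubectl top pod openebs-ndm-jpvpw -n openebs --containers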

Looking at another server with the same NDM version but a simpler, single-disk setup, the exact same thing is happening.

What did you expect to happen: I expected the NDM process not to be constantly consuming CPU in a tight loop.

The output of the following commands will help us better understand what's going on: [Pasting long output into a GitHub gist or other pastebin is fine.]

  • kubectl get pods -n openebs
NAME                                           READY   STATUS    RESTARTS      AGE
openebs-localpv-provisioner-6ccc9d6fc9-kcnhs   1/1     Running   9 (19h ago)   20h
openebs-ndm-jpvpw                              1/1     Running   0             26m
openebs-ndm-operator-7bd6898d96-vz54r          1/1     Running   9 (19h ago)   20h

  • kubectl get blockdevices -n openebs -o yaml
apiVersion: v1
items:
- apiVersion: openebs.io/v1alpha1
  kind: BlockDevice
  metadata:
    annotations:
      internal.openebs.io/uuid-scheme: gpt
    creationTimestamp: "2022-07-05T13:22:38Z"
    generation: 20
    labels:
      kubernetes.io/hostname: ip-172-31-18-163.eu-west-1.compute.internal
      ndm.io/blockdevice-type: blockdevice
      ndm.io/managed: "true"
    name: blockdevice-01fd0d0d966998648102985c5f12e22a
    namespace: openebs
    resourceVersion: "64236"
    uid: 9d3e2ec3-57b5-4303-829c-e0cfa51f2f07
  spec:
    capacity:
      logicalSectorSize: 512
      physicalSectorSize: 512
      storage: 137437888000
    details:
      compliance: ""
      deviceType: partition
      driveType: SSD
      firmwareRevision: ""
      hardwareSectorSize: 512
      logicalBlockSize: 512
      model: Amazon Elastic Block Store
      physicalBlockSize: 512
      serial: vol033aa51d4508ed1b0
      vendor: ""
    devlinks:
    - kind: by-id
      links:
      - /dev/disk/by-id/nvme-nvme.1d0f-766f6c3033336161353164343530386564316230-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001-part1
      - /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol033aa51d4508ed1b0-part1
      - /dev/disk/by-id/wwn-nvme.1d0f-766f6c3033336161353164343530386564316230-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001-part1
    - kind: by-path
      links:
      - /dev/disk/by-path/pci-0000:00:1f.0-nvme-1-part1
    filesystem:
      fsType: xfs
      mountPoint: /var/openebs
    nodeAttributes:
      nodeName: ip-172-31-18-163.eu-west-1.compute.internal
    partitioned: "No"
    path: /dev/nvme1n1p1
  status:
    claimState: Unclaimed
    state: Inactive
kind: List
metadata:
  resourceVersion: ""
  • kubectl get blockdeviceclaims -n openebs -o yaml
apiVersion: v1
items: []
kind: List
metadata:
  resourceVersion: ""

  • kubectl logs <ndm daemon pod name> -n openebs

Just including two loops here; it goes on like this permanently (full log in the gist below). A quick way to gauge how fast it repeats is shown after the lsblk output.

https://gist.github.com/magnetised/c1f2bef4242b663721d87898f8416d65

  • lsblk from nodes where ndm daemonset is running
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme1n1     259:0    0  128G  0 disk
└─nvme1n1p1 259:4    0  128G  0 part /var/openebs
nvme0n1     259:1    0  128G  0 disk
├─nvme0n1p1 259:2    0    1M  0 part
└─nvme0n1p2 259:3    0  128G  0 part /
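As a rough way to measure how fast the loop repeats, the recurring error from the title can be counted over a fixed window (a sketch; the pod name is the NDM pod above, and the grep pattern may need adjusting to the exact log format):

kubectl logs openebs-ndm-jpvpw -n openebs --since=5m | grep -c "unreachable state"

If the loop is driven by device events rather than by NDM's own rescanning, running udevadm monitor on the node for a minute or two should also show whether add/change events for the NVMe devices keep firing.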

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

  • OpenEBS version

openebs.io/version=3.0.0 node-disk-manager:1.7.0

  • Kubernetes version (use kubectl version):
Client Version: v1.24.2
Kustomize Version: v4.5.4
Server Version: v1.23.6+k0s
  • Kubernetes installer & version:

K0s version v1.23.6+k0s.0

  • Cloud provider or hardware configuration:

AWS EC2 instance

  • Type of disks connected to the nodes (eg: Virtual Disks, GCE/EBS Volumes, Physical drives etc)

Host root disk nvme0n1; OpenEBS volume nvme1n1 with a single partition nvme1n1p1 mounted at /var/openebs.

  • OS (e.g. from /etc/os-release):
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.6"

magnetised · Jul 06 '22 11:07

I'm seeing the exact same issue with a vanilla k0s v1.27.2+k0s.0 installation with the OpenEBS extension enabled (openebs/node-disk-manager:1.9.0). It consumes over 60% of a CPU while idle, with no PVs or anything else provisioned. This is really bad.

artem-zinnatullin · Jun 25 '23 08:06

Hi, we had the same issue on-premises, and it was caused by the presence of "/dev/sr1" on the VM, so I think the filter should be updated to exclude unusable devices.
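As a workaround in the meantime, the path filter in the openebs-ndm-config ConfigMap (under the node-disk-manager.config key) can be extended to exclude such devices: keep whatever is already in the exclude list and append the offending path, e.g. /dev/sr1. A sketch of the relevant fragment, based on the default filter config shipped with these NDM versions (exact keys and values may differ by version, and the NDM DaemonSet pods typically need a restart to pick up the change):

filterconfigs:
  - key: path-filter
    name: path filter
    state: true
    include: ""
    exclude: "loop,/dev/fd0,/dev/sr0,/dev/sr1,/dev/ram,/dev/dm-,/dev/md,/dev/rbd,/dev/zd"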

gervaso · Nov 21 '23 14:11