Error “can't read superblock on /dev/drbd1001.” when trying to start a pod
I see errors in the pod events:
$ kubectl describe pod -n monitoring kube-prometheus-stack-grafana-5fd67c647-7m5hs
Name: kube-prometheus-stack-grafana-5fd67c647-7m5hs
Namespace: monitoring
Priority: 0
Service Account: kube-prometheus-stack-grafana
Node: k8s-m0/10.1.27.210
Start Time: Mon, 03 Jun 2024 08:55:48 +0200
Labels: app.kubernetes.io/instance=kube-prometheus-stack
app.kubernetes.io/name=grafana
pod-template-hash=5fd67c647
Annotations: checksum/config: 4eb333c36d59bdac7f568835d4c71dc0e5b1ca4f0bbc06eb5379f9ca352b8914
checksum/sc-dashboard-provider-config: 593c0a8778b83f11fe80ccb21dfb20bc46705e2be3178df1dc4c89d164c8cd9c
checksum/secret: 032056e9c62bbe9d1daa41ee49cd3d9524c076f51ca4c65adadf4ef08ef28712
kubectl.kubernetes.io/default-container: grafana
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/kube-prometheus-stack-grafana-5fd67c647
Init Containers:
init-chown-data:
Container ID:
Image: docker.io/library/busybox:1.31.1
Image ID:
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
Command:
chown
-R
472:472
/var/lib/grafana
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/lib/grafana from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2xnp (ro)
Containers:
grafana-sc-dashboard:
Container ID:
Image: quay.io/kiwigrid/k8s-sidecar:1.26.1
Image ID:
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
METHOD: WATCH
LABEL: grafana_dashboard
LABEL_VALUE: 1
FOLDER: /tmp/dashboards
RESOURCE: both
NAMESPACE: ALL
REQ_USERNAME: <set to the key 'admin-user' in secret 'kube-prometheus-stack-grafana'> Optional: false
REQ_PASSWORD: <set to the key 'admin-password' in secret 'kube-prometheus-stack-grafana'> Optional: false
REQ_URL: http://localhost:3000/api/admin/provisioning/dashboards/reload
REQ_METHOD: POST
Mounts:
/tmp/dashboards from sc-dashboard-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2xnp (ro)
grafana-sc-datasources:
Container ID:
Image: quay.io/kiwigrid/k8s-sidecar:1.26.1
Image ID:
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
METHOD: WATCH
LABEL: grafana_datasource
LABEL_VALUE: 1
FOLDER: /etc/grafana/provisioning/datasources
RESOURCE: both
REQ_USERNAME: <set to the key 'admin-user' in secret 'kube-prometheus-stack-grafana'> Optional: false
REQ_PASSWORD: <set to the key 'admin-password' in secret 'kube-prometheus-stack-grafana'> Optional: false
REQ_URL: http://localhost:3000/api/admin/provisioning/datasources/reload
REQ_METHOD: POST
Mounts:
/etc/grafana/provisioning/datasources from sc-datasources-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2xnp (ro)
grafana:
Container ID:
Image: docker.io/grafana/grafana:10.4.1
Image ID:
Ports: 3000/TCP, 9094/TCP, 9094/UDP
Host Ports: 0/TCP, 0/TCP, 0/UDP
SeccompProfile: RuntimeDefault
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Liveness: http-get http://:3000/api/health delay=60s timeout=30s period=10s #success=1 #failure=10
Readiness: http-get http://:3000/api/health delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_IP: (v1:status.podIP)
GF_SECURITY_ADMIN_USER: <set to the key 'admin-user' in secret 'kube-prometheus-stack-grafana'> Optional: false
GF_SECURITY_ADMIN_PASSWORD: <set to the key 'admin-password' in secret 'kube-prometheus-stack-grafana'> Optional: false
GF_PATHS_DATA: /var/lib/grafana/
GF_PATHS_LOGS: /var/log/grafana
GF_PATHS_PLUGINS: /var/lib/grafana/plugins
GF_PATHS_PROVISIONING: /etc/grafana/provisioning
Mounts:
/etc/grafana/grafana.ini from config (rw,path="grafana.ini")
/etc/grafana/provisioning/dashboards/sc-dashboardproviders.yaml from sc-dashboard-provider (rw,path="provider.yaml")
/etc/grafana/provisioning/datasources from sc-datasources-volume (rw)
/tmp/dashboards from sc-dashboard-volume (rw)
/var/lib/grafana from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2xnp (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kube-prometheus-stack-grafana
Optional: false
storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: kube-prometheus-stack-grafana
ReadOnly: false
sc-dashboard-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
sc-dashboard-provider:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kube-prometheus-stack-grafana-config-dashboards
Optional: false
sc-datasources-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-p2xnp:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 49m default-scheduler Successfully assigned monitoring/kube-prometheus-stack-grafana-5fd67c647-7m5hs to k8s-m0
Warning FailedAttachVolume 49m attachdetach-controller Multi-Attach error for volume "pvc-40a7bc3f-d655-4606-a671-863913f657c0" Volume is already used by pod(s) kube-prometheus-stack-grafana-5fd67c647-dcqrx
Warning FailedMount 49m kubelet MountVolume.SetUp failed for volume "pvc-40a7bc3f-d655-4606-a671-863913f657c0" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-40a7bc3f-d655-4606-a671-863913f657c0: mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o _netdev /dev/drbd1001 /var/lib/kubelet/pods/1fc6504d-b1e0-43d1-8012-32a69210124a/volumes/kubernetes.io~csi/pvc-40a7bc3f-d655-4606-a671-863913f657c0/mount
Output: mount: /var/lib/kubelet/pods/1fc6504d-b1e0-43d1-8012-32a69210124a/volumes/kubernetes.io~csi/pvc-40a7bc3f-d655-4606-a671-863913f657c0/mount: can't read superblock on /dev/drbd1001.
dmesg(1) may have more information after failed mount system call.
Normal SuccessfulAttachVolume 4m5s (x46 over 49m) attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-40a7bc3f-d655-4606-a671-863913f657c0"
Warning FailedMount 6s (x31 over 49m) kubelet MountVolume.WaitForAttach failed for volume "pvc-40a7bc3f-d655-4606-a671-863913f657c0" : volume pvc-40a7bc3f-d655-4606-a671-863913f657c0 has GET error for volume attachment csi-1e1f5fcc627e2e7c3ca2595b3593e4d17c9188e63cec5e7d8f1cbe950418e564: volumeattachments.storage.k8s.io "csi-1e1f5fcc627e2e7c3ca2595b3593e4d17c9188e63cec5e7d8f1cbe950418e564" is forbidden: User "system:node:k8s-m0" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'k8s-m0' and this object
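The last event in particular looks odd: the attach succeeds, yet the kubelet is denied access to the VolumeAttachment object. To cross-check which node the attachment actually references, the objects can be inspected directly (a diagnostic sketch using standard kubectl commands; the attachment name is copied from the event above):
$ kubectl get volumeattachments.storage.k8s.io | grep pvc-40a7bc3f-d655-4606-a671-863913f657c0
# show which node the attachment is bound to and whether it is marked attached
$ kubectl get volumeattachment csi-1e1f5fcc627e2e7c3ca2595b3593e4d17c9188e63cec5e7d8f1cbe950418e564 -o jsonpath='{.spec.nodeName}{"\n"}{.status.attached}{"\n"}'
If the attachment still points at the node that ran the previous replica of the pod, that would explain both the Multi-Attach error and the "no relationship found" message.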
The cluster consists of three combined master/worker nodes: k8s-m0, k8s-m1, and k8s-m2.
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-m0 Ready control-plane 28d v1.28.9 10.1.27.210 <none> Ubuntu 22.04.4 LTS 5.15.0-107-generic containerd://1.7.3
k8s-m1 Ready control-plane 28d v1.28.9 10.1.27.211 <none> Ubuntu 22.04.4 LTS 5.15.0-107-generic containerd://1.7.3
k8s-m2 Ready control-plane 28d v1.28.9 10.1.27.212 <none> Ubuntu 22.04.4 LTS 5.15.0-107-generic containerd://1.7.3
I see only one LINSTOR error report:
$ kubectl exec -ti -n piraeus-datastore deployments/linstor-controller -- /bin/bash
root@linstor-controller-797bc7456f-8mgws:/# linstor error-reports list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Id ┊ Datetime ┊ Node ┊ Exception ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ 664F7013-00000-000000 ┊ 2024-06-02 22:00:12 ┊ C|linstor-controller-797bc7456f-8mgws ┊ SocketException: Network is unreachable ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@linstor-controller-797bc7456f-8mgws:/# linstor error-reports show 664F7013-00000-000000
ERROR REPORT 664F7013-00000-000000
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.27.0
Build ID: 8250eddde5f533facba39b4d1f77f1ef85f8521d
Build time: 2024-04-02T07:12:21+00:00
Error time: 2024-06-02 22:00:12
Node: linstor-controller-797bc7456f-8mgws
Thread: SslConnector
Access context information
Identity: PUBLIC
Role: PUBLIC
Domain: PUBLIC
Peer: Node: 'k8s-m2'
============================================================
Reported error:
===============
Category: Exception
Class name: SocketException
Class canonical name: java.net.SocketException
Generated at: Method 'pollConnect', Source file 'Net.java', Unknown line number
Error message: Network is unreachable
Error context:
I/O exception while attempting to connect to the peer
Call backtrace:
Method Native Class:Line number
pollConnect Y sun.nio.ch.Net:unknown
pollConnectNow N sun.nio.ch.Net:672
finishConnect N sun.nio.ch.SocketChannelImpl:946
establishConnection N com.linbit.linstor.netcom.TcpConnectorService:993
run N com.linbit.linstor.netcom.TcpConnectorService:728
run N java.lang.Thread:840
END OF ERROR REPORT.
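Since the report describes an I/O exception while the controller tried to connect to the peer 'k8s-m2', it may be worth confirming that all satellites are reachable and registered with the expected addresses. A quick check from inside the controller pod, using standard LINSTOR commands:
$ kubectl exec -ti -n piraeus-datastore deployments/linstor-controller -- linstor node list
# all three satellites should be ONLINE; an OFFLINE node would match the SocketException
$ kubectl exec -ti -n piraeus-datastore deployments/linstor-controller -- linstor node interface list k8s-m2
# verify the registered IP matches the node's actual address (10.1.27.212)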
There is an issue with the PVC used by the above pod when the pod is running on node k8s-m0:
root@linstor-controller-797bc7456f-8mgws:/# linstor r l
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m0 ┊ ┊ Unused ┊ StandAlone(k8s-m1) ┊ Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m1 ┊ ┊ Unused ┊ Connecting(k8s-m0) ┊ UpToDate ┊ 2024-05-06 16:11:31 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m2 ┊ ┊ InUse ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:35 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m0 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-29 15:11:23 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-29 15:11:27 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2024-05-29 15:11:26 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m0 ┊ ┊ InUse ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:28 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:34 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m0 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-23 16:37:38 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2024-05-23 16:37:43 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-23 16:37:44 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
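The first three rows are the interesting ones: k8s-m1 holds the UpToDate data but reports Connecting(k8s-m0), while the diskless resource on k8s-m0 is StandAlone towards k8s-m1, so k8s-m0 cannot reach the good replica. One way to probe and restart that DRBD connection is drbdadm on k8s-m0 (a sketch with standard DRBD commands, run either directly on the node or inside the LINSTOR satellite pod for that node; if the kernel log shows a split-brain, reconnecting alone will not help):
# on k8s-m0
$ drbdadm status pvc-40a7bc3f-d655-4606-a671-863913f657c0
# tear down and re-establish the connection to clear the StandAlone state
$ drbdadm disconnect pvc-40a7bc3f-d655-4606-a671-863913f657c0
$ drbdadm connect pvc-40a7bc3f-d655-4606-a671-863913f657c0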
When I run dmesg -T on the k8s-m0 node, I get the following output:
[Mo Jun 3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Preparing cluster-wide state change 352557914 (2->-1 3/1)
[Mo Jun 3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Aborting cluster-wide state change 352557914 (0ms) rv = -10
[Mo Jun 3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Declined by peer k8s-m2 (id: 1), see the kernel log there
[Mo Jun 3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Preparing cluster-wide state change 2847615758 (2->-1 3/1)
[Mo Jun 3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Aborting cluster-wide state change 2847615758 (0ms) rv = -10
[Mo Jun 3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Declined by peer k8s-m2 (id: 1), see the kernel log there
[Mo Jun 3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Auto-promote failed: State change was refused by peer node (-10)
[Mo Jun 3 05:16:06 2024] EXT4-fs (drbd1001): INFO: recovery required on readonly filesystem
[Mo Jun 3 05:16:06 2024] EXT4-fs (drbd1001): write access will be enabled during recovery
[Mo Jun 3 05:16:07 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0/0 drbd1001: Rejected WRITE request, not in Primary role. open_cnt:1 [mount:600441:2024-06-03_03:15:46.918]
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 0, lost async page write
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1, lost async page write
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1028, lost async page write
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1043, lost async page write
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1059, lost async page write
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 9267, lost async page write
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 524320, lost async page write
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1572880, lost async page write
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1572896, lost async page write
[Mo Jun 3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1581106, lost async page write
[Mo Jun 3 05:16:07 2024] JBD2: recovery failed
[Mo Jun 3 05:16:07 2024] EXT4-fs (drbd1001): error loading journal
[Mo Jun 3 05:16:10 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0 k8s-m2: Preparing remote state change 502811644
[Mo Jun 3 05:16:10 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0 k8s-m2: Committing remote state change 502811644 (primary_nodes=0)
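Reading the dmesg output, the sequence seems to be: the auto-promote of drbd1001 on k8s-m0 is refused because k8s-m2 still holds the resource InUse, ext4 then tries to replay its journal on a device it cannot write to, every journal write is rejected ("Rejected WRITE request, not in Primary role"), and the mount fails with the superblock error shown in the pod events. If the filesystem needs repair once the stale mount is gone, one possible approach (a sketch, assuming the UpToDate replica on k8s-m2 is intact and no pod is using the volume any more) is to replay the journal on a node that can become Primary:
# on k8s-m2, after the volume is Unused everywhere
$ drbdadm primary pvc-40a7bc3f-d655-4606-a671-863913f657c0
$ fsck.ext4 -f /dev/drbd1001
$ drbdadm secondary pvc-40a7bc3f-d655-4606-a671-863913f657c0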
Create and attach an SOS report:
kubectl exec -n piraeus-datastore -it deploy/linstor-controller -- linstor sos-report create
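The create command generates the tarball inside the controller pod and prints its path; to attach it here, it can be copied to the local machine first (the path below is a placeholder for whatever the command prints):
$ kubectl cp piraeus-datastore/<controller-pod>:<path-printed-by-the-command> ./sos-report.tar.gz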