Controller never becomes ready after upgrade to 2.5.2
This is a k8s 1.30 test cluster on Ubuntu Jammy. After upgrading from 2.5.1 to 2.5.2, the controller never becomes ready:
root@k8sw3:~# modinfo drbd
filename: /lib/modules/5.15.0-119-generic/updates/dkms/drbd.ko
softdep: post: handshake
alias: block-major-147-*
license: GPL
version: 9.2.10
description: drbd - Distributed Replicated Block Device v9.2.10
author: Philipp Reisner <[email protected]>, Lars Ellenberg <[email protected]>
srcversion: BADB6D1C4C401B173352EAB
depends: lru_cache,libcrc32c
retpoline: Y
name: drbd
vermagic: 5.15.0-119-generic SMP mod_unload modversions
parm: enable_faults:int
parm: fault_rate:int
parm: fault_count:int
parm: fault_devs:int
parm: disable_sendpage:bool
parm: allow_oos:DONT USE! (bool)
parm: minor_count:Approximate number of drbd devices (1U-255U) (uint)
parm: usermode_helper:string
parm: protocol_version_min:drbd_protocol_version
parm: strict_names:restrict resource and connection names to ascii alnum and a subset of punct (drbd_strict_names)
k8s@k8scp2:~$ kubectl get pods -n piraeus-datastore -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ha-controller-n8hm6 1/1 Running 7 (33m ago) 92d 10.244.5.116 k8sw3.example.com <none> <none>
ha-controller-vn95v 1/1 Running 7 (35m ago) 92d 10.244.3.239 k8sw2.example.com <none> <none>
ha-controller-zhrhc 1/1 Running 7 (33m ago) 92d 10.244.2.216 k8sw1.example.com <none> <none>
linstor-controller-5bc9565d7f-gc9ks 0/1 Running 4 (2m44s ago) 24m 10.244.5.125 k8sw3.example.com <none> <none>
linstor-csi-controller-5d9f5bb46f-cqkj9 0/7 Unknown 37 92d 10.244.2.217 k8sw1.example.com <none> <none>
linstor-csi-controller-6bd5d4dc9c-l7z5q 0/7 Init:0/1 0 24m 10.244.3.247 k8sw2.example.com <none> <none>
linstor-csi-node-2f2lf 0/3 Init:0/1 0 24m 10.244.3.248 k8sw2.example.com <none> <none>
linstor-csi-node-qdt8f 0/3 Init:0/1 0 24m 10.244.2.220 k8sw1.example.com <none> <none>
linstor-csi-node-vrvz7 0/3 Init:0/1 0 24m 10.244.5.124 k8sw3.example.com <none> <none>
linstor-satellite.k8sw1.example.com-cxrzm 2/2 Running 0 24m 10.82.0.23 k8sw1.example.com <none> <none>
linstor-satellite.k8sw2.example.com-8kdsh 2/2 Running 0 24m 10.82.0.24 k8sw2.example.com <none> <none>
linstor-satellite.k8sw3.example.com-tkkzk 2/2 Running 0 24m 10.82.0.26 k8sw3.example.com <none> <none>
piraeus-operator-controller-manager-db9f69747-v49pl 1/1 Running 0 24m 10.244.3.245 k8sw2.example.com <none> <none>
piraeus-operator-gencert-7997864766-rqhdl 1/1 Running 0 24m 10.244.3.246 k8sw2.example.com <none> <none>
k8s@k8scp2:~/linstor$ kubectl logs -n piraeus-datastore linstor-controller-5bc9565d7f-gc9ks
time="2024-08-25T08:21:16Z" level=info msg="running k8s-await-election" version=refs/tags/v0.4.1
time="2024-08-25T08:21:16Z" level=info msg="no status endpoint specified, will not be created"
I0825 08:21:16.899744 1 leaderelection.go:250] attempting to acquire leader lease piraeus-datastore/linstor-controller...
I0825 08:21:16.919634 1 leaderelection.go:260] successfully acquired lease piraeus-datastore/linstor-controller
time="2024-08-25T08:21:16Z" level=info msg="long live our new leader: 'linstor-controller-5bc9565d7f-gc9ks'!"
time="2024-08-25T08:21:16Z" level=info msg="starting command '/usr/bin/piraeus-entry.sh' with arguments: '[startController]'"
LINSTOR, Module Controller
Version: 1.28.0 (959382f7b4fb9436fefdd21dfa262e90318edaed)
Build time: 2024-07-11T10:21:06+00:00
Java Version: 17
Java VM: Debian, Version 17.0.11+9-Debian-1deb12u1
Operating system: Linux, Version 5.15.0-119-generic
Environment: amd64, 2 processors, 1986 MiB memory reserved for allocations
System components initialization in progress
Loading configuration file "/etc/linstor/linstor.toml"
08:21:17.620 [main] INFO LINSTOR/Controller - SYSTEM - ErrorReporter DB version 1 found.
08:21:17.621 [main] INFO LINSTOR/Controller - SYSTEM - Log directory set to: '/var/log/linstor-controller'
08:21:17.640 [main] INFO LINSTOR/Controller - SYSTEM - Database type is Kubernetes-CRD
08:21:17.640 [Main] INFO LINSTOR/Controller - SYSTEM - Loading API classes started.
08:21:18.131 [Main] INFO LINSTOR/Controller - SYSTEM - API classes loading finished: 491ms
08:21:18.131 [Main] INFO LINSTOR/Controller - SYSTEM - Dependency injection started.
08:21:18.139 [Main] INFO LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
08:21:18.140 [Main] INFO LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
08:21:18.140 [Main] INFO LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
08:21:18.149 [Main] INFO LINSTOR/Controller - SYSTEM - Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
08:21:18.149 [Main] INFO LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule"
08:21:18.150 [Main] INFO LINSTOR/Controller - SYSTEM - Dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" was successful
08:21:19.707 [Main] INFO LINSTOR/Controller - SYSTEM - Dependency injection finished: 1576ms
08:21:19.708 [Main] INFO LINSTOR/Controller - SYSTEM - Cryptography provider: Using default cryptography module
08:21:20.511 [Main] INFO LINSTOR/Controller - SYSTEM - Initializing authentication subsystem
08:21:21.041 [Main] INFO LINSTOR/Controller - SYSTEM - SpaceTrackingService: Instance added as a system service
08:21:21.043 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
08:21:21.044 [Main] INFO LINSTOR/Controller - SYSTEM - Initializing the k8s crd database connector
08:21:21.044 [Main] INFO LINSTOR/Controller - SYSTEM - Kubernetes-CRD connection URL is "k8s"
08:21:22.463 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'K8sCrdDatabaseService' of type K8sCrdDatabaseService
08:21:22.470 [Main] INFO LINSTOR/Controller - SYSTEM - Security objects load from database is in progress
08:21:23.119 [Main] INFO LINSTOR/Controller - SYSTEM - Security objects load from database completed
08:21:23.121 [Main] INFO LINSTOR/Controller - SYSTEM - Core objects load from database is in progress
08:21:26.580 [Main] INFO LINSTOR/Controller - SYSTEM - Core objects load from database completed
08:21:26.796 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'TaskScheduleService' of type TaskScheduleService
08:21:26.800 [Main] INFO LINSTOR/Controller - SYSTEM - Initializing network communications services
08:21:26.801 [Main] WARN LINSTOR/Controller - SYSTEM - The SSL network communication service 'DebugSslConnector' could not be started because the keyStore file (/etc/linstor/ssl/keystore.jks) is missing
08:21:26.806 [Main] INFO LINSTOR/Controller - SYSTEM - Created network communication service 'PlainConnector'
08:21:26.807 [Main] WARN LINSTOR/Controller - SYSTEM - The SSL network communication service 'SslConnector' could not be started because the keyStore file (/etc/linstor/ssl/keystore.jks) is missing
08:21:26.807 [Main] INFO LINSTOR/Controller - SYSTEM - Created network communication service 'SslConnector'
08:21:26.807 [Main] INFO LINSTOR/Controller - SYSTEM - Reconnecting to previously known nodes
08:21:26.808 [Main] INFO LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw1.example.com' via /10.82.0.23:3366 ...
08:21:26.818 [Main] INFO LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw2.example.com' via /10.82.0.24:3366 ...
08:21:26.819 [Main] INFO LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw3.example.com' via /10.82.0.26:3366 ...
08:21:26.827 [Main] INFO LINSTOR/Controller - SYSTEM - Reconnect requests sent
08:21:26.828 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'SpaceTrackingService' of type SpaceTrackingService
08:21:26.836 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'ScheduleBackupService' of type ScheduleBackupService
08:21:26.837 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'EbsStatusPoll' of type EbsStatusPoll
08:21:26.998 [SpaceTrackingService] ERROR LINSTOR/Controller - SYSTEM - Uncaught exception in j [Report number 66CAE97D-00000-000000]
Aug 25, 2024 8:21:27 AM org.glassfish.jersey.server.wadl.WadlFeature configure
WARNING: JAX-B API not found . WADL feature is disabled.
Aug 25, 2024 8:21:27 AM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [[::]:3370]
Aug 25, 2024 8:21:27 AM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
08:21:27.792 [Main] WARN LINSTOR/Controller - SYSTEM - Not calling 'systemd-notify' as NOTIFY_SOCKET is null
08:21:27.793 [Main] INFO LINSTOR/Controller - SYSTEM - Controller initialized
08:21:27.794 [SslConnector] INFO LINSTOR/Controller - SYSTEM - Sending authentication to satellite 'k8sw1.example.com'
08:21:27.826 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw3.example.com' via /10.82.0.26:3366 ...
08:21:27.827 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw2.example.com' via /10.82.0.24:3366 ...
08:21:27.830 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - LogArchive: Running log archive on directory: /var/log/linstor-controller
08:21:27.831 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - LogArchive: No logs to archive.
08:21:27.832 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - BalanceResourcesTask/START
08:21:27.840 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - BalanceResourcesTask/END: Adjusted: 0 - Removed: 0
08:21:27.841 [SslConnector] INFO LINSTOR/Controller - SYSTEM - Sending authentication to satellite 'k8sw2.example.com'
08:21:27.849 [SslConnector] INFO LINSTOR/Controller - SYSTEM - Sending authentication to satellite 'k8sw3.example.com'
08:21:28.506 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw1.example.com' from OFFLINE -> CONNECTED
08:21:28.509 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Satellite 'k8sw1.example.com' authenticated
08:21:28.511 [MainWorkerPool-2] INFO LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw2.example.com' from OFFLINE -> CONNECTED
08:21:28.514 [MainWorkerPool-2] INFO LINSTOR/Controller - SYSTEM - Satellite 'k8sw2.example.com' authenticated
08:21:28.515 [MainWorkerPool-2] INFO LINSTOR/Controller - SYSTEM - Sending full sync to Node: 'k8sw2.example.com'.
08:21:28.519 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Sending full sync to Node: 'k8sw1.example.com'.
08:21:28.915 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw3.example.com' from OFFLINE -> CONNECTED
08:21:28.931 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Satellite 'k8sw3.example.com' authenticated
08:21:28.932 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Sending full sync to Node: 'k8sw3.example.com'.
08:21:29.288 [MainWorkerPool-2] INFO LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw3.example.com' from CONNECTED -> ONLINE
08:21:29.472 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw1.example.com' from CONNECTED -> ONLINE
08:21:29.487 [MainWorkerPool-2] INFO LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw2.example.com' from CONNECTED -> ONLINE
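The uncaught exception in SpaceTrackingService (report 66CAE97D-00000-000000) is the only ERROR before the probe starts failing. One way to pull the full error report out of the pod for attachment here (a sketch; the ErrorReport-<id>.log file name follows the usual LINSTOR naming convention, and the directory comes from the "Log directory set to" line above):

# Assumes the linstor-controller Deployment named in the ReplicaSet below.
kubectl -n piraeus-datastore exec deploy/linstor-controller -- \
  cat /var/log/linstor-controller/ErrorReport-66CAE97D-00000-000000.log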
k8s@k8scp2:~$ kubectl describe -n piraeus-datastore pod linstor-controller-5bc9565d7f-gc9ks
Name: linstor-controller-5bc9565d7f-gc9ks
Namespace: piraeus-datastore
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: linstor-controller
Node: k8sw3.example.com/10.82.0.26
Start Time: Sun, 25 Aug 2024 07:59:36 +0000
Labels: app.kubernetes.io/component=linstor-controller
app.kubernetes.io/instance=linstorcluster
app.kubernetes.io/name=piraeus-datastore
pod-template-hash=5bc9565d7f
Annotations: kubectl.kubernetes.io/default-container: linstor-controller
Status: Running
IP: 10.244.5.125
IPs:
IP: 10.244.5.125
Controlled By: ReplicaSet/linstor-controller-5bc9565d7f
Init Containers:
run-migration:
Container ID: containerd://079a193ebd297dacb65d44f98e71ef28ff406e5395569139577ddaf335813bb1
Image: quay.io/piraeusdatastore/piraeus-server:v1.28.0
Image ID: quay.io/piraeusdatastore/piraeus-server@sha256:cc21d7daf44f7f328f9175ff443a0d2289efa3038182d8892ce753d9e67178a3
Port: <none>
Host Port: <none>
Args:
runMigration
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 25 Aug 2024 08:00:37 +0000
Finished: Sun, 25 Aug 2024 08:01:22 +0000
Ready: True
Restart Count: 0
Environment:
K8S_AWAIT_ELECTION_ENABLED: 1
K8S_AWAIT_ELECTION_NAME: linstor-controller
K8S_AWAIT_ELECTION_LOCK_NAME: linstor-controller
K8S_AWAIT_ELECTION_LOCK_NAMESPACE: piraeus-datastore (v1:metadata.namespace)
K8S_AWAIT_ELECTION_IDENTITY: linstor-controller-5bc9565d7f-gc9ks (v1:metadata.name)
Mounts:
/etc/linstor from etc-linstor (ro)
/run/migration from migration (rw)
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pkqjc (ro)
Containers:
linstor-controller:
Container ID: containerd://a0e046313a75f00c56a29de1ae58f980526705be3a411c6d4d1fbf5e12773848
Image: quay.io/piraeusdatastore/piraeus-server:v1.28.0
Image ID: quay.io/piraeusdatastore/piraeus-server@sha256:cc21d7daf44f7f328f9175ff443a0d2289efa3038182d8892ce753d9e67178a3
Port: 3370/TCP
Host Port: 0/TCP
Args:
startController
State: Running
Started: Sun, 25 Aug 2024 08:21:16 +0000
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Sun, 25 Aug 2024 08:16:16 +0000
Finished: Sun, 25 Aug 2024 08:21:16 +0000
Ready: False
Restart Count: 4
Liveness: http-get http://:api/health delay=0s timeout=1s period=10s #success=1 #failure=3
Startup: http-get http://:api/health delay=0s timeout=1s period=10s #success=1 #failure=30
Environment:
JAVA_OPTS: -Djdk.tls.acknowledgeCloseNotify=true
K8S_AWAIT_ELECTION_ENABLED: 1
K8S_AWAIT_ELECTION_NAME: linstor-controller
K8S_AWAIT_ELECTION_LOCK_NAME: linstor-controller
K8S_AWAIT_ELECTION_LOCK_NAMESPACE: piraeus-datastore (v1:metadata.namespace)
K8S_AWAIT_ELECTION_IDENTITY: linstor-controller-5bc9565d7f-gc9ks (v1:metadata.name)
Mounts:
/etc/linstor from etc-linstor (ro)
/tmp from tmp (rw)
/var/log/linstor-controller from var-log-linstor-controller (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pkqjc (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
etc-linstor:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: linstor-controller-config
Optional: false
var-log-linstor-controller:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
migration:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-pkqjc:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: worker=true
Tolerations: drbd.linbit.com/force-io-error:NoSchedule
drbd.linbit.com/lost-quorum:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25m default-scheduler Successfully assigned piraeus-datastore/linstor-controller-5bc9565d7f-gc9ks to k8sw3.example.com
Normal Pulling 25m kubelet Pulling image "quay.io/piraeusdatastore/piraeus-server:v1.28.0"
Normal Pulled 24m kubelet Successfully pulled image "quay.io/piraeusdatastore/piraeus-server:v1.28.0" in 52.078s (1m0.455s including waiting). Image size: 323505259 bytes.
Normal Created 24m kubelet Created container run-migration
Normal Started 24m kubelet Started container run-migration
Normal Pulled 23m kubelet Container image "quay.io/piraeusdatastore/piraeus-server:v1.28.0" already present on machine
Normal Created 23m kubelet Created container linstor-controller
Normal Started 23m kubelet Started container linstor-controller
Warning Unhealthy 23m kubelet Startup probe failed: Get "http://10.244.5.125:3370/health": dial tcp 10.244.5.125:3370: connect: connection refused
Warning Unhealthy 33s (x139 over 23m) kubelet Startup probe failed: HTTP probe failed with statuscode: 500
Happy to provide more info if told where to look.
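For reference, the failing startup probe can also be queried directly to capture the 500 response body (a sketch; pod IP and port 3370 are taken from the describe output above, so this needs to run somewhere with pod-network access, e.g. on one of the nodes):

# Hit the controller's health endpoint the same way the kubelet probe does.
curl -v http://10.244.5.125:3370/health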
08:21:26.998 [SpaceTrackingService] ERROR LINSTOR/Controller - SYSTEM - Uncaught exception in j [Report number 66CAE97D-00000-000000]
For some reason this service does not want to start. This could be related to some corrupted DB entries. You can try to clean the space tracking database:
- Create a backup of the entries:
  kubectl get -oyaml spacehistory.internal.linstor.linbit.com > spacehistory.yaml
  Attach it to this issue so we can debug the issue.
- Delete the entries:
  kubectl delete spacehistory.internal.linstor.linbit.com --all
  and see if the controller starts.
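For convenience, the whole workaround as one shell sketch (the namespace is taken from the pod listing above; the final watch just confirms whether the crash-looping controller pod becomes ready on its own):

#!/usr/bin/env bash
set -euo pipefail

# 1. Back up the space tracking entries so they can be attached to this issue.
kubectl get -oyaml spacehistory.internal.linstor.linbit.com > spacehistory.yaml

# 2. Delete the entries suspected of being corrupted.
kubectl delete spacehistory.internal.linstor.linbit.com --all

# 3. Watch the controller pod and see if it becomes ready.
kubectl -n piraeus-datastore get pods -w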
Deleting the space history did indeed make the controller become ready again. File attached: spacehistory.txt
Same issue here, deleting spacehistory resolved it. I can send my db if needed.
We experienced this again on our production cluster when upgrading from 2.5.1. Luckily we knew about the workaround.