Database entry of table LAYER_DRBD_VOLUMES could not be restored
I forgot to disable a linstor schedule, resulting in quite a few snapshots. After removing ~1048 snapshot backups, the database seems to have become corrupted.
Although the schedule was configured to keep only a few copies locally, I ended up with >1000 snapshots and S3 objects.
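For a sense of scale, a rough count can be taken like this (a sketch; the remote name s3 matches the commands below, and the line-based count is only approximate):
# rough count of local snapshots and of backups on the s3 remote
linstor snapshot list | grep -c pvc-
linstor backup list s3 | grep -c pvc-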
LINSTOR ==> schedule list
╭───────────────────────────────────────────────────────────────────────╮
┊ Name ┊ Full ┊ Incremental ┊ KeepLocal ┊ KeepRemote ┊ OnFailure ┊
╞═══════════════════════════════════════════════════════════════════════╡
┊ Daily ┊ 2 * * * * ┊ 2/15 * * * * ┊ 2 ┊ 4 ┊ RETRY(2) ┊
╰───────────────────────────────────────────────────────────────────────╯
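For the record, disabling the schedule should look roughly like this (argument order is from memory and may differ; dropping the schedule definition entirely is the blunter alternative):
# stop scheduled shipping for this remote/schedule pair
linstor backup schedule disable s3 Daily
# or remove the schedule definition altogether
linstor schedule delete Daily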
A partial list of the snapshots: partial_snapshot_list.txt
Removing the backups from S3 (linstor backup delete all s3) worked in the sense that the files are gone, but it errored out with the following:
root@linstor-controller-6c455fb579-bbk62:/# linstor backup delete all s3
ERROR:
attempt to replace an active transMgr
Show reports:
linstor error-reports show 692965E4-00000-000909
Report: report_692965E4-00000-000909.txt
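For reference, the attached report was dumped roughly like this (the redirect to a file is just how I saved it):
# list reports on the controller, then save the one referenced above
linstor error-reports list
linstor error-reports show 692965E4-00000-000909 > report_692965E4-00000-000909.txt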
At this point the snapshots were stuck in a Deleting state, and removing them via the Web UI or the linstor CLI yielded similar "attempt to replace an active transMgr" errors. I then restarted the controller, which failed with the error in question:
linstor-controller time="2025-12-08T15:32:25Z" level=info msg="running k8s-await-election" version=refs/tags/v0.4.1
linstor-controller time="2025-12-08T15:32:25Z" level=info msg="no status endpoint specified, will not be created"
linstor-controller I1208 15:32:25.887402 1 leaderelection.go:250] attempting to acquire leader lease piraeus-datastore/linstor-controller...
linstor-controller I1208 15:32:25.903707 1 leaderelection.go:260] successfully acquired lease piraeus-datastore/linstor-controller
linstor-controller time="2025-12-08T15:32:25Z" level=info msg="long live our new leader: 'linstor-controller-5fff475694-9ln29'!"
linstor-controller time="2025-12-08T15:32:25Z" level=info msg="starting command '/usr/bin/piraeus-entry.sh' with arguments: '[startController]'"
linstor-controller LINSTOR, Module Controller
linstor-controller Version: 1.32.1 (e04f98efc3aeb643cf109ffd322a4f2506000da1)
linstor-controller Build time: 2025-09-16T09:03:12+00:00 Log v2
linstor-controller Java Version: 17
linstor-controller Java VM: Debian, Version 17.0.16+8-Debian-1deb12u1
linstor-controller Operating system: Linux, Version 6.12.57-talos
linstor-controller Environment: amd64, 4 processors, 8192 MiB memory reserved for allocations
linstor-controller
linstor-controller
linstor-controller System components initialization in progress
linstor-controller
linstor-controller Loading configuration file "/etc/linstor/linstor.toml"
linstor-controller 2025-12-08 15:32:26.988 [main] INFO LINSTOR/Controller/ffffff SYSTEM - ErrorReporter DB version 1 found.
linstor-controller 2025-12-08 15:32:26.991 [main] INFO LINSTOR/Controller/ffffff SYSTEM - Log directory set to: '/var/log/linstor-controller'
linstor-controller 2025-12-08 15:32:27.032 [main] INFO LINSTOR/Controller/ffffff SYSTEM - Database type is Kubernetes-CRD
linstor-controller 2025-12-08 15:32:27.033 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Loading API classes started.
linstor-controller 2025-12-08 15:32:27.482 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - API classes loading finished: 449ms
linstor-controller 2025-12-08 15:32:27.482 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Dependency injection started.
linstor-controller 2025-12-08 15:32:27.499 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
linstor-controller 2025-12-08 15:32:27.500 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
linstor-controller 2025-12-08 15:32:27.500 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
linstor-controller 2025-12-08 15:32:27.515 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
linstor-controller 2025-12-08 15:32:27.515 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule"
linstor-controller 2025-12-08 15:32:27.516 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" was successful
linstor-controller 2025-12-08 15:32:28.573 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Dependency injection finished: 1090ms
linstor-controller 2025-12-08 15:32:28.574 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Cryptography provider: Using default cryptography module
linstor-controller 2025-12-08 15:32:28.928 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Initializing authentication subsystem
linstor-controller 2025-12-08 15:32:29.265 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - SpaceTrackingService: Instance added as a system service
linstor-controller 2025-12-08 15:32:29.266 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
linstor-controller 2025-12-08 15:32:29.267 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Initializing the k8s crd database connector
linstor-controller 2025-12-08 15:32:29.267 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Kubernetes-CRD connection URL is "k8s"
linstor-controller 2025-12-08 15:32:31.462 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Starting service instance 'K8sCrdDatabaseService' of type K8sCrdDatabaseService
linstor-controller 2025-12-08 15:32:31.473 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Security objects load from database is in progress
linstor-controller 2025-12-08 15:32:31.928 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Security objects load from database completed
linstor-controller 2025-12-08 15:32:31.928 [Main] INFO LINSTOR/Controller/ffffff SYSTEM - Core objects load from database is in progress
linstor-controller 2025-12-08 15:32:34.918 [Main[] ERROR LINSTOR/Controller/ffffff SYSTEM - Database entry of table LAYER_DRBD_VOLUMES could not be restored. [Report number 6936EF8A-00000-000000]
linstor-controller
linstor-controller 2025-12-08 15:32:34.922 [Main[] ERROR LINSTOR/Controller/ffffff SYSTEM - Unhandled exception [Report number 6936EF8A-00000-000001]
linstor-controller
linstor-controller 2025-12-08 15:32:34.950 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Shutdown in progress
linstor-controller 2025-12-08 15:32:34.950 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Shutting down service instance 'EbsStatusPoll' of type EbsStatusPoll
linstor-controller 2025-12-08 15:32:34.950 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Shutting down service instance 'ScheduleBackupService' of type ScheduleBackupService
linstor-controller 2025-12-08 15:32:34.950 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Shutting down service instance 'SpaceTrackingService' of type SpaceTrackingService
linstor-controller 2025-12-08 15:32:34.951 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Shutting down service instance 'TaskScheduleService' of type TaskScheduleService
linstor-controller 2025-12-08 15:32:34.951 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Shutting down service instance 'K8sCrdDatabaseService' of type K8sCrdDatabaseService
linstor-controller 2025-12-08 15:32:34.968 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Shutting down service instance 'TimerEventService' of type TimerEventService
linstor-controller 2025-12-08 15:32:34.969 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Waiting for service instance 'EbsStatusPoll' to complete shutdown
linstor-controller 2025-12-08 15:32:34.969 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Waiting for service instance 'ScheduleBackupService' to complete shutdown
linstor-controller 2025-12-08 15:32:34.969 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Waiting for service instance 'SpaceTrackingService' to complete shutdown
linstor-controller 2025-12-08 15:32:34.969 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Waiting for service instance 'TaskScheduleService' to complete shutdown
linstor-controller 2025-12-08 15:32:34.969 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Waiting for service instance 'K8sCrdDatabaseService' to complete shutdown
linstor-controller 2025-12-08 15:32:34.970 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Waiting for service instance 'TimerEventService' to complete shutdown
linstor-controller 2025-12-08 15:32:34.970 [Thread-2] INFO LINSTOR/Controller/a85651 SYSTEM - Shutdown complete
linstor-controller time="2025-12-08T15:32:35Z" level=fatal msg="failed to run" err="exit status 20"
stream closed: EOF for piraeus-datastore/linstor-controller-5fff475694-9ln29 (linstor-controller)
I have the complete logs stored in Loki (for 14 or so more days), so let me know if more information is required. I've left the (non-production) cluster in the same state, so debugging it is an option.
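In case it helps, exporting that window from Loki would look roughly like this (label selectors are guesses for this setup, and LOKI_ADDR is assumed to be set):
# pull ~14 days of controller logs from Loki
logcli query --since=336h --limit=100000 --output=raw '{namespace="piraeus-datastore", pod=~"linstor-controller.*"}' > linstor-controller.log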
Please get /var/log/linstor-controller/ErrorReport-6936EF8A-00000-000000.log from the controller so that we can see what prevents the controller from starting. Even if this is just a test cluster and even if you would be fine with throwing everything away, that ErrorReport could give us an additional hint about how the database got corrupted... at least a small chance 🙂.
Regarding the "keep" property not working properly, as well as the "attempt to replace an active transMgr" - I think we have enough information to dig deeper into those, so thanks for the report.
After such an "attempt to replace an active transMgr" (which should not have happened in the first place of course) restarting the controller was the right thing to do. There is no other way to get out of that situation. That the database is also corrupt is most likely a separate issue and I would surely like to know when the corruption occurred in the first place. Not sure if we will be able to get to the root of that since that could have happened days or even weeks ago.
Sorry for the delay but here is the report. Let me know if you need anything else!
report_6936EF8A-00000-000000.txt
The cluster hasn't been running piraeus for very long, so the chance that the cause lies within the last 7 days is pretty big. I had some shenanigans in my attempts to restore a PVC (pvc-restored-mealie) that I had backed up to a linstor S3 remote to test the restore procedure. My success with linstor restore was limited and I ended up using the VolumeSnapshotContent method, which worked.
I could imagine that in my attempts to clean up those earlier remnants I hit the wrong button. Doing a "list 100 snapshots in the Web UI -> select all -> delete" was one of my first actions, I think.
Thanks. Would you mind sharing the database as well? Either attach it here or send it to me via email (see my profile).
Since your controller does not work, a linstor sos dl is not an option, but you could still try a
/usr/share/linstor-server/bin/linstor-database export-db -c /etc/linstor /root/exported-db.json
If that does not work either, then simply grab the CRD entries directly somehow, with something like
mkdir -p /root/k8s; kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs -i{} sh -c "kubectl get {} -oyaml > /root/k8s/{}.yaml"; kubectl get crd -oyaml > /root/k8s_crds.yaml
and tar+gz all of that.
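As a sketch of that last packaging step, assuming the dump above was written to /root:
# bundle the CRD dump for sending
tar -czf /root/linstor-crd-dump.tar.gz -C /root k8s k8s_crds.yaml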
Regarding the shenanigans: that should be fine as long as you used linstor commands (directly or indirectly) and did not modify the database directly. The ErrorReport states that a DrbdVolume is missing its parent DrbdResource. That is simply an inconsistency in the database and should never happen. The entire "please send me the database" request is only to figure out the actual resource name, in the hope that it might ring a bell regarding one of the mentioned shenanigans. IIRC there are already a few GitHub issues open similar to this one (inconsistent CRD database), but I was never able to figure out when the inconsistency happens or what triggers it (and therefore how to properly fix this :/ )
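For reference, a rough way to spot such an orphaned entry in a CRD dump like the one above (a sketch: the CRD file names are derived from the table names, and it assumes all resources are named pvc-<uuid>):
# resource names present in the DRBD volume layer but missing from the DRBD resource layer
grep -oE 'pvc-[0-9a-f-]{36}' /root/k8s/layerdrbdvolumes.internal.linstor.linbit.com.yaml | sort -u > /tmp/vlm-names
grep -oE 'pvc-[0-9a-f-]{36}' /root/k8s/layerdrbdresources.internal.linstor.linbit.com.yaml | sort -u > /tmp/rsc-names
comm -23 /tmp/vlm-names /tmp/rsc-names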
As the controller isn't running, it's a bit of a puzzle to get the files together. I can access the emptyDir volumes on the host and spawn a new container in the Pod, but I can't actually target the linstor-controller container with kubectl debug since it's not running.
Unless you have a suggestion on how to get the database export some other way, I fear it's CRDs all the way. I've mailed them to you.
To confirm: my actions were linstor CLI / Web UI / Kubernetes resources only. I hope the gathered information helps you narrow down the cause of similar issues. Let me know if I can help in any way!
As this might be related to the root cause, I thought I should mention it:
After restoring my piraeus-controller backup (effectively resetting to zero) I started cleaning up the orphaned LVs. Some LVs were still in use even though there should have been only one. It seems that the removal of my mimir-test namespace on 2025-12-08 15:30:00 did not clean up its PVCs all the way.
lsblk
root@worker-p02-n04:/# lsblk /dev/xvdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
xvdb 202:16 0 500G 0 disk
|-vg0-thinpool_tmeta 251:0 0 128M 0 lvm
| `-vg0-thinpool-tpool 251:2 0 499.7G 0 lvm
| |-vg0-thinpool 251:3 0 499.7G 1 lvm
| |-vg0-pvc--470bdcb1--6368--416f--9463--99d48a8800ad_00000 251:5 0 5G 0 lvm
| | `-drbd1002 147:1002 0 5G 0 disk
| |-vg0-pvc--26a110b4--8813--4e6f--9782--961c10a4953a_00000 251:6 0 2G 0 lvm
| | `-drbd1004 147:1004 0 2G 0 disk
| |-vg0-pvc--b344515a--25c3--4aef--a78e--32ecd5fc2d51_00000 251:7 0 2G 0 lvm
| | `-drbd1008 147:1008 0 2G 0 disk
| |-vg0-pvc--b91ca166--b8b2--4c79--8146--ee68d089d864_00000 251:8 0 2G 0 lvm
| | `-drbd1012 147:1012 0 2G 0 disk
| `-vg0-pvc--61d4b4ff--535a--41e7--80b0--8f01bfe840be_00000 251:149 0 1G 0 lvm
| `-drbd1014 147:1014 0 1G 0 disk
`-vg0-thinpool_tdata 251:1 0 499.7G 0 lvm
`-vg0-thinpool-tpool 251:2 0 499.7G 0 lvm
|-vg0-thinpool 251:3 0 499.7G 1 lvm
|-vg0-pvc--470bdcb1--6368--416f--9463--99d48a8800ad_00000 251:5 0 5G 0 lvm
| `-drbd1002 147:1002 0 5G 0 disk
|-vg0-pvc--26a110b4--8813--4e6f--9782--961c10a4953a_00000 251:6 0 2G 0 lvm
| `-drbd1004 147:1004 0 2G 0 disk
|-vg0-pvc--b344515a--25c3--4aef--a78e--32ecd5fc2d51_00000 251:7 0 2G 0 lvm
| `-drbd1008 147:1008 0 2G 0 disk
|-vg0-pvc--b91ca166--b8b2--4c79--8146--ee68d089d864_00000 251:8 0 2G 0 lvm
| `-drbd1012 147:1012 0 2G 0 disk
`-vg0-pvc--61d4b4ff--535a--41e7--80b0--8f01bfe840be_00000 251:149 0 1G 0 lvm
`-drbd1014 147:1014 0 1G 0 disk
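A quicker way to see which of these backing LVs are still held open (the sixth lv_attr character is 'o' when the device is open):
# show per-LV attributes and thin-pool usage for vg0
lvs -o lv_name,lv_attr,data_percent vg0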
drbdsetup status
root@worker-p02-n04:/# drbdsetup status -v
pvc-26a110b4-8813-4e6f-9782-961c10a4953a node-id:0 role:Secondary suspended:no force-io-failures:no
volume:0 minor:1004 disk:UpToDate backing_dev:/dev/vg0/pvc-26a110b4-8813-4e6f-9782-961c10a4953a_00000 quorum:yes open:no blocked:no
pvc-470bdcb1-6368-416f-9463-99d48a8800ad node-id:0 role:Secondary suspended:no force-io-failures:no
volume:0 minor:1002 disk:UpToDate backing_dev:/dev/vg0/pvc-470bdcb1-6368-416f-9463-99d48a8800ad_00000 quorum:yes open:no blocked:no
pvc-61d4b4ff-535a-41e7-80b0-8f01bfe840be node-id:0 role:Secondary suspended:no force-io-failures:no
volume:0 minor:1014 disk:UpToDate backing_dev:/dev/vg0/pvc-61d4b4ff-535a-41e7-80b0-8f01bfe840be_00000 quorum:yes open:no blocked:no
worker-p02-n01 node-id:1 connection:StandAlone role:Unknown tls:no congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Off peer-disk:DUnknown resync-suspended:no
pvc-b344515a-25c3-4aef-a78e-32ecd5fc2d51 node-id:0 role:Secondary suspended:no force-io-failures:no
volume:0 minor:1008 disk:UpToDate backing_dev:/dev/vg0/pvc-b344515a-25c3-4aef-a78e-32ecd5fc2d51_00000 quorum:yes open:no blocked:no
pvc-b91ca166-b8b2-4c79-8146-ee68d089d864 node-id:0 role:Secondary suspended:no force-io-failures:no
volume:0 minor:1012 disk:UpToDate backing_dev:/dev/vg0/pvc-b91ca166-b8b2-4c79-8146-ee68d089d864_00000 quorum:yes open:no blocked:no
Note: I've manually disconnected these volumes from within the linstor-satellites, so that's all good now as far as I can tell.
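For completeness, the kind of per-resource cleanup this amounted to looks roughly like this (a sketch, not the exact commands I ran; only safe once LINSTOR no longer references the resource):
# on the affected satellite, using one of the resources listed above as an example
drbdsetup down pvc-470bdcb1-6368-416f-9463-99d48a8800ad
lvremove -y vg0/pvc-470bdcb1-6368-416f-9463-99d48a8800ad_00000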