Error creating rollback entry - Resource SecObjectProtection not found
hello!
Urgent help needed!
OS: Oracle Linux 8
Kernel: UEK 5.15.0-308.179.6.el8uek.x86_64
piraeus-operator: v2.8.0
velero: v1.15.2
Got an error during backup of ANY volume:
time="2025-05-29T15:24:24Z" level=warning msg="VolumeSnapshotContent snapcontent-5fefd56c-f43b-45cf-8998-848f9b119661 has error: Failed to check and update snapshot content: failed to take snapshot of the volume pvc-e60b0f01-dfe4-4237-b4aa-b59aa70828b6: \"rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[6822F3BD-00000-001896]'\"" backup=velero/ksy-20250529-02 cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:744" pluginName=velero
ERROR REPORT 6822F3BD-00000-001896
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.30.4
Build ID: bef74a44609cb592c5efad2e707b50e696623c61
Build time: 2025-02-03T15:48:28+00:00
Error time: 2025-05-29 15:23:46
Node: linstor-controller-5dd47c5b84-7rftt
Thread: grizzly-http-server-8
============================================================
Reported error:
===============
Category: RuntimeException
Class name: ApiTransactionException
Class canonical name: com.linbit.linstor.core.apicallhandler.response.ApiTransactionException
Generated at: Method 'commit', Source file 'CtrlTransactionHelper.java', Line #30
Error message: com.linbit.linstor.transaction.TransactionException: Error creating rollback entry
Call backtrace:
Method Native Class:Line number
commit N com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:30
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor204:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
Caused by:
==========
Category: RuntimeException
Class name: TransactionException
Class canonical name: com.linbit.linstor.transaction.TransactionException
Generated at: Method 'commit', Source file 'ControllerK8sCrdTransactionMgr.java', Line #157
Error message: Error creating rollback entry
ErrorContext:
Call backtrace:
Method Native Class:Line number
commit N com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:157
commit N com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor204:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
Caused by:
==========
Category: LinStorException
Class name: DatabaseException
Class canonical name: com.linbit.linstor.dbdrivers.DatabaseException
Generated at: Method 'createRollbackEntry', Source file 'ControllerK8sCrdRollbackMgr.java', Line #84
Error message: Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found
ErrorContext:
Call backtrace:
Method Native Class:Line number
createRollbackEntry N com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr:84
commit N com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:152
commit N com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor204:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
END OF ERROR REPORT.
I have checked here, where it's recommended to create the problematic resource, but with what contents should I create it? I can't find any mention of this resource in etcd, and the error is triggered for ANY volume we try to back up, always with the same SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686.
Hello, sorry for the delayed response.
Try this content:
apiVersion: "internal.linstor.linbit.com/v1-27-1"
kind: "SecObjectProtection"
metadata:
name: "b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686"
spec:
object_path: "/keyvaluestores/CSI-BACKUP-MAPPING"
creator_identity_name: "PUBLIC"
owner_role_name: "PUBLIC"
security_type_name: "PUBLIC"
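As a minimal usage sketch (assuming you save the manifest above as secobjectprotection.yaml; the filename is arbitrary), the object can then be created with:
kubectl apply -f secobjectprotection.yaml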
If there are also other entries missing, I'd like to ask you to send a k8s dump to my email so I can see if I can fix this for you.
Hello and thanks for the answer! Unfortunately the suggested fix did not help. I have created the object but still get the same error:
time="2025-06-02T14:31:23Z" level=info msg="Created VolumeSnapshot namespace/velero-test-pvc-xfs-wgpb6" backup=velero/ksy-20250602-02 cmd=/velero logSource="pkg/backup/actions/csi/pvc_action.go:187" pluginName=velero
time="2025-06-02T14:31:23Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot namespace/velero-test-pvc-xfs-wgpb6. Retrying in 5s" backup=velero/ksy-20250602-02 cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:712" pluginName=velero
time="2025-06-02T14:31:28Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-fe1ccd23-40fa-46f9-91ea-888955e6f96d to have snapshot handle. Retrying in 5s" backup=velero/ksy-20250602-02 cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2025-06-02T14:31:28Z" level=warning msg="VolumeSnapshotContent snapcontent-fe1ccd23-40fa-46f9-91ea-888955e6f96d has error: Failed to check and update snapshot content: failed to take snapshot of the volume pvc-e60b0f01-dfe4-4237-b4aa-b59aa70828b6: \"rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[6822F3BD-00000-002865]'\"" backup=velero/ksy-20250602-02 cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:744" pluginName=velero
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor error-reports show 6822F3BD-00000-002865
...
Category: LinStorException
Class name: DatabaseException
Class canonical name: com.linbit.linstor.dbdrivers.DatabaseException
Generated at: Method 'createRollbackEntry', Source file 'ControllerK8sCrdRollbackMgr.java', Line #84
Error message: Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found
ErrorContext:
...
kubectl get secobjectprotection.internal.linstor.linbit.com/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 -o yaml
apiVersion: internal.linstor.linbit.com/v1-15-0
kind: SecObjectProtection
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"internal.linstor.linbit.com/v1-15-0","kind":"SecObjectProtection","metadata":{"annotations":{},"name":"b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686"},"spec":{"creator_identity_name":"PUBLIC","object_path":"/keyvaluestores/CSI-BACKUP-MAPPING","owner_role_name":"PUBLIC","security_type_name":"PUBLIC"}}
  creationTimestamp: "2025-06-02T14:31:01Z"
  generation: 1
  name: b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686
  resourceVersion: "2065499381"
  uid: 5f038f8b-e159-447c-8151-db3c8aa56bd0
spec:
  creator_identity_name: PUBLIC
  object_path: /keyvaluestores/CSI-BACKUP-MAPPING
  owner_role_name: PUBLIC
  security_type_name: PUBLIC
Sorry, I forgot that LINSTOR is internally caching a few things. That would at least explain the discrepancy between LINSTOR's "not found" and the "exists" status from kubectl get ....
However, restarting the controller carries a risk: if the database is not consistent right now, the controller might not start properly. So either try to clone your k8s database to some playground project and try to start the controller there, or just send me a dump of your k8s database to my email address so I can verify the database consistency.
If you want to try it yourself, please make sure that the dummy controller cannot reach the actual satellites to avoid unnecessary reconnects and OFFLINE(Other controller) status messages from the actual controller.
If it turns out that there are other issues in the database, I guess you will need to send me the database anyway, since it can be quite cumbersome to figure out what is missing.
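For reference, a rough sketch of how such a dump could be collected - it simply exports every CR in the internal.linstor.linbit.com group that LINSTOR uses as its K8s backend (the group name is taken from the SecObjectProtection object above; everything else is plain kubectl and shell):
# dump all LINSTOR-internal CRs into one file for offline inspection
for crd in $(kubectl get crds -o name | grep internal.linstor.linbit.com | cut -d/ -f2); do
  echo "---"
  kubectl get "$crd" -o yaml
done > linstor-k8s-db-dump.yaml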
Yes, I'm also afraid to restart the controller. We will discuss it in our team and I will try what you suggested, or send you the dump later on. Thank you again for your help.
Hi! The database was OK, so we restarted the controller. The error disappeared, but now there's a new one:
2025-06-05 05:55:59.231 [DeviceManager] ERROR LINSTOR/Satellite/2fcb4d SYSTEM - Generated resource file for resource 'pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19' is invalid. [Report number 68406749-5C801-000023]
The external command sent the following error information:
/var/lib/linstor.d/pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19.res_tmp:103: in resource pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19
ipv4:10.1.1.18:7023 is also used /var/lib/linstor.d/pvc-3279d999-4a6d-4c8e-bf6b-2536f091b8ab.res:71 (resource pvc-3279d999-4a6d-4c8e-bf6b-2536f091b8ab)
/var/lib/linstor.d/pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19.res_tmp:104: in resource pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19
ipv4:10.1.1.15:7023 is also used /var/lib/linstor.d/pvc-3279d999-4a6d-4c8e-bf6b-2536f091b8ab.res:72 (resource pvc-3279d999-4a6d-4c8e-bf6b-2536f091b8ab)
command sh-nop exited with code 10
Both PVCs were created during the velero backup.
Can you check if both PVCs still exist in linstor rd l and also if both really have the same TCP port? If so, I guess we will need to fix this in the DB, but I'd be curious how you ended up with duplicated TCP ports..
We have already fixed it I hope.
I can't say exactly if both resources were in 'r l', but definitely at least one of them was hanging in the Connecting state. While the backup was still running I could see that both of these PVCs were temporary PVCs created by velero. When the backup finished, the resource was still in the Connecting state and there were 3 rd-s in the database not belonging to any PVC. The problematic PVCs were no longer listed in the API. We deleted the dangling rd-s manually; one of them kept failing until we found and also deleted the corresponding PV, which was still visible in the K8s API. Now the number of rd-s and resources corresponds to the number of PVCs.
Maybe I will also recheck the database to see if something related to them is still left.
P.S. Once or twice we have already had this issue with a duplicate port after backups; we had to find all the corresponding objects in the DB and delete them manually.
Deleting entries from the database manually is always a bit risky, since you can overlook some entries and end up with orphaned data. The issue here is not just that it clutters the database; new entries might also accidentally relate to those orphaned entries, making it nearly impossible to figure out how one got into this state.
This is why I usually try to figure out if the same port really is in the database itself, or if we only have some orphaned .res file lying around that drbdadm finds when checking the validity of .res files. In the latter case, cleaning up the .res file should be preferred over trying to manipulate the database manually.
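For example, a rough sketch for the file-side check (assuming the piraeus-operator's per-node satellite DaemonSet naming, linstor-satellite.<node>; 7023 is the port from the drbdadm error above) would be to grep the generated DRBD config files on the satellite for that port:
kubectl -n piraeus-datastore exec daemonsets/linstor-satellite.<node> -c linstor-satellite -- \
  sh -c 'grep -H ":7023;" /var/lib/linstor.d/*.res'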
Thank you @ghernadi. So could you please confirm that the most appropriate action, in case we face the situation again, is to delete the problematic .res file from the satellite? Could you also give some recommendations on how we can trace the port re-use scenario? I can collect the logs if needed.
The last backup partially failed with the error we have already faced: https://github.com/piraeusdatastore/piraeus-operator/issues/824
Yes. To my understanding there are only two possible ways to end up in a duplicated TCP port scenario (or duplicated minor number):
- LINSTOR made a mistake and the same TCP port is actually in the database. You should be able to check this using rd l and looking for duplicated ports (or in future this will be moved to r l --show-drbd-ports or something like that).
- There is some other .res file lying around on your satellite that LINSTOR does not know about (or maybe "no longer knows about"). When running any drbdadm command, drbdadm scans all available .res files and runs some uniqueness checks, for example for TCP ports, minor numbers, etc. That means drbdadm does indeed take .res files into account that are not known by LINSTOR, maybe not even managed by LINSTOR, so drbdadm can complain about duplicated settings.
In the first case, there is not much you can do besides trying to delete the resource while the controller is still up and running. If the controller is already offline, it may not start properly due to the duplicated entry, in which case you will need to clean up the DB itself (which is quite cumbersome, as mentioned earlier).
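As a rough sketch for that check (reusing the controller deployment name from earlier in this thread; 7023 is just the port from the error report above), you can count how many resource definitions list the suspect port:
kubectl -n piraeus-datastore exec deploy/linstor-controller -- \
  linstor resource-definition list | grep -c ' 7023 '
# anything greater than 1 means the duplicate really is in the database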
In the latter case, you are correct with your assumption. You can try restarting the satellite (not the controller). When the satellite starts, it first deletes all .res files in /var/lib/linstor.d (unless LS_KEEP_RES is used). Once it receives the data from the controller, the satellite rewrites all needed .res files and checks if the DRBD resources need to be adjusted.
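A minimal sketch of that restart, assuming the piraeus-operator's per-node satellite DaemonSets (linstor-satellite.<node>) seen elsewhere in this thread:
# recreate the satellite pod; on startup it wipes and regenerates /var/lib/linstor.d/*.res
kubectl -n piraeus-datastore rollout restart daemonset/linstor-satellite.<node>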
Now that I have mentioned this orphaned .res file scenario: in 1.31.1 we released a fix for a bug where, if you had a DRBD resource manually downed (drbdadm down $rsc) and deleted the LINSTOR resource while the corresponding DRBD resource was down, LINSTOR did not properly delete the .res file. LINSTOR would in this case re-use the TCP port for the next created resource, since the previous resource was deleted from the LINSTOR database. The orphaned .res file still remaining on the satellite could then cause the issue described in case 2.
So if it is possible that you had such an incident, you might want to upgrade LINSTOR to avoid this in future.
Today we have faced the situation with the duplicate port again after velero backup. I can confirm the duplicate port was not in rd l. There was only one record with the mentioned port in rd l and it was some dangling velero PVC left after the backup, the PVC was no longer in the cluster, only the PV was left.
I checked the .res files on the satellites and there were some res_tmp files using the same port. The PVCs corresponding to these resources were no longer in the cluster. After some combination of deleting the res_tmp files, restarting the satellites and so on, the issue was finally resolved.
We are planning the upgrade to version 1.31.1. I will let you know if we still face backup issues with velero afterwards.
Glad to hear you could fix this on your own!
However, I am a bit surprised that a res_tmp should cause this issue. To my understanding drbdadm is explicitly only scanning *.res files, not *.res_tmp. The latter file is usually created by LINSTOR in order to check the generated file before replacing the corresponding .res file with it. For that LINSTOR runs commands like
drbdadm --config-to-test /var/lib/linstor.d/rsc.res_tmp --config-to-exclude /var/lib/linstor.d/rsc.res sh-nop
In other words, this "config-to-test" needs to explicitly ignore the "real" [...]/rsc.res file and needs to be given explicitly the [...]/rsc.res_tmp file to test. Otherwise I'd expect drbdadm to happily ignore all files that do not match *.res...
Which means that my best guess would still be some orphaned .res file still occupying the TCP port - which should have been cleaned up by restarting the satellite.
We will make a few changes in the near future, some of them regarding the TCP port handling itself, others regarding the .res_tmp files, so this issue might be addressed even more in future releases (hopefully in a positive way 😁).
Hello, after going through #451, and "fixing" it by deleting the problematic linstor CRDs, I am now experiencing this exact same issue, down to the same ID:
ERROR REPORT 684A99A9-00000-005154
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.31.0
Build ID: a187af5c85a96bb27df87a5eab0bcf9dd6de6a34
Build time: 2025-04-08T07:51:57+00:00
Error time: 2025-06-28 08:11:59
Node: linstor-controller-7496884d58-2xthr
Thread: grizzly-http-server-24
============================================================
Reported error:
===============
Category: RuntimeException
Class name: ApiTransactionException
Class canonical name: com.linbit.linstor.core.apicallhandler.response.ApiTransactionException
Generated at: Method 'commit', Source file 'CtrlTransactionHelper.java', Line #30
Error message: com.linbit.linstor.transaction.TransactionException: Error creating rollback entry
Call backtrace:
Method Native Class:Line number
commit N com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:30
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor232:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
Caused by:
==========
Category: RuntimeException
Class name: TransactionException
Class canonical name: com.linbit.linstor.transaction.TransactionException
Generated at: Method 'commit', Source file 'ControllerK8sCrdTransactionMgr.java', Line #157
Error message: Error creating rollback entry
ErrorContext:
Call backtrace:
Method Native Class:Line number
commit N com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:157
commit N com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor232:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
Caused by:
==========
Category: LinStorException
Class name: DatabaseException
Class canonical name: com.linbit.linstor.dbdrivers.DatabaseException
Generated at: Method 'createRollbackEntry', Source file 'ControllerK8sCrdRollbackMgr.java', Line #84
Error message: Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found
ErrorContext:
Call backtrace:
Method Native Class:Line number
createRollbackEntry N com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr:84
commit N com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:152
commit N com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor232:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
END OF ERROR REPORT.
I wanted to delete the velero namespace and all backup data, because I do not have time to deal with this and my prod can live without backups for a bit, but unfortunately I am seeing that k linstor s l returns all snapshots stuck in DELETING. I think I can solve this by deleting all snapshots and starting over... @ghernadi any idea how to do that? Right now all I have is the error reports above and all snapshots stuck in DELETING.
How do I get out of this in a clean way? I just want to start over (without deleting any actual PVC data).
Sorry for the late response. What error message do you get if you manually try to delete the snapshots?
If you are still getting the same issue as in the original post here, you should also be able to fix that as I mentioned in https://github.com/LINBIT/linstor-server/issues/454#issuecomment-2929114454. However, as also mentioned a few comments later, you will need to restart the controller for this to take effect. But (as also mentioned earlier) restarting the controller might end up in the controller not properly starting. Please read the comments above for more details.
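For completeness, a minimal sketch of that controller restart (using the deployment name from earlier in this thread), keeping the startup risk mentioned above in mind:
# recreate the controller pod, which also drops LINSTOR's in-memory cache
kubectl -n piraeus-datastore rollout restart deployment/linstor-controller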
Hello, sorry for the late response @ghernadi:
Four weeks ago I solved this by nuking the entire snapshot environment: the velero namespace along with all snapshots and associated dependencies in linstor. I updated linstor/DRBD, then re-installed the latest velero and started observing.
Today the issue randomly popped up again:
- 1 minute after a scheduled velero backup starts, this pops up:
k linstor error-reports show 68688EE0-00000-000013
ERROR REPORT 68688EE0-00000-000013
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.31.0
Build ID: a187af5c85a96bb27df87a5eab0bcf9dd6de6a34
Build time: 2025-04-08T07:51:57+00:00
Error time: 2025-07-29 02:31:12
Node: linstor-controller-7496884d58-9tx9x
Thread: grizzly-http-server-43
============================================================
Reported error:
===============
Category: RuntimeException
Class name: ConcurrentModificationException
Class canonical name: java.util.ConcurrentModificationException
Generated at: Method 'nextEntry', Source file 'TreeMap.java', Line #1486
Call backtrace:
Method Native Class:Line number
nextEntry N java.util.TreeMap$PrivateEntryIterator:1486
next N java.util.TreeMap$EntryIterator:1522
next N java.util.TreeMap$EntryIterator:1517
hashCode N java.util.AbstractMap:525
hashCode N com.linbit.linstor.transaction.TransactionMap:274
hash N java.util.HashMap:338
put N java.util.HashMap:610
add N java.util.HashSet:221
register N com.linbit.linstor.transaction.TransactionObjectCollection:29
register N com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:138
setConnection N com.linbit.linstor.transaction.AbsTransactionObject:82
activateTransMgr N com.linbit.linstor.transaction.AbsTransactionObject:159
cache N com.linbit.linstor.transaction.TransactionMap:195
put N com.linbit.linstor.transaction.TransactionMap:138
getInstance N com.linbit.linstor.security.ObjectProtectionFactory:85
create N com.linbit.linstor.core.objects.KeyValueStoreControllerFactory:64
create N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:80
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:165
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor209:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
END OF ERROR REPORT.
Then the velero backup hangs.
- The above is followed by a loop of the following, in k logs linstor-csi-controller:
time="2025-07-30T09:24:45Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[68688EE0-00000-000432]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-3e656054-cd45-4968-9b7f-f34d22e53d82\" name:\"snapshot-b0d85fbb-85a8-4eae-9690-9b02966c587a\"" resp="<nil>" version=v1.7.1-667d213c3653ff1262bc8d0b95be898fe47babeb
printing ad infinitum. These error reports themselves are basically a loop of the original issue as posted above:
k linstor error-reports show 68688EE0-00000-000428
ERROR REPORT 68688EE0-00000-000428
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.31.0
Build ID: a187af5c85a96bb27df87a5eab0bcf9dd6de6a34
Build time: 2025-04-08T07:51:57+00:00
Error time: 2025-07-30 09:04:45
Node: linstor-controller-7496884d58-9tx9x
Thread: grizzly-http-server-46
============================================================
Reported error:
===============
Category: RuntimeException
Class name: ApiTransactionException
Class canonical name: com.linbit.linstor.core.apicallhandler.response.ApiTransactionException
Generated at: Method 'commit', Source file 'CtrlTransactionHelper.java', Line #30
Error message: com.linbit.linstor.transaction.TransactionException: Error creating rollback entry
Call backtrace:
Method Native Class:Line number
commit N com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:30
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor209:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
Caused by:
==========
Category: RuntimeException
Class name: TransactionException
Class canonical name: com.linbit.linstor.transaction.TransactionException
Generated at: Method 'commit', Source file 'ControllerK8sCrdTransactionMgr.java', Line #157
Error message: Error creating rollback entry
ErrorContext:
Call backtrace:
Method Native Class:Line number
commit N com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:157
commit N com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor209:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
Caused by:
==========
Category: LinStorException
Class name: DatabaseException
Class canonical name: com.linbit.linstor.dbdrivers.DatabaseException
Generated at: Method 'createRollbackEntry', Source file 'ControllerK8sCrdRollbackMgr.java', Line #84
Error message: Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found
ErrorContext:
Call backtrace:
Method Native Class:Line number
createRollbackEntry N com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr:84
commit N com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:152
commit N com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
modifyKvs N com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
lambda$modifyKVStore$2 N com.linbit.linstor.api.rest.v1.KeyValueStore:104
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:235
modifyKVStore N com.linbit.linstor.api.rest.v1.KeyValueStore:97
invoke N jdk.internal.reflect.GeneratedMethodAccessor209:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:569
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:478
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:400
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:81
run N org.glassfish.jersey.server.ServerRuntime$1:256
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:235
handle N org.glassfish.jersey.server.ApplicationHandler:684
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:190
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
run N java.lang.Thread:840
END OF ERROR REPORT.
- Finding the problematic volumesnapshot resource and describing it:
k describe volumesnapshot -A
Name: velero-sonarr-config-f6gvc
Namespace: k8s-mediaserver
Labels: app.kubernetes.io/managed-by=Helm
velero.io/backup-name=velero-daily-20250729023013
Annotations: <none>
API Version: snapshot.storage.k8s.io/v1
Kind: VolumeSnapshot
Metadata:
Creation Timestamp: 2025-07-30T08:31:13Z
Finalizers:
snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
Generate Name: velero-sonarr-config-
Generation: 1
Resource Version: 224272726
UID: b0d85fbb-85a8-4eae-9690-9b02966c587a
Spec:
Source:
Persistent Volume Claim Name: sonarr-config
Volume Snapshot Class Name: linstor-snapshot
Status:
Bound Volume Snapshot Content Name: snapcontent-b0d85fbb-85a8-4eae-9690-9b02966c587a
Error:
Message: Failed to check and update snapshot content: failed to take snapshot of the volume pvc-3e656054-cd45-4968-9b7f-f34d22e53d82: "rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[68688EE0-00000-000438]'"
Time: 2025-07-30T09:28:26Z
Ready To Use: false
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreatingSnapshot 57m snapshot-controller Waiting for a snapshot k8s-mediaserver/velero-sonarr-config-f6gvc to be created by the CSI driver.
- Checking for issues with the corresponding PVC (a non-replicated volume on a single node):
k linstor v l | grep d82
| pvc-3e656054-cd45-4968-9b7f-f34d22e53d82 | n3 | hdd-thin | 0 | 1015 | /dev/drbd1015 | 701.29 MiB | InUse | UpToDate |
k exec -n piraeus-datastore daemonsets/linstor-satellite.n3 -c linstor-satellite -- xfs_repair -n /dev/drbd1015
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used. Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
- scan filesystem freespace and inode maps...
sb_ifree 171, counted 173
sb_fdblocks 2404043, counted 2430293
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 3
- agno = 1
Phase 6 - check inode connectivity...
- traversing filesystem ...
No modify flag set, skipping phase 5
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
command terminated with exit code 1
Seems something might be failing silently with this filesystem... the CSI and controller seem to mount it fine though. I scaled the workload to 0, then:
k exec -n piraeus-datastore daemonsets/linstor-satellite.n3 -c linstor-satellite -- xfs_repair -L /dev/drbd1015
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 1
- agno = 3
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (385:45549) is ahead of log (1:2).
Format log to cycle 388.
done
But that didn't solve the issue.
- Next I went and manually deleted the problematic volumesnapshot AND the corresponding linstor snapshot resource:
k linstor s d pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8 snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725
SUCCESS:
Marked snapshot for deletion snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' on nodes 'n3'
SUCCESS:
(n3) Snapshot [ZFS-Thin] with name 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8', volume number 0 deleted.
SUCCESS:
Deleted snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' on 'n3'
SUCCESS:
Snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' on nodes 'n3' deleted.
SUCCESS:
Description:
Snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' deleted.
Details:
Snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' UUID was: 1a36fa9b-311e-4a42-b2a3-f33ff4dc0681
This resolved the issue for this part of the backup. The controller immediately went into the same error for the next volume to snapshot in the list:
k describe volumesnapshot -A
Name: velero-mail-mail-0-47z48
Namespace: mail
Labels: app.kubernetes.io/instance=mail
app.kubernetes.io/name=mail
velero.io/backup-name=velero-daily-20250729023013
velero.io/restore-name=web-20240909184447
velero.io/volume-snapshot-name=velero-mail-mail-0-8vgkj
Annotations: <none>
API Version: snapshot.storage.k8s.io/v1
Kind: VolumeSnapshot
Metadata:
Creation Timestamp: 2025-07-30T09:40:53Z
Finalizers:
snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
Generate Name: velero-mail-mail-0-
Generation: 1
Resource Version: 224280384
UID: 37247916-46f7-4cb9-8f0d-360df19d05e1
Spec:
Source:
Persistent Volume Claim Name: mail-mail-0
Volume Snapshot Class Name: linstor-snapshot
Status:
Bound Volume Snapshot Content Name: snapcontent-37247916-46f7-4cb9-8f0d-360df19d05e1
Error:
Message: Failed to check and update snapshot content: failed to take snapshot of the volume pvc-6f95699e-88a8-44d9-8c22-2250ac7179b6: "rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[68688EE0-00000-000451]'"
Time: 2025-07-30T09:43:01Z
Ready To Use: false
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreatingSnapshot 3m11s snapshot-controller Waiting for a snapshot mail/velero-mail-mail-0-47z48 to be created by the CSI driver.
This will continue in a loop until the velero backup is nuked (and even then it might not be able to exit cleanly, in which case you have to nuke velero entirely and start over). Once the velero backup has failed, the error reports stop pouring in, since the controller stops the loop.
- Not sure if this is relevant: this happens after a node goes into a "lost quorum" loop, because one of the replicated volumes had some heavy writing overnight (>500 GB), which caused the volume to split-brain. This volume is not included in the velero backups (no snapshot of this one), but since the volume is split-brained, the entire node goes into a loop of lost quorum: unschedulable -> quorum recovered: schedulable. I suspect this is linked to the OP issue. I had to manually recover from that this morning before recovering from the failed snapshotter.
@ghernadi The problem here is that we don't want a quick fix, since this will trigger randomly and cascade into breaking the entire backup system. We need a fix: something that prevents the snapshot controller from going into this failure state, either by finding the root cause (quorum lost? --> snapshotter fails) or by disabling the snapshotter until it is fixed (with some update to the documentation).
Hope this helps you figure out why this is happening. Thanks
Sorry, I am not ignoring this issue; it rather gives me quite a headache...
I am quite sure that step 7 has nothing to do with this scenario.
Step 1 does show new information for me, true, but I'm afraid that is still not enough. To be honest, I am not even sure how the ErrorReport in step 1 can even happen, since it requires multiple threads to work on the same object, but I do not see where this object could be reached from more than one thread... Also I do not see a connection in which the error from step 1 should cause the error in step 2. Despite all my attempts to forcefully reproduce step 1 (by manually injecting an error state at your given stack trace), I still fail to trigger step 2, which would be the major cause of the error in this whole scenario...
I will still try to reproduce and understand the two ErrorReports, but if you could set up a second cluster (I do not want to clutter your production, of course!) where you can reproduce this issue, I'd like to ask you to do the following:
- Turn on TRACE logging via linstor c set-log-level TRACE
- Do whatever you need to do to trigger the error shown in step 2 (the first occurrence of Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found is already enough)
- Turn TRACE logging off via linstor c set-log-level INFO
- Run linstor sos dl
Afterwards either throw the cluster away if this was just a test-cluster or do your cleanup, and send me the resulting SOS report to the email address in my profile here.
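A rough end-to-end sketch of those steps, run through the controller pod as in the earlier commands (the subcommands are exactly the abbreviations given above):
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor c set-log-level TRACE
# ... reproduce the failing snapshot, e.g. by re-running the velero backup ...
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor c set-log-level INFO
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor sos dl
# then copy the generated sos report archive out of the controller pod, e.g. with kubectl cp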
I assume there are multiple things happening concurrently either in LINSTOR or outside (or both) which are simply not covered by the ErrorReports but should be covered by the TRACE logging. Also the SOS report should collect all ErrorReports so in case there are more relevant ErrorReports I should get those too.