linstor-server icon indicating copy to clipboard operation
linstor-server copied to clipboard

Error creating rollback entry - Resource SecObjectProtection not found

Open ksyblast opened this issue 7 months ago • 16 comments

hello!

Urgent help needed!

OS: Oracle Linux 8 Kernel: UEK 5.15.0-308.179.6.el8uek.x86_64 piraeus-operator: v2.8.0 velero: v1.15.2

Got an error during backup of ANY volume:

time="2025-05-29T15:24:24Z" level=warning msg="VolumeSnapshotContent snapcontent-5fefd56c-f43b-45cf-8998-848f9b119661 has error: Failed to check and update snapshot content: failed to take snapshot of the volume pvc-e60b0f01-dfe4-4237-b4aa-b59aa70828b6: \"rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[6822F3BD-00000-001896]'\"" backup=velero/ksy-20250529-02 cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:744" pluginName=velero

ERROR REPORT 6822F3BD-00000-001896

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.30.4
Build ID:                           bef74a44609cb592c5efad2e707b50e696623c61
Build time:                         2025-02-03T15:48:28+00:00
Error time:                         2025-05-29 15:23:46
Node:                               linstor-controller-5dd47c5b84-7rftt
Thread:                             grizzly-http-server-8

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         ApiTransactionException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.ApiTransactionException
Generated at:                       Method 'commit', Source file 'CtrlTransactionHelper.java', Line #30

Error message:                      com.linbit.linstor.transaction.TransactionException: Error creating rollback entry

Call backtrace:

    Method                                   Native Class:Line number
    commit                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:30
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor204:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840

Caused by:
==========

Category:                           RuntimeException
Class name:                         TransactionException
Class canonical name:               com.linbit.linstor.transaction.TransactionException
Generated at:                       Method 'commit', Source file 'ControllerK8sCrdTransactionMgr.java', Line #157

Error message:                      Error creating rollback entry

ErrorContext: 

Call backtrace:

    Method                                   Native Class:Line number
    commit                                   N      com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:157
    commit                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor204:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840

Caused by:
==========

Category:                           LinStorException
Class name:                         DatabaseException
Class canonical name:               com.linbit.linstor.dbdrivers.DatabaseException
Generated at:                       Method 'createRollbackEntry', Source file 'ControllerK8sCrdRollbackMgr.java', Line #84

Error message:                      Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    createRollbackEntry                      N      com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr:84
    commit                                   N      com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:152
    commit                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor204:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840


END OF ERROR REPORT.


I have checked here it's recommended to create the problematic resource, but with which contents should I do it? I don't find any mentions of this resource in etcd and it's triggered for ANY volume we are trying to backup with the same SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686.

ksyblast avatar May 29 '25 15:05 ksyblast

Hello, sorry for the delayed response.

Try this content:

apiVersion: "internal.linstor.linbit.com/v1-27-1"
kind: "SecObjectProtection"
metadata:
  name: "b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686"
spec:
  object_path: "/keyvaluestores/CSI-BACKUP-MAPPING"
  creator_identity_name: "PUBLIC"
  owner_role_name: "PUBLIC"
  security_type_name: "PUBLIC"

If there are also other entries missing, I'd like to ask you to send a k8s dump to my email so I can see if I can fix this for you.

ghernadi avatar Jun 02 '25 07:06 ghernadi

Hello and thanks for the answer! Unfortunately the suggested fix did not help. I have created the object but still get the same error

time="2025-06-02T14:31:23Z" level=info msg="Created VolumeSnapshot namespace/velero-test-pvc-xfs-wgpb6" backup=velero/ksy-20250602-02 cmd=/velero logSource="pkg/backup/actions/csi/pvc_action.go:187" pluginName=velero
time="2025-06-02T14:31:23Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot namespace/velero-test-pvc-xfs-wgpb6. Retrying in 5s" backup=velero/ksy-20250602-02 cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:712" pluginName=velero
time="2025-06-02T14:31:28Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-fe1ccd23-40fa-46f9-91ea-888955e6f96d to have snapshot handle. Retrying in 5s" backup=velero/ksy-20250602-02 cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2025-06-02T14:31:28Z" level=warning msg="VolumeSnapshotContent snapcontent-fe1ccd23-40fa-46f9-91ea-888955e6f96d has error: Failed to check and update snapshot content: failed to take snapshot of the volume pvc-e60b0f01-dfe4-4237-b4aa-b59aa70828b6: \"rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[6822F3BD-00000-002865]'\"" backup=velero/ksy-20250602-02 cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:744" pluginName=velero

 kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor error-reports show 6822F3BD-00000-002865
...
Category:                           LinStorException
Class name:                         DatabaseException
Class canonical name:               com.linbit.linstor.dbdrivers.DatabaseException
Generated at:                       Method 'createRollbackEntry', Source file 'ControllerK8sCrdRollbackMgr.java', Line #84

Error message:                      Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found

ErrorContext:

...
kubectl get secobjectprotection.internal.linstor.linbit.com/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 -o yaml 
apiVersion: internal.linstor.linbit.com/v1-15-0
kind: SecObjectProtection
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"internal.linstor.linbit.com/v1-15-0","kind":"SecObjectProtection","metadata":{"annotations":{},"name":"b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686"},"spec":{"creator_identity_name":"PUBLIC","object_path":"/keyvaluestores/CSI-BACKUP-MAPPING","owner_role_name":"PUBLIC","security_type_name":"PUBLIC"}}
  creationTimestamp: "2025-06-02T14:31:01Z"
  generation: 1
  name: b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686
  resourceVersion: "2065499381"
  uid: 5f038f8b-e159-447c-8151-db3c8aa56bd0
spec:
  creator_identity_name: PUBLIC
  object_path: /keyvaluestores/CSI-BACKUP-MAPPING
  owner_role_name: PUBLIC
  security_type_name: PUBLIC

ksyblast avatar Jun 02 '25 14:06 ksyblast

Sorry, I forgot that LINSTOR is internally caching a few things. That would at least explain the discrepancy between LINSTOR'S "not found" vs kubectl get ... "exists" status.

However, restarting the controller could lead to the problem that if the database is not consistent right now, the controller might not start properly. So either try to clone your k8s database to some playground project and try to start the controller there, or just send me a dump of your k8s database to my email address so I can verify the database consistency.

If you want to try it yourself, please make sure that the dummy controller cannot reach the actual satellites to avoid unnecessary reconnects and OFFLINE(Other controller) status messages from the actual controller.

If it turns out that there are other issues in the database, I guess you will need to send me the database anyways, since it can be quite cumbersome to figure out what is missing.

ghernadi avatar Jun 02 '25 14:06 ghernadi

Yes I'm also afraid to restart the controller. We will discuss it in our team and I will try what you suggested or send you the dump later on, thank you again for your help.

ksyblast avatar Jun 02 '25 15:06 ksyblast

Hi! The database was OK so we restarted the controller, the error disappeared but now there's a new error:

2025-06-05 05:55:59.231 [DeviceManager] ERROR LINSTOR/Satellite/2fcb4d SYSTEM - Generated resource file for resource 'pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19' is invalid. [Report number 68406749-5C801-000023]```
The external command sent the following error information:
/var/lib/linstor.d/pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19.res_tmp:103: in resource pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19
    ipv4:10.1.1.18:7023 is also used /var/lib/linstor.d/pvc-3279d999-4a6d-4c8e-bf6b-2536f091b8ab.res:71 (resource pvc-3279d999-4a6d-4c8e-bf6b-2536f091b8ab)
/var/lib/linstor.d/pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19.res_tmp:104: in resource pvc-918df6b7-f2ac-4edb-8a3c-0fd5fe8e5e19
    ipv4:10.1.1.15:7023 is also used /var/lib/linstor.d/pvc-3279d999-4a6d-4c8e-bf6b-2536f091b8ab.res:72 (resource pvc-3279d999-4a6d-4c8e-bf6b-2536f091b8ab)
command sh-nop exited with code 10

Both PVC-s were created during velero backup.

ksyblast avatar Jun 05 '25 16:06 ksyblast

Can you check if both PVCs still exist in linstor rd l and also if both really have the same TCP port? If so, I guess we will need to fix this in the DB, but I'd be curious how you ended up with duplicated TCP ports..

ghernadi avatar Jun 05 '25 20:06 ghernadi

We have already fixed it I hope.

I can't say exactly if there were both resources in 'r l' but definitely at least one of them was hanging in Connecting state. When the backup was still running I could see that these both PVCs were temporary PVCs created by velero. When the backup was finished the resource was still in Connecting state and there were 3 rd in the database not belonging to any PVC. The problematics PVCs were no longer listed in the API. We deleted the dangling rd-s manually, one of them was failing until we found and also deleted the corresponding PV which still was seen in K8s API. Now the number of rd-s and resources corresponds to the number of PVCs.

Maybe I will also recheck in the database if something related to them is still left.

P.S. Once or twice we already had this issue with a duplicate port after backups, we had to find all the corresponding objects in the db and delete them manually.

ksyblast avatar Jun 06 '25 07:06 ksyblast

Deleting entries from the database manually is always a bit risky since you can overlook some entries and end up with orphaned data. The issue here is not just cluttering the database but also new entries might accidentally relate to those entries, making it nearly impossible to figure out how one got into this state.

This is why I usually try to figure out if there is really the same port in the database itself or if we only have some orphaned .res file laying around and drbdadm finds that when checking the validity of .res files. In the latter case cleaning up the .res file should be preferred than trying to manipulate the database manually.

ghernadi avatar Jun 06 '25 08:06 ghernadi

Thank you @ghernadi . So could you please confirm that the most appropriate action in case we face the situation again is to delete the problematic .res file from the satellite? Could you please give some recommendations how we can trace the same port re-use scenario? I can collect the logs if needed.

The last backup partially failed with the error we have already faced: https://github.com/piraeusdatastore/piraeus-operator/issues/824

ksyblast avatar Jun 09 '25 07:06 ksyblast

Yes. To my understanding there are just 2 possibilities how you can end up in a duplicated TCP port scenario (or duplicated minor number).

  1. LINSTOR made a mistake and the same TCP port is actually in the database. You should be able to check this using rd l and check for duplicated ports (or in future this will be moved to r l --show-drbd-ports or something like that)
  2. There is some other .res file lying around on your satellite that LINSTOR does not know about (or maybe "no longer knows about"). When running any drbdadm command, drbdadm scans all available .res files and runs some uniqueness checks, for example for TCP ports, minor numbers, etc... That means drbdadm does indeed take .res files into account that are not known by LINSTOR, maybe not even managed by LINSTOR, so that drbdadm can complain about duplicated settings.

In the first case, there is not much you can do besides trying to delete the resource if the controller is still up and running. If the controller is already offline, it may not start properly due to the duplicated entry. In this case you will need to clean up the DB itself (which is quite cumbersome as mentioned earlier)

In the latter case, you are correct with your assumption. You can try restart the satellite (not the controller). When the satellite starts, it first deletes all .res files in /var/lib/linstor.d (unless LS_KEEP_RES is used). Once it receives the data from the controller, the satellite rewrites all needed .res files and checks if the DRBD resources need to be adjusted.

Now that I have mentioned this orphaned .res file scenario: In 1.31.1 we released a fix for a bug where if you had a DRBD resource manually downed (drbdadm down $rsc) and deleted the LINSTOR resource while the corresponding DRBD resource was down, LINSTOR did not properly delete the .res file. LINSTOR would in this case re-use the TCP port for the next created resource since the previous resource was deleted from the LINSTOR database. The still remaining orphaned .res file on the satellite could cause the issue described in 2.

So if it is possible that you had such an incident, you might want to upgrade LINSTOR to avoid this in future.

ghernadi avatar Jun 10 '25 06:06 ghernadi

Today we have faced the situation with the duplicate port again after velero backup. I can confirm the duplicate port was not in rd l. There was only one record with the mentioned port in rd l and it was some dangling velero PVC left after the backup, the PVC was no longer in the cluster, only the PV was left.

I checked the .res files at the satellites and there were some res_tmp files using the same port. PVC-s corresponding to these resources were no longer in the cluster. I made some combination of deleting the res_tmp resources, restarting the satellites and so on and finally the issue was resolved.

We are planing the upgrade to the version 1.31.1. I will let you know if we still face backup issues with velero afterwards.

ksyblast avatar Jun 10 '25 19:06 ksyblast

Glad to hear you could fix this on your own!

However, I am a bit surprised that a res_tmp should cause this issue. To my understanding drbdadm is explicitly only scanning *.res files, not *.res_tmp. The latter file is usually created by LINSTOR in order to check the generated file before replacing the corresponding .res file with it. For that LINSTOR runs commands like

drbdadm --config-to-test /var/lib/linstor.d/rsc.res_tmp --config-to-exclude /var/lib/linstor.d/rsc.res sh-nop

In other words, this "config-to-test" needs to explicitly ignore the "real" [...]/rsc.res file and needs to be given explicitly the [...]/rsc.res_tmp file to test. Otherwise I'd expect drbdadm to happily ignore all files that do not match *.res...

Which means, that my best guess would still be some orphaned .res file still occupying the TCP port - which should have been cleaned up by restarting the satellite.

We will make a few changes in the near future, some of the regarding the TCP port handling itself, others regarding the .res_tmp files, so this issue might be addressed in future releases even more (hopefully in a positive way 😁 )

ghernadi avatar Jun 10 '25 19:06 ghernadi

Hello, after going through #451, and "fixing" it by deleting the problematic linstor CRDs, I am now experiencing this exact same issue, down to ID:

ERROR REPORT 684A99A9-00000-005154

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.31.0
Build ID:                           a187af5c85a96bb27df87a5eab0bcf9dd6de6a34
Build time:                         2025-04-08T07:51:57+00:00
Error time:                         2025-06-28 08:11:59
Node:                               linstor-controller-7496884d58-2xthr
Thread:                             grizzly-http-server-24

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         ApiTransactionException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.ApiTransactionException
Generated at:                       Method 'commit', Source file 'CtrlTransactionHelper.java', Line #30

Error message:                      com.linbit.linstor.transaction.TransactionException: Error creating rollback entry

Call backtrace:

    Method                                   Native Class:Line number
    commit                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:30
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor232:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840

Caused by:
==========

Category:                           RuntimeException
Class name:                         TransactionException
Class canonical name:               com.linbit.linstor.transaction.TransactionException
Generated at:                       Method 'commit', Source file 'ControllerK8sCrdTransactionMgr.java', Line #157

Error message:                      Error creating rollback entry

ErrorContext: 

Call backtrace:

    Method                                   Native Class:Line number
    commit                                   N      com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:157
    commit                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor232:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840

Caused by:
==========

Category:                           LinStorException
Class name:                         DatabaseException
Class canonical name:               com.linbit.linstor.dbdrivers.DatabaseException
Generated at:                       Method 'createRollbackEntry', Source file 'ControllerK8sCrdRollbackMgr.java', Line #84

Error message:                      Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    createRollbackEntry                      N      com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr:84
    commit                                   N      com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:152
    commit                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor232:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840


END OF ERROR REPORT.


I wanted to delete the velero namespace and all backup data because I do not have time to deal with this and my prod can live without backups for a bit, but unfortunately I am seeing that k linstor s l returns all snapshots stuck in DELETING. I think I can solve this by deleting all snapshots and starting over... @ghernadi any idea how to do that ? Right now all I have is the error-reports above and all snapshots stuck in DELETING.

How do I get out of this in a clean way ? I just want to start over (without deleting any actual PVC data)

bernardgut avatar Jun 28 '25 08:06 bernardgut

Sorry for the late response. That error message do you get if you manually try to delete the snapshots?

If you are still getting the same issue as in the original post here, you should also be able to fix that as I mentioned in https://github.com/LINBIT/linstor-server/issues/454#issuecomment-2929114454. However, as also mentioned a few comments later, you will need to restart the controller for this to take effect. But (as also mentioned earlier) restarting the controller might end up in the controller not properly starting. Please read the comments above for more details.

ghernadi avatar Jul 07 '25 08:07 ghernadi

Hello, Sorry for the late response @ghernadi :

4 weeks ago I solved this by nuking the entire snapshot environment, velero namespace along with all snapshots and associated dependencies in linstor. Updated linstor/DRBD then re-installed latest velero and started observing.

Today the issue randomly popped up again :

  1. 1 minute after a scheduled velero backup starts, this pops up :
k linstor error-reports show 68688EE0-00000-000013                                                                                          
ERROR REPORT 68688EE0-00000-000013

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.31.0
Build ID:                           a187af5c85a96bb27df87a5eab0bcf9dd6de6a34
Build time:                         2025-04-08T07:51:57+00:00
Error time:                         2025-07-29 02:31:12
Node:                               linstor-controller-7496884d58-9tx9x
Thread:                             grizzly-http-server-43

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         ConcurrentModificationException
Class canonical name:               java.util.ConcurrentModificationException
Generated at:                       Method 'nextEntry', Source file 'TreeMap.java', Line #1486


Call backtrace:

    Method                                   Native Class:Line number
    nextEntry                                N      java.util.TreeMap$PrivateEntryIterator:1486
    next                                     N      java.util.TreeMap$EntryIterator:1522
    next                                     N      java.util.TreeMap$EntryIterator:1517
    hashCode                                 N      java.util.AbstractMap:525
    hashCode                                 N      com.linbit.linstor.transaction.TransactionMap:274
    hash                                     N      java.util.HashMap:338
    put                                      N      java.util.HashMap:610
    add                                      N      java.util.HashSet:221
    register                                 N      com.linbit.linstor.transaction.TransactionObjectCollection:29
    register                                 N      com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:138
    setConnection                            N      com.linbit.linstor.transaction.AbsTransactionObject:82
    activateTransMgr                         N      com.linbit.linstor.transaction.AbsTransactionObject:159
    cache                                    N      com.linbit.linstor.transaction.TransactionMap:195
    put                                      N      com.linbit.linstor.transaction.TransactionMap:138
    getInstance                              N      com.linbit.linstor.security.ObjectProtectionFactory:85
    create                                   N      com.linbit.linstor.core.objects.KeyValueStoreControllerFactory:64
    create                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:80
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:165
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor209:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840


END OF ERROR REPORT.

Then the velero backup hangs.

  1. The above is followed by a loop of the following, in k logs linstor-csi-controller:
time="2025-07-30T09:24:45Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[68688EE0-00000-000432]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-3e656054-cd45-4968-9b7f-f34d22e53d82\" name:\"snapshot-b0d85fbb-85a8-4eae-9690-9b02966c587a\"" resp="<nil>" version=v1.7.1-667d213c3653ff1262bc8d0b95be898fe47babeb

printing ad infinitum. These error reports themselves are basically a loop of the original issue as posted above :

k linstor error-reports show 68688EE0-00000-000428                                                                                        
ERROR REPORT 68688EE0-00000-000428

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.31.0
Build ID:                           a187af5c85a96bb27df87a5eab0bcf9dd6de6a34
Build time:                         2025-04-08T07:51:57+00:00
Error time:                         2025-07-30 09:04:45
Node:                               linstor-controller-7496884d58-9tx9x
Thread:                             grizzly-http-server-46

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         ApiTransactionException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.ApiTransactionException
Generated at:                       Method 'commit', Source file 'CtrlTransactionHelper.java', Line #30

Error message:                      com.linbit.linstor.transaction.TransactionException: Error creating rollback entry

Call backtrace:

    Method                                   Native Class:Line number
    commit                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:30
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor209:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840

Caused by:
==========

Category:                           RuntimeException
Class name:                         TransactionException
Class canonical name:               com.linbit.linstor.transaction.TransactionException
Generated at:                       Method 'commit', Source file 'ControllerK8sCrdTransactionMgr.java', Line #157

Error message:                      Error creating rollback entry

ErrorContext: 

Call backtrace:

    Method                                   Native Class:Line number
    commit                                   N      com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:157
    commit                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor209:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840

Caused by:
==========

Category:                           LinStorException
Class name:                         DatabaseException
Class canonical name:               com.linbit.linstor.dbdrivers.DatabaseException
Generated at:                       Method 'createRollbackEntry', Source file 'ControllerK8sCrdRollbackMgr.java', Line #84

Error message:                      Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    createRollbackEntry                      N      com.linbit.linstor.transaction.ControllerK8sCrdRollbackMgr:84
    commit                                   N      com.linbit.linstor.transaction.ControllerK8sCrdTransactionMgr:152
    commit                                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlTransactionHelper:26
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlKvsApiCallHandler:182
    modifyKvs                                N      com.linbit.linstor.core.apicallhandler.controller.CtrlApiCallHandler:1139
    lambda$modifyKVStore$2                   N      com.linbit.linstor.api.rest.v1.KeyValueStore:104
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:235
    modifyKVStore                            N      com.linbit.linstor.api.rest.v1.KeyValueStore:97
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor209:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:569
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:840


END OF ERROR REPORT.
  1. Finding the problematic volumesnapshot resource and describing it :
k describe volumesnapshot -A                                                                                                              
Name:         velero-sonarr-config-f6gvc
Namespace:    k8s-mediaserver
Labels:       app.kubernetes.io/managed-by=Helm
              velero.io/backup-name=velero-daily-20250729023013
Annotations:  <none>
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshot
Metadata:
  Creation Timestamp:  2025-07-30T08:31:13Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  Generate Name:     velero-sonarr-config-
  Generation:        1
  Resource Version:  224272726
  UID:               b0d85fbb-85a8-4eae-9690-9b02966c587a
Spec:
  Source:
    Persistent Volume Claim Name:  sonarr-config
  Volume Snapshot Class Name:      linstor-snapshot
Status:
  Bound Volume Snapshot Content Name:  snapcontent-b0d85fbb-85a8-4eae-9690-9b02966c587a
  Error:
    Message:     Failed to check and update snapshot content: failed to take snapshot of the volume pvc-3e656054-cd45-4968-9b7f-f34d22e53d82: "rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[68688EE0-00000-000438]'"
    Time:        2025-07-30T09:28:26Z
  Ready To Use:  false
Events:
  Type    Reason            Age   From                 Message
  ----    ------            ----  ----                 -------
  Normal  CreatingSnapshot  57m   snapshot-controller  Waiting for a snapshot k8s-mediaserver/velero-sonarr-config-f6gvc to be created by the CSI driver.
  1. checking for issues with the corresponding PVC (a non-replicated volume on a single node):
k linstor v l | grep d82                                                                                                                
| pvc-3e656054-cd45-4968-9b7f-f34d22e53d82 | n3   | hdd-thin             |     0 |    1015 | /dev/drbd1015 | 701.29 MiB | InUse  |   UpToDate | 
k exec -n piraeus-datastore daemonsets/linstor-satellite.n3 -c linstor-satellite -- xfs_repair -n /dev/drbd1015                            (⎈|sekops-omni-p0:default)
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_ifree 171, counted 173
sb_fdblocks 2404043, counted 2430293
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 1
Phase 6 - check inode connectivity...
        - traversing filesystem ...
No modify flag set, skipping phase 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
command terminated with exit code 1

seems something might be failing silently with this filesystem... the CSI and controller seem to mount it fine though. I scaled the workload to 0 then

k exec -n piraeus-datastore daemonsets/linstor-satellite.n3 -c linstor-satellite -- xfs_repair -L /dev/drbd1015                              (⎈|sekops-omni-p0:default)
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (385:45549) is ahead of log (1:2).
Format log to cycle 388.
done

But that didnt solve the issue

  1. Next I went and manually deleted the problematic volumesnapshot AND the snapshot linstor resource:
 k linstor s d pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8 snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725                                       (⎈|sekops-omni-p0:default)
SUCCESS:
    Marked snapshot for deletion snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' on nodes 'n3'
SUCCESS:
    (n3) Snapshot [ZFS-Thin] with name 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8', volume number 0 deleted.
SUCCESS:
    Deleted snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' on 'n3'
SUCCESS:
    Snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' on nodes 'n3' deleted.
SUCCESS:
Description:
    Snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' deleted.
Details:
    Snapshot 'snapshot-b44b65e0-9142-43fb-b4d7-1b5a6ef9a725' of resource 'pvc-ace261e2-d592-4ed9-9854-32f17dc5cab8' UUID was: 1a36fa9b-311e-4a42-b2a3-f33ff4dc0681

This resoved the issue for this part of the backup, The controller immediately went into the same error for the next volume to snapshot in the list :

k describe volumesnapshot -A                                                                                                               (⎈|sekops-omni-p0:default)
Name:         velero-mail-mail-0-47z48
Namespace:    mail
Labels:       app.kubernetes.io/instance=mail
              app.kubernetes.io/name=mail
              velero.io/backup-name=velero-daily-20250729023013
              velero.io/restore-name=web-20240909184447
              velero.io/volume-snapshot-name=velero-mail-mail-0-8vgkj
Annotations:  <none>
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshot
Metadata:
  Creation Timestamp:  2025-07-30T09:40:53Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  Generate Name:     velero-mail-mail-0-
  Generation:        1
  Resource Version:  224280384
  UID:               37247916-46f7-4cb9-8f0d-360df19d05e1
Spec:
  Source:
    Persistent Volume Claim Name:  mail-mail-0
  Volume Snapshot Class Name:      linstor-snapshot
Status:
  Bound Volume Snapshot Content Name:  snapcontent-37247916-46f7-4cb9-8f0d-360df19d05e1
  Error:
    Message:     Failed to check and update snapshot content: failed to take snapshot of the volume pvc-6f95699e-88a8-44d9-8c22-2250ac7179b6: "rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[68688EE0-00000-000451]'"
    Time:        2025-07-30T09:43:01Z
  Ready To Use:  false
Events:
  Type    Reason            Age    From                 Message
  ----    ------            ----   ----                 -------
  Normal  CreatingSnapshot  3m11s  snapshot-controller  Waiting for a snapshot mail/velero-mail-mail-0-47z48 to be created by the CSI driver.

Which will continue in a loop until the velero backup is nuked (And even then it might not be able to exit cleanly in which case you have to nuke the velero entirely) and start over. Once the velero backup is failed. The error-reports stop pouring in since the controller stopped the loop.

  1. Not sure if this is relevant : This happens after a node goes into a "lost quorum" loop, because one of the replicated volumes had some heavy writing overninght (>500GB). And this caused the volume to split brain. This volume is not included in the velero backups (no snapshot of this one), but since the volume is split brain, the entire node goes into a loop of lost quorum: unschedulable -> quorum recovered: schedulable. I suspect this issue is linked to the OP issue. Had to manually recover from that this morning before recovering from the failed snapshotter.

@ghernadi The problem here is that we dont want to have a quickfix, since this will trigger randomly and cascade into breaking the entire backup system. We need a fix. Somehing that prevents the snapshot controller from going into this failure state. Either by finding the root cause (quorum lost ? --> snapshotter fails) or by disabling the snapshotter until it is fixed (with some update to the documentation).

Hope this helps you figure out why this is happening. Thanks

bernardgut avatar Jul 30 '25 09:07 bernardgut

Sorry, I am not ignoring this issue, it rather gives me quite some headache..

I am quite sure that step 7 has nothing to do with this scenario.

Step 1 does show new information for me, true, but I'm afraid that is still not enough. To be honest, I am not even sure how the ErrorReport in step 1 can even happen, since it requires multiple threads to work on the same object but I do not see where this object could be reached from more than one thread... Also I do not see a connection when the error from step 1 should cause the error in step 2. All my attempts to forcefully reproduce step 1 (by manually injecting an error state at your given stack trace), I still fail to trigger step 2 which would be the major cause of the error of this whole scenario...

I will still try to reproduce and understand the two ErrorReports, but if you could setup a second cluster (I do not want to clutter your production of course!) where you can reproduce this issue, I'd like to ask you to do the following:

  • Turn on TRACE logging via linstor c set-log-level TRACE
  • Do whatever you need to do to trigger the error shown in step 2 (the first occurrence of Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found is already enough)
  • Turn TRACE logging off via linstor c set-log-level INFO
  • Run linstor sos dl

Afterwards either throw the cluster away if this was just a test-cluster or do your cleanup, and send me the resulting SOS report to the email address in my profile here.

I assume there are multiple things happening concurrently either in LINSTOR or outside (or both) which are simply not covered by the ErrorReports but should be covered by the TRACE logging. Also the SOS report should collect all ErrorReports so in case there are more relevant ErrorReports I should get those too.

ghernadi avatar Aug 04 '25 09:08 ghernadi