linstor-server icon indicating copy to clipboard operation
linstor-server copied to clipboard

Satellites unable to go ONLINE -- Access to deleted snapshot definition

Open blampe opened this issue 2 years ago • 6 comments

None of my satellites are able to advance from CONNECTED to ONLINE. Every time I restart the satellite I get one of these errors:

root@pi4c:/# linstor err show 632947A1-00000-000506
ERROR REPORT 632947A1-00000-000506

============================================================

Application:                        LINBIT�� LINSTOR
Module:                             Controller
Version:                            1.19.1
Build ID:                           a758bf07796c374fd2004465b0d8690209b74356
Build time:                         2022-07-27T06:36:54+00:00
Error time:                         2022-09-20 16:07:08
Node:                               piraeus-op-cs-controller-797956bfc7-6fw7k

============================================================

Reported error:
===============

Category:                           Error
Class name:                         ImplementationError
Class canonical name:               com.linbit.ImplementationError
Generated at:                       Method 'execute', Source file 'TaskScheduleService.java', Line #347

Error message:                      Unhandled exception caught in com.linbit.linstor.tasks.TaskScheduleService

Error context:
    This exception was generated in the service thread of the service 'TaskScheduleService'

Call backtrace:

    Method                                   Native Class:Line number
    execute                                  N      com.linbit.linstor.tasks.TaskScheduleService:347
    run                                      N      com.linbit.linstor.tasks.TaskScheduleService:282
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           RuntimeException
Class name:                         ErrorCallbackNotImplemented
Class canonical name:               reactor.core.Exceptions.ErrorCallbackNotImplemented
Generated at:                       <UNKNOWN>

Error message:                      com.linbit.linstor.AccessToDeletedDataException: Access to deleted snapshot definition

Call backtrace:

    Method                                   Native Class:Line number

Caused by:
==========

Category:                           RuntimeException
Class name:                         AccessToDeletedDataException
Class canonical name:               com.linbit.linstor.AccessToDeletedDataException
Generated at:                       Method 'checkDeleted', Source file 'SnapshotDefinition.java', Line #313

Error message:                      Access to deleted snapshot definition

Asynchronous stage backtrace:

    Error has been observed at the following site(s):
    	|_ checkpoint ? Abort all snapshot shipments to node
    Stack trace:

Call backtrace:

    Method                                   Native Class:Line number
    checkDeleted                             N      com.linbit.linstor.core.objects.SnapshotDefinition:313

Suppressed exception 1 of 1:
===============
Category:                           RuntimeException
Class name:                         OnAssemblyException
Class canonical name:               reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at:                       Method 'checkDeleted', Source file 'SnapshotDefinition.java', Line #313

Error message:
Error has been observed at the following site(s):
	|_ checkpoint ��� Abort all snapshot shipments to node
Stack trace:

Call backtrace:

    Method                                   Native Class:Line number
    checkDeleted                             N      com.linbit.linstor.core.objects.SnapshotDefinition:313
    getFlags                                 N      com.linbit.linstor.core.objects.SnapshotDefinition:257
    abortAllShippingPrivilegedInTransaction  N      com.linbit.linstor.core.apicallhandler.controller.CtrlSnapshotShippingAbortHandler:114
    lambda$abortAllShippingPrivileged$0      N      com.linbit.linstor.core.apicallhandler.controller.CtrlSnapshotShippingAbortHandler:99
    doInScope                                N      com.linbit.linstor.core.apicallhandler.ScopeRunner:150
    lambda$fluxInScope$0                     N      com.linbit.linstor.core.apicallhandler.ScopeRunner:76
    call                                     N      reactor.core.publisher.MonoCallable:91
    trySubscribeScalarMap                    N      reactor.core.publisher.FluxFlatMap:126
    subscribeOrReturn                        N      reactor.core.publisher.MonoFlatMapMany:49
    subscribe                                N      reactor.core.publisher.Flux:8343
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
    request                                  N      reactor.core.publisher.Operators$ScalarSubscription:2344
    onSubscribe                              N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
    subscribe                                N      reactor.core.publisher.MonoCurrentContext:35
    subscribe                                N      reactor.core.publisher.Flux:8357
    subscribeWith                            N      reactor.core.publisher.Flux:8526
    subscribe                                N      reactor.core.publisher.Flux:8327
    subscribe                                N      reactor.core.publisher.Flux:8254
    subscribe                                N      reactor.core.publisher.Flux:8172
    add                                      N      com.linbit.linstor.tasks.ReconnectorTask:181
    run                                      N      com.linbit.linstor.tasks.PingTask:90
    execute                                  N      com.linbit.linstor.tasks.TaskScheduleService:340
    run                                      N      com.linbit.linstor.tasks.TaskScheduleService:282
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.

blampe avatar Sep 20 '22 16:09 blampe

Restarting the controller should fix this issue.

ghernadi avatar Sep 21 '22 06:09 ghernadi

Restarting the controller should fix this issue.

@ghernadi the controller is no longer able to start due to a NPE. The cluster is now inoperable.

root@piraeus-op-cs-controller-69c7b49546-cbd8h:/# cat /var/log/linstor-controller/Error*.log
ERROR REPORT 632B5E98-00000-000000

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.20.0-rc.1
Build ID:                           095b9bef67d46f217ee394e0262c4e96baef0c45
Build time:                         2022-09-20T12:44:59+00:00
Error time:                         2022-09-21 18:58:18
Node:                               piraeus-op-cs-controller-69c7b49546-cbd8h

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method '<init>', Source file 'Snapshot.java', Line #82


Call backtrace:

    Method                                   Native Class:Line number
    <init>                                   N      com.linbit.linstor.core.objects.Snapshot:82
    load                                     N      com.linbit.linstor.core.objects.SnapshotDbDriver:195
    load                                     N      com.linbit.linstor.core.objects.SnapshotDbDriver:48
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:175
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:102
    loadAll                                  N      com.linbit.linstor.dbdrivers.DatabaseLoader:541
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:176
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:108
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:354
    main                                     N      com.linbit.linstor.core.Controller:602


END OF ERROR REPORT.
ERROR REPORT 632B5E98-00000-000001

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.20.0-rc.1
Build ID:                           095b9bef67d46f217ee394e0262c4e96baef0c45
Build time:                         2022-09-20T12:44:59+00:00
Error time:                         2022-09-21 18:58:18
Node:                               piraeus-op-cs-controller-69c7b49546-cbd8h

============================================================

Reported error:
===============

Description:
    Unhandled exception

Category:                           LinStorException
Class name:                         SystemServiceStartException
Class canonical name:               com.linbit.SystemServiceStartException
Generated at:                       Method 'startSystemServices', Source file 'ApplicationLifecycleManager.java', Line #103

Error message:                      Unhandled exception

Call backtrace:

    Method                                   Native Class:Line number
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:103
    start                                    N      com.linbit.linstor.core.Controller:354
    main                                     N      com.linbit.linstor.core.Controller:602

Caused by:
==========

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method '<init>', Source file 'Snapshot.java', Line #82


Call backtrace:

    Method                                   Native Class:Line number
    <init>                                   N      com.linbit.linstor.core.objects.Snapshot:82
    load                                     N      com.linbit.linstor.core.objects.SnapshotDbDriver:195
    load                                     N      com.linbit.linstor.core.objects.SnapshotDbDriver:48
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:175
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:102
    loadAll                                  N      com.linbit.linstor.dbdrivers.DatabaseLoader:541
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:176
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:108
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:354
    main                                     N      com.linbit.linstor.core.Controller:602


END OF ERROR REPORT.

blampe avatar Sep 21 '22 23:09 blampe

You have a Snapshot without a corresponding SnapshotDefinition... Any idea how you ended in this situation? Linstor usually deletes things in a cascading manner..

Feel free to send me a database dump (to the email from my GitHub profile) and let me see if I can delete the Snapshot(s) so we can get your controller up and running again.

ghernadi avatar Sep 22 '22 04:09 ghernadi

You have a Snapshot without a corresponding SnapshotDefinition... Any idea how you ended in this situation? Linstor usually deletes things in a cascading manner..

@ghernadi I had some snapshots stuck in DELETING, perhaps the SnapshotDefinition was deleted but not the Snapshot? (This is using the k8s backend FWIW.) https://github.com/LINBIT/linstor-server/issues/317

Feel free to send me a database dump (to the email from my GitHub profile) and let me see if I can delete the Snapshot(s) so we can get your controller up and running again.

Thanks, let me try some cleanup first and I'll let you know how that goes.

blampe avatar Sep 22 '22 15:09 blampe

@ghernadi I had some snapshots stuck in DELETING, perhaps the SnapshotDefinition was deleted but not the Snapshot? (This is using the k8s backend FWIW.) #317

I doubt. As I said, Linstor tries to delete objects cascading. The stacktrace also shows that Snapshot.delete(..) was called, it tries to remove itself from its SnapshotDefinition, which fails because of the (quite interesting) Error. Deleting a SnapshotDefinition throws actually an ImplementationError when it still has Snapshots. So it is really not that easy to lose a SnapshotDefinition while keeping the Snapshot. Usually my first guess is that someone tried to fix something in the database - especially since Snapshots share the same "table" (or CRD in k8s) with Resources internally...

ghernadi avatar Sep 22 '22 20:09 ghernadi

I was able to get everything up and running again, but it required deleting everything (resources, volumedefinitions, etc.) with a snapshot_name set. Not a big deal if k8s doesn't know about the snapshots, they still exist on disk. I wasn't able to find the specific snapshot that was the culprit, but I'll send you a dump of my CRDs if you have the desire or patience to poke through them.

So it is really not that easy to lose a SnapshotDefinition while keeping the Snapshot. Usually my first guess is that someone tried to fix something in the database - especially since Snapshots share the same "table" (or CRD in k8s) with Resources internally...

That "someone" could likely be the CSI driver -- I have other operators creating and destroying snapshots at will. It seems some sequence of steps like CSI requests snapshot deletion → Linstor server deletes it → CSI driver cleans up some but not all of the relevant CRDs... maybe due to an ungraceful exit?

blampe avatar Sep 23 '22 05:09 blampe