linstor-server
linstor-server copied to clipboard
Satellites unable to go ONLINE -- Access to deleted snapshot definition
None of my satellites are able to advance from CONNECTED to ONLINE. Every time I restart the satellite I get one of these errors:
root@pi4c:/# linstor err show 632947A1-00000-000506
ERROR REPORT 632947A1-00000-000506
============================================================
Application: LINBIT�� LINSTOR
Module: Controller
Version: 1.19.1
Build ID: a758bf07796c374fd2004465b0d8690209b74356
Build time: 2022-07-27T06:36:54+00:00
Error time: 2022-09-20 16:07:08
Node: piraeus-op-cs-controller-797956bfc7-6fw7k
============================================================
Reported error:
===============
Category: Error
Class name: ImplementationError
Class canonical name: com.linbit.ImplementationError
Generated at: Method 'execute', Source file 'TaskScheduleService.java', Line #347
Error message: Unhandled exception caught in com.linbit.linstor.tasks.TaskScheduleService
Error context:
This exception was generated in the service thread of the service 'TaskScheduleService'
Call backtrace:
Method Native Class:Line number
execute N com.linbit.linstor.tasks.TaskScheduleService:347
run N com.linbit.linstor.tasks.TaskScheduleService:282
run N java.lang.Thread:829
Caused by:
==========
Category: RuntimeException
Class name: ErrorCallbackNotImplemented
Class canonical name: reactor.core.Exceptions.ErrorCallbackNotImplemented
Generated at: <UNKNOWN>
Error message: com.linbit.linstor.AccessToDeletedDataException: Access to deleted snapshot definition
Call backtrace:
Method Native Class:Line number
Caused by:
==========
Category: RuntimeException
Class name: AccessToDeletedDataException
Class canonical name: com.linbit.linstor.AccessToDeletedDataException
Generated at: Method 'checkDeleted', Source file 'SnapshotDefinition.java', Line #313
Error message: Access to deleted snapshot definition
Asynchronous stage backtrace:
Error has been observed at the following site(s):
|_ checkpoint ? Abort all snapshot shipments to node
Stack trace:
Call backtrace:
Method Native Class:Line number
checkDeleted N com.linbit.linstor.core.objects.SnapshotDefinition:313
Suppressed exception 1 of 1:
===============
Category: RuntimeException
Class name: OnAssemblyException
Class canonical name: reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at: Method 'checkDeleted', Source file 'SnapshotDefinition.java', Line #313
Error message:
Error has been observed at the following site(s):
|_ checkpoint ��� Abort all snapshot shipments to node
Stack trace:
Call backtrace:
Method Native Class:Line number
checkDeleted N com.linbit.linstor.core.objects.SnapshotDefinition:313
getFlags N com.linbit.linstor.core.objects.SnapshotDefinition:257
abortAllShippingPrivilegedInTransaction N com.linbit.linstor.core.apicallhandler.controller.CtrlSnapshotShippingAbortHandler:114
lambda$abortAllShippingPrivileged$0 N com.linbit.linstor.core.apicallhandler.controller.CtrlSnapshotShippingAbortHandler:99
doInScope N com.linbit.linstor.core.apicallhandler.ScopeRunner:150
lambda$fluxInScope$0 N com.linbit.linstor.core.apicallhandler.ScopeRunner:76
call N reactor.core.publisher.MonoCallable:91
trySubscribeScalarMap N reactor.core.publisher.FluxFlatMap:126
subscribeOrReturn N reactor.core.publisher.MonoFlatMapMany:49
subscribe N reactor.core.publisher.Flux:8343
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
request N reactor.core.publisher.Operators$ScalarSubscription:2344
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
subscribe N reactor.core.publisher.MonoCurrentContext:35
subscribe N reactor.core.publisher.Flux:8357
subscribeWith N reactor.core.publisher.Flux:8526
subscribe N reactor.core.publisher.Flux:8327
subscribe N reactor.core.publisher.Flux:8254
subscribe N reactor.core.publisher.Flux:8172
add N com.linbit.linstor.tasks.ReconnectorTask:181
run N com.linbit.linstor.tasks.PingTask:90
execute N com.linbit.linstor.tasks.TaskScheduleService:340
run N com.linbit.linstor.tasks.TaskScheduleService:282
run N java.lang.Thread:829
END OF ERROR REPORT.
Restarting the controller should fix this issue.
Restarting the controller should fix this issue.
@ghernadi the controller is no longer able to start due to a NPE. The cluster is now inoperable.
root@piraeus-op-cs-controller-69c7b49546-cbd8h:/# cat /var/log/linstor-controller/Error*.log
ERROR REPORT 632B5E98-00000-000000
============================================================
Application: LINBIT? LINSTOR
Module: Controller
Version: 1.20.0-rc.1
Build ID: 095b9bef67d46f217ee394e0262c4e96baef0c45
Build time: 2022-09-20T12:44:59+00:00
Error time: 2022-09-21 18:58:18
Node: piraeus-op-cs-controller-69c7b49546-cbd8h
============================================================
Reported error:
===============
Category: RuntimeException
Class name: NullPointerException
Class canonical name: java.lang.NullPointerException
Generated at: Method '<init>', Source file 'Snapshot.java', Line #82
Call backtrace:
Method Native Class:Line number
<init> N com.linbit.linstor.core.objects.Snapshot:82
load N com.linbit.linstor.core.objects.SnapshotDbDriver:195
load N com.linbit.linstor.core.objects.SnapshotDbDriver:48
loadAll N com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:175
loadAll N com.linbit.linstor.dbdrivers.AbsDatabaseDriver:102
loadAll N com.linbit.linstor.dbdrivers.DatabaseLoader:541
loadCoreObjects N com.linbit.linstor.core.DbDataInitializer:176
initialize N com.linbit.linstor.core.DbDataInitializer:108
startSystemServices N com.linbit.linstor.core.ApplicationLifecycleManager:87
start N com.linbit.linstor.core.Controller:354
main N com.linbit.linstor.core.Controller:602
END OF ERROR REPORT.
ERROR REPORT 632B5E98-00000-000001
============================================================
Application: LINBIT? LINSTOR
Module: Controller
Version: 1.20.0-rc.1
Build ID: 095b9bef67d46f217ee394e0262c4e96baef0c45
Build time: 2022-09-20T12:44:59+00:00
Error time: 2022-09-21 18:58:18
Node: piraeus-op-cs-controller-69c7b49546-cbd8h
============================================================
Reported error:
===============
Description:
Unhandled exception
Category: LinStorException
Class name: SystemServiceStartException
Class canonical name: com.linbit.SystemServiceStartException
Generated at: Method 'startSystemServices', Source file 'ApplicationLifecycleManager.java', Line #103
Error message: Unhandled exception
Call backtrace:
Method Native Class:Line number
startSystemServices N com.linbit.linstor.core.ApplicationLifecycleManager:103
start N com.linbit.linstor.core.Controller:354
main N com.linbit.linstor.core.Controller:602
Caused by:
==========
Category: RuntimeException
Class name: NullPointerException
Class canonical name: java.lang.NullPointerException
Generated at: Method '<init>', Source file 'Snapshot.java', Line #82
Call backtrace:
Method Native Class:Line number
<init> N com.linbit.linstor.core.objects.Snapshot:82
load N com.linbit.linstor.core.objects.SnapshotDbDriver:195
load N com.linbit.linstor.core.objects.SnapshotDbDriver:48
loadAll N com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:175
loadAll N com.linbit.linstor.dbdrivers.AbsDatabaseDriver:102
loadAll N com.linbit.linstor.dbdrivers.DatabaseLoader:541
loadCoreObjects N com.linbit.linstor.core.DbDataInitializer:176
initialize N com.linbit.linstor.core.DbDataInitializer:108
startSystemServices N com.linbit.linstor.core.ApplicationLifecycleManager:87
start N com.linbit.linstor.core.Controller:354
main N com.linbit.linstor.core.Controller:602
END OF ERROR REPORT.
You have a Snapshot without a corresponding SnapshotDefinition... Any idea how you ended in this situation? Linstor usually deletes things in a cascading manner..
Feel free to send me a database dump (to the email from my GitHub profile) and let me see if I can delete the Snapshot(s) so we can get your controller up and running again.
You have a Snapshot without a corresponding SnapshotDefinition... Any idea how you ended in this situation? Linstor usually deletes things in a cascading manner..
@ghernadi I had some snapshots stuck in DELETING, perhaps the SnapshotDefinition was deleted but not the Snapshot? (This is using the k8s backend FWIW.) https://github.com/LINBIT/linstor-server/issues/317
Feel free to send me a database dump (to the email from my GitHub profile) and let me see if I can delete the Snapshot(s) so we can get your controller up and running again.
Thanks, let me try some cleanup first and I'll let you know how that goes.
@ghernadi I had some snapshots stuck in DELETING, perhaps the SnapshotDefinition was deleted but not the Snapshot? (This is using the k8s backend FWIW.) #317
I doubt. As I said, Linstor tries to delete objects cascading. The stacktrace also shows that Snapshot.delete(..)
was called, it tries to remove itself from its SnapshotDefinition
, which fails because of the (quite interesting) Error.
Deleting a SnapshotDefinition
throws actually an ImplementationError
when it still has Snapshots
. So it is really not that easy to lose a SnapshotDefinition
while keeping the Snapshot
. Usually my first guess is that someone tried to fix something in the database - especially since Snapshots
share the same "table" (or CRD in k8s) with Resources
internally...
I was able to get everything up and running again, but it required deleting everything (resources, volumedefinitions, etc.) with a snapshot_name
set. Not a big deal if k8s doesn't know about the snapshots, they still exist on disk. I wasn't able to find the specific snapshot that was the culprit, but I'll send you a dump of my CRDs if you have the desire or patience to poke through them.
So it is really not that easy to lose a SnapshotDefinition while keeping the Snapshot. Usually my first guess is that someone tried to fix something in the database - especially since Snapshots share the same "table" (or CRD in k8s) with Resources internally...
That "someone" could likely be the CSI driver -- I have other operators creating and destroying snapshots at will. It seems some sequence of steps like CSI requests snapshot deletion → Linstor server deletes it → CSI driver cleans up some but not all of the relevant CRDs... maybe due to an ungraceful exit?