
[BUG] 1000 resource definitions for 48 resources?

Open bernardgut opened this issue 6 months ago • 9 comments

Hello.

Today I noticed that nobody can create new resources in prod anymore:

ERROR REPORT 684A99A9-00000-006790

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.31.0
Build ID:                           a187af5c85a96bb27df87a5eab0bcf9dd6de6a34
Build time:                         2025-04-08T07:51:57+00:00
Error time:                         2025-07-03 09:51:07
Node:                               linstor-controller-7496884d58-2xthr
Thread:                             MainWorkerPool-3
Access context information

Identity:                           PUBLIC
Role:                               PUBLIC
Domain:                             PUBLIC

Peer:                               RestClient(10.229.2.171; 'linstor-csi/v1.7.1-667d213c3653ff1262bc8d0b95be898fe47babeb')

============================================================

Reported error:
===============

Category:                           Exception
Class name:                         ExhaustedPoolException
Class canonical name:               com.linbit.ExhaustedPoolException
Generated at:                       Method 'uncheckedFind', Source file 'BitmapPool.java', Line #598


Error context:
        No TCP/IP port number could be allocated for the resource
Call backtrace:

    Method                                   Native Class:Line number
    uncheckedFind                            N      com.linbit.linstor.numberpool.BitmapPool:598
    autoAllocate                             N      com.linbit.linstor.numberpool.BitmapPool:541
    autoAllocate                             N      com.linbit.linstor.numberpool.DynamicNumberPoolImpl:132
    <init>                                   N      com.linbit.linstor.storage.data.adapter.drbd.DrbdRscDfnData:95
    createDrbdRscDfnData                     N      com.linbit.linstor.storage.utils.LayerDataFactory:218
    createRscDfnData                         N      com.linbit.linstor.layer.resource.RscDrbdLayerHelper:176
    createRscDfnData                         N      com.linbit.linstor.layer.resource.RscDrbdLayerHelper:85
    ensureResourceDefinitionExists           N      com.linbit.linstor.layer.resource.AbsRscLayerHelper:117
    createRscData                            N      com.linbit.linstor.layer.resource.RscDrbdLayerHelper:261
    createRscData                            N      com.linbit.linstor.layer.resource.RscDrbdLayerHelper:85
    ensureRscDataCreated                     N      com.linbit.linstor.layer.resource.AbsRscLayerHelper:219
    ensureDataRec                            N      com.linbit.linstor.layer.resource.CtrlRscLayerDataFactory:238
    ensureStackDataExists                    N      com.linbit.linstor.layer.resource.CtrlRscLayerDataFactory:172
    create                                   N      com.linbit.linstor.core.objects.ResourceControllerFactory:109
    createResource                           N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscCrtApiHelper:1004
    createResourceDb                         N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscCrtApiHelper:430
    createResources                          N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscAutoPlaceApiCallHandler:425
    autoPlaceInTransaction                   N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscAutoPlaceApiCallHandler:305
    lambda$autoPlace$0                       N      com.linbit.linstor.core.apicallhandler.controller.CtrlRscAutoPlaceApiCallHandler:142
    doInScope                                N      com.linbit.linstor.core.apicallhandler.ScopeRunner:175
    lambda$fluxInScope$0                     N      com.linbit.linstor.core.apicallhandler.ScopeRunner:100
    call                                     N      reactor.core.publisher.MonoCallable:72
    trySubscribeScalarMap                    N      reactor.core.publisher.FluxFlatMap:127
    subscribeOrReturn                        N      reactor.core.publisher.MonoFlatMapMany:49
    subscribe                                N      reactor.core.publisher.Flux:8759
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:195
    request                                  N      reactor.core.publisher.Operators$ScalarSubscription:2545
    onSubscribe                              N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:141
    subscribe                                N      reactor.core.publisher.MonoJust:55
    subscribe                                N      reactor.core.publisher.MonoDeferContextual:55
    subscribe                                N      reactor.core.publisher.Flux:8773
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:195
    onNext                                   N      reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber:129
    completePossiblyEmpty                    N      reactor.core.publisher.Operators$BaseFluxToMonoOperator:2071
    onComplete                               N      reactor.core.publisher.MonoCollect$CollectSubscriber:145
    onComplete                               N      reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber:549
    onComplete                               N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner:260
    checkTerminated                          N      reactor.core.publisher.FluxFlatMap$FlatMapMain:847
    drainLoop                                N      reactor.core.publisher.FluxFlatMap$FlatMapMain:609
    drain                                    N      reactor.core.publisher.FluxFlatMap$FlatMapMain:589
    onComplete                               N      reactor.core.publisher.FluxFlatMap$FlatMapMain:466
    checkTerminated                          N      reactor.core.publisher.FluxFlatMap$FlatMapMain:847
    drainLoop                                N      reactor.core.publisher.FluxFlatMap$FlatMapMain:609
    innerComplete                            N      reactor.core.publisher.FluxFlatMap$FlatMapMain:895
    onComplete                               N      reactor.core.publisher.FluxFlatMap$FlatMapInner:998
    onComplete                               N      reactor.core.publisher.FluxMap$MapSubscriber:144
    onComplete                               N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2205
    onComplete                               N      reactor.core.publisher.FluxUsing$UsingSubscriber:236
    onComplete                               N      reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber:85
    complete                                 N      reactor.core.publisher.FluxCreate$BaseSink:460
    drain                                    N      reactor.core.publisher.FluxCreate$BufferAsyncSink:805
    complete                                 N      reactor.core.publisher.FluxCreate$BufferAsyncSink:753
    drainLoop                                N      reactor.core.publisher.FluxCreate$SerializedFluxSink:247
    drain                                    N      reactor.core.publisher.FluxCreate$SerializedFluxSink:213
    complete                                 N      reactor.core.publisher.FluxCreate$SerializedFluxSink:204
    apiCallComplete                          N      com.linbit.linstor.netcom.TcpConnectorPeer:540
    handleComplete                           N      com.linbit.linstor.proto.CommonMessageProcessor:370
    handleDataMessage                        N      com.linbit.linstor.proto.CommonMessageProcessor:300
    doProcessInOrderMessage                  N      com.linbit.linstor.proto.CommonMessageProcessor:245
    lambda$doProcessMessage$4                N      com.linbit.linstor.proto.CommonMessageProcessor:230
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8773
    onNext                                   N      reactor.core.publisher.FluxFlatMap$FlatMapMain:427
    drainAsync                               N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:453
    drain                                    N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:724
    onNext                                   N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:256
    drainFused                               N      reactor.core.publisher.SinkManyUnicast:319
    drain                                    N      reactor.core.publisher.SinkManyUnicast:362
    tryEmitNext                              N      reactor.core.publisher.SinkManyUnicast:237
    tryEmitNext                              N      reactor.core.publisher.SinkManySerialized:100
    processInOrder                           N      com.linbit.linstor.netcom.TcpConnectorPeer:446
    doProcessMessage                         N      com.linbit.linstor.proto.CommonMessageProcessor:228
    lambda$processMessage$2                  N      com.linbit.linstor.proto.CommonMessageProcessor:165
    onNext                                   N      reactor.core.publisher.FluxPeek$PeekSubscriber:185
    runAsync                                 N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:440
    run                                      N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:527
    call                                     N      reactor.core.scheduler.WorkerTask:84
    call                                     N      reactor.core.scheduler.WorkerTask:37
    run                                      N      java.util.concurrent.FutureTask:264
    run                                      N      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
    runWorker                                N      java.util.concurrent.ThreadPoolExecutor:1136
    run                                      N      java.util.concurrent.ThreadPoolExecutor$Worker:635
    run                                      N      java.lang.Thread:840


END OF ERROR REPORT.

Upon investigation I noticed that k linstor r l returns 48 resources, while k linstor rd l returns 1000 resource definitions. Everything shows "ok" and "connected"....

What is up with that? Is that supposed to happen? How do I recover?

Thanks

Cheers.

bernardgut avatar Jul 03 '25 10:07 bernardgut

You are likely running into an issue caused by linstor-csi not doing proper cleanup in the currently shipped versions. Prior to LINSTOR CSI 1.8.1, it would "sometimes" forget to remove the RD (but not the resources). If this goes on long enough, you can exhaust the default TCP port pool.
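
As a stopgap (a hedged aside based on the LINSTOR user guide, not anything specific to this cluster): the DRBD TCP port range is a controller property, so you can check how much headroom is left and temporarily widen it while cleaning up. The range below is only an example:

k linstor controller list-properties | grep -i TcpPortAutoRange
k linstor controller set-property TcpPortAutoRange 7000-7999   # example range; removing the leftover RDs is the real fix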

For now, what you can do is:

  1. Get the list of PVs:
k get pv -oname | cut -d/ -f2 > pvs.txt
  2. Get the list of RDs with snapshots:
k linstor -m s l | jq -r '.[][].resource_name' > snapshot-rds.txt
  3. Get the list of RDs:
k linstor -m rd l | jq -r '.[][].name' > rds.txt
  4. Get the list of all RDs that are not referenced by either PVs or snapshots:
comm -13 <(cat pvs.txt snapshot-rds.txt | sort | uniq) <(cat rds.txt | sort)
  5. If the list looks "good", you can tell LINSTOR to delete the RDs:
comm -13 <(cat pvs.txt snapshot-rds.txt | sort | uniq) <(cat rds.txt | sort) | xargs k linstor rd d
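
The same thing as a small guarded sketch, identical to the commands above but collected into one pass so you can check the count before deleting (xargs -r just skips the delete if the list turns out empty):

comm -13 <(cat pvs.txt snapshot-rds.txt | sort | uniq) <(cat rds.txt | sort) > orphan-rds.txt
wc -l orphan-rds.txt   # sanity-check how many RDs would be removed
xargs -r k linstor rd d < orphan-rds.txt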

WanzenBug avatar Jul 03 '25 10:07 WanzenBug

Once we ship an update for the Operator with v1.8.1, we would like to automate this process to clean up any left-over RDs. We will have to think about the best way to automate it, since we don't want to accidentally delete any RD that might still be in use...

WanzenBug avatar Jul 03 '25 10:07 WanzenBug

Hello, thanks for the answer.

This deleted 956 RDs. I am now left with only the RDs associated with PVs. Thanks!

I really needed the help! It seems my cluster is melting down, because I am also in the middle of this: https://github.com/LINBIT/linstor-server/issues/451#issuecomment-3015126062

Basically, the TL;DR on the state of my cluster:

  • I am running Talos 1.10.4 (DRBD 9.2.13)
  • piraeus-operator [v2.8.1]
  • I first experienced a total failure of the controller after a reboot (apparently while Velero was running snapshots)
  • Then I recovered by manually "cleaning" the DB (deleting a bunch of CRDs). I thought I was good, but
  • My Velero install went completely AWOL, with some NullPointers and panics in the Velero logs related to volumes. I thought I should just delete the whole thing and start over, so I deleted the velero namespace along with all snapshots
  • Now I am stuck with 365 snapshots stuck in "deleting", and on top of that an error report that prints the same error every 5 minutes, identical down to the ID of the SecObjectProtection cited in the error.

When something goes wrong, everything just piles up, it seems... I am not sure all of these issues are linked, but it looks that way to me. I am also not sure this is the right place to ask, but since you managed to solve my latest issue in less than 5 minutes, maybe you can help me at least resolve the snapshots stuck in DELETING so I can reinstall Velero and have some backups again.

Anyway thanks for any help

bernardgut avatar Jul 03 '25 16:07 bernardgut

Not sure... maybe you can just copy over one of the other SecObjectProtection objects, change the name, and LINSTOR will continue with the deletion?
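
A rough sketch of what that could look like, assuming the controller uses the k8s backend and that its internal CRDs live in the internal.linstor.linbit.com API group (the group and resource name below are assumptions, and the spec fields inside the object may also need adjusting; check kubectl api-resources first):

kubectl api-resources | grep -i linstor
# export an existing object, strip server-set metadata, rename it to the missing ID, and re-apply
kubectl get secobjectprotections.internal.linstor.linbit.com <existing-name> -o json \
  | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp) | .metadata.name = "<missing-name>"' \
  | kubectl apply -f -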

WanzenBug avatar Jul 04 '25 06:07 WanzenBug

OK, so I spent quite some time cleaning up and debugging this. A few things came up that might be of note for the maintainers. Here is the TL;DR:

  • The SecObjectProtection error every 5 minutes came from a failed Velero snapshot (remember: all my issues with LINSTOR started when the node rebooted in the middle of a Velero backup). The controller would attempt to continue (?) the snapshot, then fail at the end with this issue.
  • I deleted all snapshots with k linstor s d ..., following your example above, and it worked! So k delete volumesnapshot ... did not work, but k linstor s d ... did (a sketch of the bulk form is below this list). Then I deleted the CRDs manually. Not sure if that completes the cleanup.
  • Interestingly, when I deleted the LINSTOR snapshot that was causing the issue (but before I deleted the CRD), the snapshot controller immediately, and successfully, recreated it. I waited to see if the issue would pop back up again, and sure enough, it did: Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found. You could also see it in the events of the VolumeSnapshotContent object when inspecting it.
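
For reference, a minimal sketch of that bulk snapshot deletion, reusing the machine-readable output and jq pattern from earlier in the thread (treating .name as the snapshot-name field is an assumption):

k linstor -m s l | jq -r '.[][] | "\(.resource_name) \(.name)"' | xargs -r -n2 k linstor s d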

This gives me the following intuition: there is a bug in the failover of the snapshot controller that prevents it from recovering from an interruption. This bug is the source of https://github.com/LINBIT/linstor-server/issues/451 and https://github.com/LINBIT/linstor-server/issues/454, and probably of some other Velero-related errors that your users submit here. In any case, I solved this ticket, so I will close it.

Should I open a separate ticket for this general bug?

Thanks.

bernardgut avatar Jul 04 '25 07:07 bernardgut

@WanzenBug this popped up again, at the same time as a Velero backup randomly started failing with https://github.com/LINBIT/linstor-server/issues/454.

These issues are probably related.

bernardgut avatar Jul 30 '25 12:07 bernardgut

I haven't seen this since the latest update.... Will close this for now.

bernardgut avatar Aug 11 '25 09:08 bernardgut

I have seen it again: dangling RDs, seemingly after failed Velero backups.

bernardgut avatar Oct 07 '25 17:10 bernardgut

Do you happen to know if these RDs are from the temporary volume clones created by Velero? If so, it could be a bug in LINSTOR CSI <1.9.0: the issue is that deletion sometimes takes so long that the LINSTOR automation that recreates "missing" replicas kicks in, and additional resources are created while the deletion is still in progress.

We fixed that in newer versions, so if you update to operator v2.9.1, we can rule out that issue.
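
To check which versions are actually running before and after the update (the namespace and deployment name below assume a default piraeus-operator install; adjust to your deployment):

kubectl get deploy -n piraeus-datastore piraeus-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'
kubectl get pods -n piraeus-datastore \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'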

WanzenBug avatar Oct 08 '25 08:10 WanzenBug