[BUG] 1000 resource definitions for 48 resources?
Hello.
Today I noticed that nobody can create new resources in prod anymore:
ERROR REPORT 684A99A9-00000-006790
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.31.0
Build ID: a187af5c85a96bb27df87a5eab0bcf9dd6de6a34
Build time: 2025-04-08T07:51:57+00:00
Error time: 2025-07-03 09:51:07
Node: linstor-controller-7496884d58-2xthr
Thread: MainWorkerPool-3
Access context information
Identity: PUBLIC
Role: PUBLIC
Domain: PUBLIC
Peer: RestClient(10.229.2.171; 'linstor-csi/v1.7.1-667d213c3653ff1262bc8d0b95be898fe47babeb')
============================================================
Reported error:
===============
Category: Exception
Class name: ExhaustedPoolException
Class canonical name: com.linbit.ExhaustedPoolException
Generated at: Method 'uncheckedFind', Source file 'BitmapPool.java', Line #598
Error context:
No TCP/IP port number could be allocated for the resource
Call backtrace:
Method Native Class:Line number
uncheckedFind N com.linbit.linstor.numberpool.BitmapPool:598
autoAllocate N com.linbit.linstor.numberpool.BitmapPool:541
autoAllocate N com.linbit.linstor.numberpool.DynamicNumberPoolImpl:132
<init> N com.linbit.linstor.storage.data.adapter.drbd.DrbdRscDfnData:95
createDrbdRscDfnData N com.linbit.linstor.storage.utils.LayerDataFactory:218
createRscDfnData N com.linbit.linstor.layer.resource.RscDrbdLayerHelper:176
createRscDfnData N com.linbit.linstor.layer.resource.RscDrbdLayerHelper:85
ensureResourceDefinitionExists N com.linbit.linstor.layer.resource.AbsRscLayerHelper:117
createRscData N com.linbit.linstor.layer.resource.RscDrbdLayerHelper:261
createRscData N com.linbit.linstor.layer.resource.RscDrbdLayerHelper:85
ensureRscDataCreated N com.linbit.linstor.layer.resource.AbsRscLayerHelper:219
ensureDataRec N com.linbit.linstor.layer.resource.CtrlRscLayerDataFactory:238
ensureStackDataExists N com.linbit.linstor.layer.resource.CtrlRscLayerDataFactory:172
create N com.linbit.linstor.core.objects.ResourceControllerFactory:109
createResource N com.linbit.linstor.core.apicallhandler.controller.CtrlRscCrtApiHelper:1004
createResourceDb N com.linbit.linstor.core.apicallhandler.controller.CtrlRscCrtApiHelper:430
createResources N com.linbit.linstor.core.apicallhandler.controller.CtrlRscAutoPlaceApiCallHandler:425
autoPlaceInTransaction N com.linbit.linstor.core.apicallhandler.controller.CtrlRscAutoPlaceApiCallHandler:305
lambda$autoPlace$0 N com.linbit.linstor.core.apicallhandler.controller.CtrlRscAutoPlaceApiCallHandler:142
doInScope N com.linbit.linstor.core.apicallhandler.ScopeRunner:175
lambda$fluxInScope$0 N com.linbit.linstor.core.apicallhandler.ScopeRunner:100
call N reactor.core.publisher.MonoCallable:72
trySubscribeScalarMap N reactor.core.publisher.FluxFlatMap:127
subscribeOrReturn N reactor.core.publisher.MonoFlatMapMany:49
subscribe N reactor.core.publisher.Flux:8759
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:195
request N reactor.core.publisher.Operators$ScalarSubscription:2545
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:141
subscribe N reactor.core.publisher.MonoJust:55
subscribe N reactor.core.publisher.MonoDeferContextual:55
subscribe N reactor.core.publisher.Flux:8773
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:195
onNext N reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber:129
completePossiblyEmpty N reactor.core.publisher.Operators$BaseFluxToMonoOperator:2071
onComplete N reactor.core.publisher.MonoCollect$CollectSubscriber:145
onComplete N reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber:549
onComplete N reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner:260
checkTerminated N reactor.core.publisher.FluxFlatMap$FlatMapMain:847
drainLoop N reactor.core.publisher.FluxFlatMap$FlatMapMain:609
drain N reactor.core.publisher.FluxFlatMap$FlatMapMain:589
onComplete N reactor.core.publisher.FluxFlatMap$FlatMapMain:466
checkTerminated N reactor.core.publisher.FluxFlatMap$FlatMapMain:847
drainLoop N reactor.core.publisher.FluxFlatMap$FlatMapMain:609
innerComplete N reactor.core.publisher.FluxFlatMap$FlatMapMain:895
onComplete N reactor.core.publisher.FluxFlatMap$FlatMapInner:998
onComplete N reactor.core.publisher.FluxMap$MapSubscriber:144
onComplete N reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2205
onComplete N reactor.core.publisher.FluxUsing$UsingSubscriber:236
onComplete N reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber:85
complete N reactor.core.publisher.FluxCreate$BaseSink:460
drain N reactor.core.publisher.FluxCreate$BufferAsyncSink:805
complete N reactor.core.publisher.FluxCreate$BufferAsyncSink:753
drainLoop N reactor.core.publisher.FluxCreate$SerializedFluxSink:247
drain N reactor.core.publisher.FluxCreate$SerializedFluxSink:213
complete N reactor.core.publisher.FluxCreate$SerializedFluxSink:204
apiCallComplete N com.linbit.linstor.netcom.TcpConnectorPeer:540
handleComplete N com.linbit.linstor.proto.CommonMessageProcessor:370
handleDataMessage N com.linbit.linstor.proto.CommonMessageProcessor:300
doProcessInOrderMessage N com.linbit.linstor.proto.CommonMessageProcessor:245
lambda$doProcessMessage$4 N com.linbit.linstor.proto.CommonMessageProcessor:230
subscribe N reactor.core.publisher.FluxDefer:46
subscribe N reactor.core.publisher.Flux:8773
onNext N reactor.core.publisher.FluxFlatMap$FlatMapMain:427
drainAsync N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:453
drain N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:724
onNext N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:256
drainFused N reactor.core.publisher.SinkManyUnicast:319
drain N reactor.core.publisher.SinkManyUnicast:362
tryEmitNext N reactor.core.publisher.SinkManyUnicast:237
tryEmitNext N reactor.core.publisher.SinkManySerialized:100
processInOrder N com.linbit.linstor.netcom.TcpConnectorPeer:446
doProcessMessage N com.linbit.linstor.proto.CommonMessageProcessor:228
lambda$processMessage$2 N com.linbit.linstor.proto.CommonMessageProcessor:165
onNext N reactor.core.publisher.FluxPeek$PeekSubscriber:185
runAsync N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:440
run N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:527
call N reactor.core.scheduler.WorkerTask:84
call N reactor.core.scheduler.WorkerTask:37
run N java.util.concurrent.FutureTask:264
run N java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
runWorker N java.util.concurrent.ThreadPoolExecutor:1136
run N java.util.concurrent.ThreadPoolExecutor$Worker:635
run N java.lang.Thread:840
END OF ERROR REPORT.
Upon investigation I noticed that `k linstor r l` returns 48 resources, while `k linstor rd l` returns 1000 resource definitions. Everything shows "ok" and "connected"...
What is up with that? Is that supposed to happen? How do I recover?
Thanks
Cheers.
You are likely running into an issue caused by linstor-csi not doing proper cleanup. Prior to LINSTOR CSI 1.8.1, it would "sometimes" forget to remove the RD (but not the resources). If this goes on long enough, you can exhaust the default TCP port pool.
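A quick way to check how close you are to the limit, as a rough sketch (assuming the same `k linstor -m` JSON layout as the jq filters below, and the default DRBD auto-port range of 7000-7999, i.e. 1000 ports):

```sh
# Each DRBD resource definition reserves one TCP port from the auto-allocation range,
# so the RD count should roughly match the number of ports in use.
echo "resource definitions: $(k linstor -m rd l | jq '.[0] | length')"
echo "resources:            $(k linstor -m r l | jq '.[0] | length')"

# The configured range can be inspected on the controller; the property name
# TcpPortAutoRange is what I would expect here, but double-check on your cluster
# (only explicitly set properties are listed).
k linstor controller list-properties | grep -i tcpport
```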
For now, here is what you can do (a consolidated sketch follows the steps):
- Get the list of PVs:
  `k get pv -oname | cut -d/ -f2 > pvs.txt`
- Get the list of RDs referenced by snapshots:
  `k linstor -m s l | jq -r '.[][].resource_name' > snapshot-rds.txt`
- Get the list of RDs:
  `k linstor -m rd l | jq -r '.[][].name' > rds.txt`
- Get the list of all RDs that are not referenced by either PVs or snapshots:
  `comm -13 <(cat pvs.txt snapshot-rds.txt | sort | uniq) <(cat rds.txt | sort)`
- If the list looks good, you can tell LINSTOR to delete those RDs:
  `comm -13 <(cat pvs.txt snapshot-rds.txt | sort | uniq) <(cat rds.txt | sort) | xargs k linstor rd d`
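Put together, a minimal sketch of the whole check and cleanup (spelling out `kubectl` for the `k` alias assumed above; review the resulting list before deleting anything):

```sh
#!/usr/bin/env bash
set -euo pipefail

# Collect names that are still in use: PV names and RDs referenced by snapshots
kubectl get pv -oname | cut -d/ -f2 > pvs.txt
kubectl linstor -m s l | jq -r '.[][].resource_name' > snapshot-rds.txt

# All resource definitions known to LINSTOR
kubectl linstor -m rd l | jq -r '.[][].name' > rds.txt

# RDs that are neither backed by a PV nor referenced by a snapshot
comm -13 <(sort -u pvs.txt snapshot-rds.txt) <(sort -u rds.txt) > orphaned-rds.txt

cat orphaned-rds.txt                                # inspect this list first!
# xargs kubectl linstor rd d < orphaned-rds.txt     # then uncomment to delete
```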
Once we ship an update for the Operator with v1.8.1, we would like to automate this process to clean up any left-over RDs. We will have to think about the best way to automate this; we don't want to accidentally delete any RD that might still be in use...
Hello, thanks for the answer.
This deleted 956 RDs. I am left with only the RDs associated with PVs, thanks!
I really needed help! It seems my cluster is melting, because I am also in the middle of this: https://github.com/LINBIT/linstor-server/issues/451#issuecomment-3015126062
Basically, the TL;DR on the state of my cluster:
- I am running Talos 1.10.4 (DRBD 9.2.13)
- piraeus-operator v2.8.1
- I first experienced a total failure of the controller after rebooting (apparently while velero was running snapshots)
- Then I recovered by manually "cleaning" the DB (deleting a bunch of CRDs). I thought I was good, but...
- ...my Velero then went completely AWOL. With some NullPointers and panics in the Velero logs related to volumes, I thought I should just delete the whole thing and start over, so I deleted the velero namespace with all snapshots.
- Now I am stuck with 365 snapshots in "Deleting", and on top of that an error report that prints this error every 5 min. Exactly the same, down to the ID of the `SecObjectProtection` cited in the error.
When something goes wrong, everything just piles up, it seems... I am not sure all of these are linked, but it seems so to me. I am also not sure this is the right place to ask, but since you managed to solve my latest issue in less than 5 minutes, maybe you can help me at least fix the snapshots stuck in DELETING, so I can reinstall Velero and at least have some backups again.
Anyway, thanks for any help.
Not sure... maybe you can just copy over one of the other `SecObjectProtection` objects, change the name, and LINSTOR will continue with the deletion?
Ok, so I spent quite some time cleaning up and debugging this. A few things came up that might be of note for the maintainers. Here is the TL;DR:
- The `SecObjectProtection` error every 5 min came from a failed Velero snapshot (remember: all my issues with LINSTOR started when the node rebooted in the middle of a Velero backup). The controller would attempt to continue (?) the snapshot, then at the end fail with this issue.
- I deleted all snapshots with `k linstor s d ...` following your example above, and it worked! So `k delete volumesnapshot ...` did not work, but `k linstor s d ...` does (a rough sketch of what I ran is after this list). Then I deleted the CRDs manually. Not sure if that completes the cleanup.
- Interestingly, when I deleted the LINSTOR snapshot that was causing the issue (but before I deleted the CRD), the snapshot controller immediately, and successfully, recreated it. I waited to see if the issue would pop back up again, and sure enough, it did: `Resource SecObjectProtection/b6a849d48698fe9940ba0d46833827435bd8c66911bf4735b623d1d0dfa74686 not found`. You could also see it in the events of the `volumesnapshotcontent` object when inspecting it.
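Roughly what I ran, for reference (the snapshot `name` field and the `-m` JSON layout are assumptions based on the jq filters earlier in the thread, so double-check the output before piping it into the delete):

```sh
# List all snapshots as "<resource> <snapshot>" pairs and save them for review
k linstor -m s l | jq -r '.[][] | "\(.resource_name) \(.name)"' > snapshots.txt

# linstor snapshot delete expects the resource name followed by the snapshot name,
# so feed the pairs two arguments at a time
xargs -n2 k linstor s d < snapshots.txt
```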
This gives me the following intuition: there is a bug in the failover of the snapshot controller that prevents it from recovering from an interruption. This bug is the source of https://github.com/LINBIT/linstor-server/issues/451 and https://github.com/LINBIT/linstor-server/issues/454, and probably of some other Velero-related errors that you see your users submitting here. In any case, I solved this ticket, so I will close it.
Should I open a separate ticket for this general bug?
Thanks.
@WanzenBug this popped up again, at the same time as a Velero backup randomly started failing with https://github.com/LINBIT/linstor-server/issues/454.
These issues are probably related.
I haven't seen this since the latest update... Will close this for now.
Have seen it again: dangling RDs, seemingly after failed Velero backups.
Do you happen to know if these RDs are from the temporary volume clones created by Velero? If so, it could be a bug in LINSTOR CSI <1.9.0: sometimes the deletion takes so long that the LINSTOR automation for recreating "missing" replicas kicks in, and additional resources are created while the deletion is still happening.
We fixed that in newer versions, so if you update to operator 2.9.1, we can rule out that issue.