piraeus-operator
Storage pool can not be deleted...
Since the update to 1.7.0, the linstor-controller logs the error `The specified storage pool 'lvm-thin' on node 'host9' can not be deleted as volumes / snapshot-volumes are still using it.` every few seconds. Even when all containers with DRBD resources are scaled down and no DRBD resource is mounted on any cluster node, the message doesn't disappear.
Is this a bug in the upgrade routine, or is it something we have to resolve manually?
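For reference, the message repeats when reading the controller log, roughly like this (namespace is a placeholder, deployment name as in the error report below):
```
kubectl logs -n <namespace> deployment/piraeus-op-cs-controller | grep "can not be deleted"
```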
Here is the error report:
```
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.17.0
Build ID: 7e646d83dbbadf1ec066e1bc8b29ae018aff1f66
Build time: 2021-12-09T07:27:52+00:00
Error time: 2022-02-01 18:08:56
Node: piraeus-op-cs-controller-6f7f457db-rs2q5
Peer: RestClient(10.42.1.105; 'Go-http-client/1.1')
============================================================
Reported error:
===============
Category: RuntimeException
Class name: ApiRcException
Class canonical name: com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at: Method 'deleteStorPoolInTransaction', Source file 'CtrlStorPoolApiCallHandler.java', Line #301
Error message: The specified storage pool 'lvm-thin' on node 'host9' can not be deleted as volumes / snapshot-volumes are still using it.
Error context:
The specified storage pool 'lvm-thin' on node 'host9' can not be deleted as volumes / snapshot-volumes are still using it.
Call backtrace:
Method Native Class:Line number
deleteStorPoolInTransaction N com.linbit.linstor.core.apicallhandler.controller.CtrlStorPoolApiCallHandler:301
lambda$deleteStorPool$2 N com.linbit.linstor.core.apicallhandler.controller.CtrlStorPoolApiCallHandler:213
doInScope N com.linbit.linstor.core.apicallhandler.ScopeRunner:147
lambda$fluxInScope$0 N com.linbit.linstor.core.apicallhandler.ScopeRunner:75
call N reactor.core.publisher.MonoCallable:91
trySubscribeScalarMap N reactor.core.publisher.FluxFlatMap:126
subscribeOrReturn N reactor.core.publisher.MonoFlatMapMany:49
subscribe N reactor.core.publisher.Flux:8343
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
request N reactor.core.publisher.Operators$ScalarSubscription:2344
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
subscribe N reactor.core.publisher.MonoCurrentContext:35
subscribe N reactor.core.publisher.Flux:8357
onNext N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
request N reactor.core.publisher.Operators$ScalarSubscription:2344
onSubscribe N reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
subscribe N reactor.core.publisher.MonoCurrentContext:35
subscribe N reactor.core.publisher.Mono:4252
subscribeWith N reactor.core.publisher.Mono:4363
subscribe N reactor.core.publisher.Mono:4223
subscribe N reactor.core.publisher.Mono:4159
subscribe N reactor.core.publisher.Mono:4131
doFlux N com.linbit.linstor.api.rest.v1.RequestHelper:304
deleteStorPool N com.linbit.linstor.api.rest.v1.StoragePools:330
invoke N jdk.internal.reflect.GeneratedMethodAccessor26:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:566
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:124
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:167
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker:159
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:79
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:469
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:391
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:80
run N org.glassfish.jersey.server.ServerRuntime$1:253
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:232
handle N org.glassfish.jersey.server.ApplicationHandler:680
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:200
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:569
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:549
run N java.lang.Thread:829
END OF ERROR REPORT.
```
My question is: why should the storage pool be deleted at all? The only way this is triggered is if `operator.satelliteSet.storagePools` was modified (entries removed).
There are no additional checks in the operator, so it just tries to delete the storage pool, even if resources or snapshots are still present. Note that this also includes simple replicas, even if the DRBD device is not actively mounted.
So my advice would be:
- Check that the `LinstorSatelliteSet` resource has the expected storage pools (see the example after this list).
- If not: edit it to have all expected storage pools.
- If the removal was intentional: you have to manually delete those resources on the node; there is no automatic removal by the operator.
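For example, a check along these lines (the resource name `piraeus-op-ns` is taken from the reporter's later comment; adjust it to your deployment):
```
# Show the storage pools currently declared in the LinstorSatelliteSet spec
kubectl get LinstorSatelliteSet.piraeus.linbit.com piraeus-op-ns -o jsonpath='{.spec.storagePools}'
```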
Thanks @WanzenBug for the fast reply. Actually, neither the storage pools nor the list of nodes have changed, and these errors occur on almost all of our clusters that have been updated to Piraeus 1.7.
Here are screenshots from the linstor cmd output:

The nodes and storage pools look good, and the storage pool is also referenced correctly by the resources.
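(For readers without the screenshots: roughly equivalent output can be pulled from the linstor client, for example via the controller deployment — a sketch, assuming the linstor CLI is available in the controller image and using the deployment name from the error report above:)
```
kubectl exec -n <namespace> deployment/piraeus-op-cs-controller -- linstor node list
kubectl exec -n <namespace> deployment/piraeus-op-cs-controller -- linstor storage-pool list
kubectl exec -n <namespace> deployment/piraeus-op-cs-controller -- linstor resource list
```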
I've checked the output of `kubectl get LinstorSatelliteSet.piraeus.linbit.com piraeus-op-ns`. Could it be a problem that the storage pools haven't been declared in the values file for the new CRD configuration? In the output, `storagePools` is empty (because we didn't declare it on the helm upgrade), but the SatelliteStatus of course lists the storage pools that were created on the first LINSTOR deployment on the cluster:
```yaml
sslSecret: null
storagePools:
lvmPools: []
lvmThinPools: []
zfsPools: []
tolerations: []
status:
SatelliteStatuses:
- connectionStatus: ONLINE
nodeName: de-fra-node10
registeredOnController: true
storagePoolStatus:
- freeCapacity: 9223372036854775807
name: DfltDisklessStorPool
nodeName: de-fra-node10
provider: DISKLESS
totalCapacity: 9223372036854775807
- freeCapacity: 1132126535
name: lvm-thin
nodeName: de-fra-node10
provider: LVM_THIN
totalCapacity: 1677721600
- connectionStatus: ONLINE
nodeName: de-fra-node8
registeredOnController: true
storagePoolStatus:
- freeCapacity: 9223372036854775807
name: DfltDisklessStorPool
nodeName: de-fra-node8
provider: DISKLESS
totalCapacity: 9223372036854775807
- freeCapacity: 1132126535
name: lvm-thin
nodeName: de-fra-node8
provider: LVM_THIN
totalCapacity: 1677721600
- connectionStatus: ONLINE
nodeName: de-fra-node9
registeredOnController: true
storagePoolStatus:
- freeCapacity: 9223372036854775807
name: DfltDisklessStorPool
nodeName: de-fra-node9
provider: DISKLESS
totalCapacity: 9223372036854775807
- freeCapacity: 1132126535
name: lvm-thin
nodeName: de-fra-node9
provider: LVM_THIN
totalCapacity: 1677721600
errors:
- "Message: 'The specified storage pool 'lvm-thin' on node 'de-fra-node9' can not
be deleted as volumes / snapshot-volumes are still using it.'; Details: 'Volumes
/ snapshot-volumes that are still using the storage pool: \n Node name: 'de-fra-node9',
resource name: 'pvc-1e59589f-e04e-4aee-a1c6-0561a764a7e8', volume number: 0\n
\ Node name: 'de-fra-node9', resource name: 'pvc-2c8ea040-9651-4501-a0d9-7b3920c82ec8',
volume number: 0\n Node name: 'de-fra-node9', resource name: 'pvc-4506c792-6fbb-43b6-a6ca-745084259e0d',
volume number: 0\n Node name: 'de-fra-node9', resource name: 'pvc-4e6cf7f8-2a42-4129-9a9b-cba310b3ed9e',
volume number: 0\n Node name: 'de-fra-node9', resource name: 'pvc-7afac331-4319-4ca7-b587-1ec267dc63b8',
volume number: 0\n Node name: 'de-fra-node9', resource name: 'pvc-7c07be83-43aa-44af-b9c6-c600339ad6a8',
volume number: 0\n Node name: 'de-fra-node9', resource name: 'pvc-8366a898-20ca-4f71-abee-cc3c40ff8bf1',
volume number: 0\n Node name: 'de-fra-node9', resource name: 'pvc-a2bff230-7ca6-4e93-a273-15d217967def',
volume number: 0\n Node name: 'de-fra-node9', resource name: 'pvc-b1cb2468-4c1b-4872-8040-9e9860c45c76',
volume number: 0\n Node name: 'de-fra-node9', resource name: 'pvc-cabb808d-d90b-4c59-831f-a3f08420effc',
volume number: 0\nNode: de-fra-node9, Storage pool name: lvm-thin'; Correction:
'Delete the listed volumes and snapshot-volumes first.'; Reports: '[61F976B7-00000-071670]'"
- "Message: 'The specified storage pool 'lvm-thin' on node 'de-fra-node8' can not
be deleted as volumes / snapshot-volumes are still using it.'; Details: 'Volumes
/ snapshot-volumes that are still using the storage pool: \n Node name: 'de-fra-node8',
resource name: 'pvc-1e59589f-e04e-4aee-a1c6-0561a764a7e8', volume number: 0\n
\ Node name: 'de-fra-node8', resource name: 'pvc-2c8ea040-9651-4501-a0d9-7b3920c82ec8',
volume number: 0\n Node name: 'de-fra-node8', resource name: 'pvc-4506c792-6fbb-43b6-a6ca-745084259e0d',
volume number: 0\n Node name: 'de-fra-node8', resource name: 'pvc-4e6cf7f8-2a42-4129-9a9b-cba310b3ed9e',
volume number: 0\n Node name: 'de-fra-node8', resource name: 'pvc-7afac331-4319-4ca7-b587-1ec267dc63b8',
volume number: 0\n Node name: 'de-fra-node8', resource name: 'pvc-7c07be83-43aa-44af-b9c6-c600339ad6a8',
volume number: 0\n Node name: 'de-fra-node8', resource name: 'pvc-8366a898-20ca-4f71-abee-cc3c40ff8bf1',
volume number: 0\n Node name: 'de-fra-node8', resource name: 'pvc-a2bff230-7ca6-4e93-a273-15d217967def',
volume number: 0\n Node name: 'de-fra-node8', resource name: 'pvc-b1cb2468-4c1b-4872-8040-9e9860c45c76',
volume number: 0\n Node name: 'de-fra-node8', resource name: 'pvc-cabb808d-d90b-4c59-831f-a3f08420effc',
volume number: 0\nNode: de-fra-node8, Storage pool name: lvm-thin'; Correction:
'Delete the listed volumes and snapshot-volumes first.'; Reports: '[61F976B7-00000-071671]'"
- "Message: 'The specified storage pool 'lvm-thin' on node 'de-fra-node10' can not
be deleted as volumes / snapshot-volumes are still using it.'; Details: 'Volumes
/ snapshot-volumes that are still using the storage pool: \n Node name: 'de-fra-node10',
resource name: 'pvc-1e59589f-e04e-4aee-a1c6-0561a764a7e8', volume number: 0\n
\ Node name: 'de-fra-node10', resource name: 'pvc-2c8ea040-9651-4501-a0d9-7b3920c82ec8',
volume number: 0\n Node name: 'de-fra-node10', resource name: 'pvc-4506c792-6fbb-43b6-a6ca-745084259e0d',
volume number: 0\n Node name: 'de-fra-node10', resource name: 'pvc-4e6cf7f8-2a42-4129-9a9b-cba310b3ed9e',
volume number: 0\n Node name: 'de-fra-node10', resource name: 'pvc-7afac331-4319-4ca7-b587-1ec267dc63b8',
volume number: 0\n Node name: 'de-fra-node10', resource name: 'pvc-7c07be83-43aa-44af-b9c6-c600339ad6a8',
volume number: 0\n Node name: 'de-fra-node10', resource name: 'pvc-8366a898-20ca-4f71-abee-cc3c40ff8bf1',
volume number: 0\n Node name: 'de-fra-node10', resource name: 'pvc-a2bff230-7ca6-4e93-a273-15d217967def',
volume number: 0\n Node name: 'de-fra-node10', resource name: 'pvc-b1cb2468-4c1b-4872-8040-9e9860c45c76',
volume number: 0\n Node name: 'de-fra-node10', resource name: 'pvc-cabb808d-d90b-4c59-831f-a3f08420effc',
volume number: 0\nNode: de-fra-node10, Storage pool name: lvm-thin'; Correction:
'Delete the listed volumes and snapshot-volumes first.'; Reports: '[61F976B7-00000-071672]'"
```
> Is it maybe a problem that the storage pools haven't been declared in the values file for the new CRD configuration?
Yes. If they are not set on helm upgrade, helm just removes them, and then the operator tries to delete the storage pool in a loop. It's a bit cumbersome, I know, but that's what helm does :shrug:
So you should edit the LinstorSatelliteSet to say:
```yaml
spec:
  storagePools:
    lvmThinPools:
    - name: lvm-thin
      volumeGroup: vg-pool
      thinVolume: disk-redundant
```
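One way to apply that edit (using the resource name from the earlier comment; just a sketch):
```
kubectl edit LinstorSatelliteSet.piraeus.linbit.com piraeus-op-ns
```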
And also save that to your helm overrides for the next upgrade.
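The matching helm override would look roughly like this; the `operator.satelliteSet.storagePools` path is the one mentioned above, and the file is whatever you pass to `helm upgrade` with `-f`:
```yaml
# values.yaml (excerpt) -- keeping this in the overrides prevents the next
# `helm upgrade` from dropping the storage pool definition again
operator:
  satelliteSet:
    storagePools:
      lvmThinPools:
      - name: lvm-thin
        volumeGroup: vg-pool
        thinVolume: disk-redundant
```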
Awesome, thanks for your help @WanzenBug, that actually resolved the problem. Our expectation was that the new CRD would get merged with the existing information in etcd, so we skipped the setting in values.yaml. Maybe this should be added to:
- release notes
- https://github.com/piraeusdatastore/piraeus-operator/blob/v1.7.0/UPGRADE.md
- values.yaml