linstor-server
Resources stuck on DELETING when one of the nodes is not available
Node m1c12 is currently broken:
+-------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|=====================================================================================|
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c12 | 7366 | | | Unknown |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c4 | 7366 | Unused | | UpToDate |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c5 | 7366 | | | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c6 | 7366 | | | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c9 | 7366 | | | DELETING |
+-------------------------------------------------------------------------------------+
If I create a diskless replica on a new node:
# linstor r c m1c7 pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a --diskless
SUCCESS:
Description:
New resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on node 'm1c7' registered.
Details:
Resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on node 'm1c7' UUID is: d3258e98-4c0b-4498-9f1c-6126923c769d
SUCCESS:
Description:
Volume with number '0' on resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on node 'm1c7' successfully registered
Details:
Volume UUID is: aed70f54-f747-4b0f-a6d1-53fbf2101746
WARNING:
Description:
No active connection to satellite 'm1c12'
Details:
The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
SUCCESS:
Added peer(s) 'm1c7' to resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c9'
SUCCESS:
Added peer(s) 'm1c7' to resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c6'
SUCCESS:
Added peer(s) 'm1c7' to resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c5'
SUCCESS:
Created resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c7'
SUCCESS:
Added peer(s) 'm1c7' to resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c4'
It gets placed without any problems:
+-------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|=====================================================================================|
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c12 | 7366 | | | Unknown |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c4 | 7366 | Unused | | UpToDate |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c5 | 7366 | | | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c6 | 7366 | | | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c7 | 7366 | Unused | | Diskless |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c9 | 7366 | | | DELETING |
+-------------------------------------------------------------------------------------+
But if I try to remove it:
SUCCESS:
Description:
Node: m1c7, Resource: pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a marked for deletion.
Details:
Node: m1c7, Resource: pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a UUID is: d3258e98-4c0b-4498-9f1c-6126923c769d
WARNING:
Description:
No active connection to satellite 'm1c12'
Details:
The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
SUCCESS:
Notified 'm1c6' that 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' is being deleted on Node(s): [m1c7]
SUCCESS:
Notified 'm1c5' that 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' is being deleted on Node(s): [m1c7]
SUCCESS:
Notified 'm1c9' that 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' is being deleted on Node(s): [m1c7]
SUCCESS:
Deleted 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c7'
SUCCESS:
Notified 'm1c4' that 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' is being deleted on Node(s): [m1c7]
It will be stuck in the DELETING state until node m1c12 comes back online:
+-------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|=====================================================================================|
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c12 | 7366 | | | Unknown |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c4 | 7366 | Unused | | UpToDate |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c5 | 7366 | | | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c6 | 7366 | | | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c7 | 7366 | | | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c9 | 7366 | | | DELETING |
+-------------------------------------------------------------------------------------+
The main problem is that this behavior makes m1c5, m1c6, m1c7 and m1c9 unavailable for further use: the resource can no longer be placed there, because it is already there and stuck in the DELETING state.
There were plans that the Controller would delete resources immediately if the Satellite had never been informed that the resource exists (because then the Satellite obviously cannot have remains of the resource configured). This has not been implemented yet. For resources that were actually deployed by a Satellite, on the other hand, this behavior is intentional, even if they are diskless.
I've got another resource that is stuck in DELETING even after linstor node lost:
+---------------------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|===================================================================================================|
| one-vm-7430-disk-0 | m10c32 | 7857 | | Ok | DELETING |
| one-vm-7430-disk-0 | m10c33 | 7857 | | Ok | DELETING |
| one-vm-7430-disk-0 | m13c41 | 7857 | | Ok | DELETING |
| one-vm-7430-disk-0 | m14c19 | 7857 | Unused | StandAlone(storage10-f181.dc1.wedos.org) | UpToDate |
| one-vm-7430-disk-0 | m15c30 | 7857 | InUse | Ok | Diskless |
+---------------------------------------------------------------------------------------------------+
root@m14c19 # drbdadm status one-vm-7430-disk-0
one-vm-7430-disk-0 role:Secondary
disk:UpToDate
m15c30 role:Primary
peer-disk:Diskless
# linstor r d m10c32 one-vm-7430-disk-0
SUCCESS:
Description:
Node: m10c32, Resource: one-vm-7430-disk-0 marked for deletion.
Details:
Node: m10c32, Resource: one-vm-7430-disk-0 UUID is: ebad62ca-e845-4495-bcee-88f126c1dea6
SUCCESS:
Notified 'm15c30' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
SUCCESS:
Deleted 'one-vm-7430-disk-0' on 'm10c32'
SUCCESS:
Notified 'm10c33' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
ERROR:
(Node: 'm14c19') Failed to adjust DRBD resource one-vm-7430-disk-0
Show reports:
linstor error-reports show 5E850A32-51DD2-000034
SUCCESS:
Notified 'm13c41' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
And after a satellite restart:
root@linstor-controller-0:/# linstor r l -r one-vm-7430-disk-0
+----------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|================================================================|
| one-vm-7430-disk-0 | m10c32 | 7857 | | Ok | DELETING |
| one-vm-7430-disk-0 | m10c33 | 7857 | | Ok | DELETING |
| one-vm-7430-disk-0 | m13c41 | 7857 | | Ok | DELETING |
| one-vm-7430-disk-0 | m14c19 | 7857 | Unused | Ok | UpToDate |
| one-vm-7430-disk-0 | m15c30 | 7857 | InUse | Ok | Diskless |
+----------------------------------------------------------------+
# linstor r d m10c32 one-vm-7430-disk-0
SUCCESS:
Description:
Node: m10c32, Resource: one-vm-7430-disk-0 marked for deletion.
Details:
Node: m10c32, Resource: one-vm-7430-disk-0 UUID is: ebad62ca-e845-4495-bcee-88f126c1dea6
SUCCESS:
Notified 'm15c30' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
SUCCESS:
Deleted 'one-vm-7430-disk-0' on 'm10c32'
SUCCESS:
Notified 'm10c33' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
ERROR:
(Node: 'm14c19') Failed to adjust DRBD resource one-vm-7430-disk-0
Show reports:
linstor error-reports show 5E984D52-51DD2-000002
SUCCESS:
Notified 'm13c41' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
This behavior also blocks the Kubernetes integration:
# linstor r l -r pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa
+--------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|======================================================================================|
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m6c22 | 55004 | | Ok | DELETING |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m6c30 | 55004 | | Ok | DELETING |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m9c2 | 55004 | Unused | Ok | UpToDate |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m9c8 | 55004 | Unused | Ok | Outdated |
+--------------------------------------------------------------------------------------+
# linstor n l -n m6c22 m6c30 m9c2 m9c8
+----------------------------------------------------------+
| Node | NodeType | Addresses | State |
|==========================================================|
| m6c22 | SATELLITE | 10.36.129.22:3366 (PLAIN) | Online |
| m6c30 | SATELLITE | 10.36.129.30:3366 (PLAIN) | OFFLINE |
| m9c2 | SATELLITE | 10.36.129.137:3366 (PLAIN) | Online |
| m9c8 | SATELLITE | 10.36.129.143:3366 (PLAIN) | Online |
+----------------------------------------------------------+
# kubectl get volumeattachments.storage.k8s.io| grep pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa
csi-f2e21eb24cf3f139f1ee784e409542754ff793f36a2937706cc225e6092a02ca linstor.csi.linbit.com pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa m6c22 true 6m17s
The volume attachment can't be removed from the node with the DELETING resource, so the workload can't be moved elsewhere until you manually remove the volumeattachment.
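For reference, the manual removal looks roughly like this (the attachment name is taken from the listing above; the finalizer patch is only my assumption, for the case where the delete hangs because the external-attacher never confirms the detach):
# kubectl delete volumeattachments.storage.k8s.io csi-f2e21eb24cf3f139f1ee784e409542754ff793f36a2937706cc225e6092a02ca
# kubectl patch volumeattachments.storage.k8s.io csi-f2e21eb24cf3f139f1ee784e409542754ff793f36a2937706cc225e6092a02ca --type=merge -p '{"metadata":{"finalizers":null}}'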
And even if you do, the pod on the new node is then stuck on:
Warning FailedMount 3m5s (x7 over 3m50s) kubelet, m16c33 MountVolume.SetUp failed for volume "pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = Internal desc = NodePublishVolume failed for pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa: checking for exclusive open failed: wrong medium type, check device health
root@m16c33:~# drbdadm status pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa
pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa role:Secondary
disk:Diskless
m9c2 connection:Connecting
m9c8 role:Secondary
peer-disk:Outdated
# linstor r l -r pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa
+--------------------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|==================================================================================================|
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m16c33 | 55004 | Unused | Connecting(m9c2) | Diskless |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m6c22 | 55004 | | Ok | DELETING |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m6c30 | 55004 | | Ok | DELETING |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m9c2 | 55004 | Unused | Ok | UpToDate |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m9c8 | 55004 | Unused | Ok | Outdated |
+--------------------------------------------------------------------------------------------------+
Today we faced this problem again. Node m7c29 went offline and Kubernetes tried to restart the pod on another node, but didn't succeed because the device was blocked on detaching:
Warning FailedAttachVolume 13m attachdetach-controller Multi-Attach error for volume "pvc-3c485a50-c357-4185-9fef-56980fb0726e" Volume is already exclusively attached to one node and can't be attached to another
Warning FailedAttachVolume 13m attachdetach-controller Multi-Attach error for volume "pvc-71d7f0aa-2bc5-4410-a94f-17843ecad0eb" Volume is already exclusively attached to one node and can't be attached to another
╭─────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-71d7f0aa-2bc5-4410-a94f-17843ecad0eb ┊ m12c2 ┊ 8970 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ pvc-71d7f0aa-2bc5-4410-a94f-17843ecad0eb ┊ m7c1 ┊ 8970 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ pvc-71d7f0aa-2bc5-4410-a94f-17843ecad0eb ┊ m7c29 ┊ 8970 ┊ ┊ Ok ┊ DELETING ┊ ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-3c485a50-c357-4185-9fef-56980fb0726e ┊ m5c18 ┊ 8971 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ pvc-3c485a50-c357-4185-9fef-56980fb0726e ┊ m7c29 ┊ 8971 ┊ ┊ Ok ┊ DELETING ┊ ┊
┊ pvc-3c485a50-c357-4185-9fef-56980fb0726e ┊ m7c36 ┊ 8971 ┊ Unused ┊ Ok ┊ Diskless ┊ 2020-11-16 09:28:25 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
The main problem is that every satellite has to confirm the deletion of a known resource. The "local" satellite obviously has to confirm the deletion (as it is the one that is actually deleting the resource...). Although it may sound strange, "remote" satellites also have to confirm the deletion of a resource, as layers like DRBD need to perform cleanup operations such as deleting and forgetting about the peer (resetting bitmap tracking of changes back to "day zero").
Allowing a resource that is still in the DELETING state to be recreated is also not really a good idea / anything but trivial, as basically everything besides the node name and resource name could change for that recreated resource, including the layer list, storage pool(s), properties and so on...
One possibility would be to ignore all new settings of a resource in the DELETING state and just "undo" the deletion, without creating a new resource from scratch with the same node name and resource name (here I am only talking about the Linstor-internal objects and data structures; the already deleted LVs, DRBD resource, etc. obviously have to be recreated, but using the same data as before).
I would like to briefly summarize the issue. If one node is not online, there are actually two problems:

1. Kubernetes migration will not work, since the linstor-csi driver, which actually just implements the external-attacher calls, waits for a successful detachment of the resource:

- https://github.com/piraeusdatastore/linstor-csi/blob/d2f23d6df7cefe22b3cfe450cc5a2df9e37efa10/pkg/driver/driver.go#L618
- https://github.com/piraeusdatastore/linstor-csi/blob/d2f23d6df7cefe22b3cfe450cc5a2df9e37efa10/pkg/client/linstor.go#L460
- https://github.com/LINBIT/golinstor/blob/d56e7f7cfa7d9a823640606f85a3d7345ce9eb4a/client/resource.go#L462

Not sure why it was working previously. In OpenNebula I solved this by ignoring all errors on detaching.

2. Reusing a resource while it is stuck in DELETING, e.g.:

- you have two storage nodes, node1 and node2, and two compute nodes, node3 and node4
- the resource is deployed on node1 and node2 (diskful replicas) and on node3 (diskless replica)

Then there are two cases in which you hit this problem (see the sketch after this list):

- you undeploy the workload from node3 and then deploy it back to node3
- you migrate the workload from node3 to node4 and then migrate it back from node4 to node3
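A minimal reproduction sketch of the second case, assuming the hypothetical node names from the list above, a hypothetical resource name pvc-example, and that node2 is currently offline (the commands are the same linstor calls used earlier in this issue):
# linstor resource delete node3 pvc-example
# linstor resource list -r pvc-example
# linstor resource create node3 pvc-example --diskless
The delete only marks the resource for deletion: on node3 it stays in the DELETING state because the offline node2 never confirms the removal, and the subsequent create on node3 is rejected because a resource with that name still exists there.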
Although it may sound strange, "remote" satellites also have to confirm the deletion of a resource
@ghernadi, could you please clarify why we need the confirmation from the other "remote" satellites, and what could happen if we immediately removed the resource from the local node without this confirmation?
layers like DRBD need to perform cleanup operations such as deleting and forgetting about the peer (resetting bitmap tracking of changes back to "day zero")
But they can do the cleanup later, when the broken node comes back online, can't they?
could you please clarify why we need the confirmation from the other "remote" satellites, and what could happen if we immediately removed the resource from the local node without this confirmation?
Say you have diskful, thinly provisioned resources on A and B, A goes offline, and the resource on B gets deleted. If A does not perform the cleanup and the controller just releases the reserved data of the resource on node B, that data will be reused on the next resource create. Right now I am talking about node IDs in particular, but this could also be true for other data, maybe even for future features. So let's say you create a new resource on C (although B would also be free, we just choose not to use it in this scenario). C gets the same node ID as the already deleted B. When A comes back online, it obviously updates its .res file, connecting to C instead of B, but as the node IDs are the same, the entire initial sync is skipped - not just the full sync, but also the partial sync from day zero. In this case C would read garbage from its local disk until it finally gets updated from A (through a write operation on A). But the actual worst-case scenario would be if C rather than A became primary and only updated small parts, while the "untouched" blocks would get replicated to A, effectively destroying your good data on A as well.
But they can do the cleanup later, when the broken node comes back online, can't they?
Yes, they can clean up later, but Linstor needs some data describing what needs to be updated. That data is the resource object, which stays in the DELETING state :)
Don't get me wrong, I'm not trying to convince anyone here that the way it is now is the only possible solution and should not be touched. Solving something like this is just not trivial, as I'm afraid it could require changes to parts of Linstor's core design... But I still have to think about this more (when I find some time, at least).
Very clear, thank you! And now I am learning the amazing details of the DRBD implementation.
OK, now I understand the original problem. I was just thinking about how we can solve this with no changes to the core design, and I think I have some ideas.
- Can't we just exclude diskless resources from this check? If I understand correctly, they do not participate in the synchronization process, so we would not have to wait until the node with the diskless resource comes back online and confirms the deletion. However, you mentioned some future features, so I'm not sure whether this solution is acceptable.
- Another option would be to implement a check of whether the offline node actually needs to confirm the deletion, or whether it never had any connection to this resource in the first place, e.g. when the deleted resource was created after the node went offline.

Anyway, I like your idea of un-deleting the resources; I think that is a working solution as well.
@ghernadi, I'm not rushing you to solve this problem at all; it was just an interesting topic for me. Thank you for your feedback!