
Resources stuck on DELETING when one of the nodes is not available

Open · kvaps opened this issue on Mar 20 '20 · 9 comments

Node m1c12 is currently broken:

+-------------------------------------------------------------------------------------+
| ResourceName                             | Node  | Port | Usage  | Conns |    State |
|=====================================================================================|
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c12 | 7366 |        |       |  Unknown |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c4  | 7366 | Unused |       | UpToDate |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c5  | 7366 |        |       | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c6  | 7366 |        |       | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c9  | 7366 |        |       | DELETING |
+-------------------------------------------------------------------------------------+

If I create a diskless replica on a new node:

# linstor r c m1c7 pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a --diskless
SUCCESS:
Description:
    New resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on node 'm1c7' registered.
Details:
    Resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on node 'm1c7' UUID is: d3258e98-4c0b-4498-9f1c-6126923c769d
SUCCESS:
Description:
    Volume with number '0' on resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on node 'm1c7' successfully registered
Details:
    Volume UUID is: aed70f54-f747-4b0f-a6d1-53fbf2101746
WARNING:
Description:
    No active connection to satellite 'm1c12'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
SUCCESS:
    Added peer(s) 'm1c7' to resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c9'
SUCCESS:
    Added peer(s) 'm1c7' to resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c6'
SUCCESS:
    Added peer(s) 'm1c7' to resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c5'
SUCCESS:
    Created resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c7'
SUCCESS:
    Added peer(s) 'm1c7' to resource 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c4'

It will be placed without any problems:

+-------------------------------------------------------------------------------------+
| ResourceName                             | Node  | Port | Usage  | Conns |    State |
|=====================================================================================|
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c12 | 7366 |        |       |  Unknown |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c4  | 7366 | Unused |       | UpToDate |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c5  | 7366 |        |       | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c6  | 7366 |        |       | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c7  | 7366 | Unused |       | Diskless |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c9  | 7366 |        |       | DELETING |
+-------------------------------------------------------------------------------------+

But if I try to remove it:

SUCCESS:
Description:
    Node: m1c7, Resource: pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a marked for deletion.
Details:
    Node: m1c7, Resource: pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a UUID is: d3258e98-4c0b-4498-9f1c-6126923c769d
WARNING:
Description:
    No active connection to satellite 'm1c12'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
SUCCESS:
    Notified 'm1c6' that 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' is being deleted on Node(s): [m1c7]
SUCCESS:
    Notified 'm1c5' that 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' is being deleted on Node(s): [m1c7]
SUCCESS:
    Notified 'm1c9' that 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' is being deleted on Node(s): [m1c7]
SUCCESS:
    Deleted 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' on 'm1c7'
SUCCESS:
    Notified 'm1c4' that 'pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a' is being deleted on Node(s): [m1c7]

It will stay stuck in DELETING until node m1c12 comes back online:

+-------------------------------------------------------------------------------------+
| ResourceName                             | Node  | Port | Usage  | Conns |    State |
|=====================================================================================|
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c12 | 7366 |        |       |  Unknown |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c4  | 7366 | Unused |       | UpToDate |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c5  | 7366 |        |       | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c6  | 7366 |        |       | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c7  | 7366 |        |       | DELETING |
| pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a | m1c9  | 7366 |        |       | DELETING |
+-------------------------------------------------------------------------------------+

The main problem is that this behavior makes m1c5, m1c6, m1c7 and m1c9 unavailable for further use: the resource can't be placed there anymore, because it is already there, stuck in the DELETING state.
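
If the broken node is permanently gone, one possible workaround (an assumption on my part, not a confirmed general fix, and it removes the node and everything LINSTOR knows about it) is to declare the node lost, so the controller stops waiting for its confirmation of the pending deletions. Note that a later comment in this thread shows a case where a resource stayed in DELETING even after node lost:

# linstor node lost m1c12
# linstor r l -r pvc-ac5f8c9c-d68b-4dd3-9258-fb05cfe5e05a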

kvaps commented on Mar 20 '20

There were plans for the Controller to delete resources immediately if the Satellites had never been informed that the resource exists (because then a Satellite obviously cannot have remnants of the resource configured). This has not been implemented yet. For resources that were actually deployed by a Satellite, on the other hand, this behavior is intentional, even if they are diskless.

raltnoeder commented on Mar 23 '20

I've got another resource which got stuck on DELETING even after linstor node lost:

+---------------------------------------------------------------------------------------------------+
| ResourceName       | Node   | Port | Usage  | Conns                                    |    State |
|===================================================================================================|
| one-vm-7430-disk-0 | m10c32 | 7857 |        | Ok                                       | DELETING |
| one-vm-7430-disk-0 | m10c33 | 7857 |        | Ok                                       | DELETING |
| one-vm-7430-disk-0 | m13c41 | 7857 |        | Ok                                       | DELETING |
| one-vm-7430-disk-0 | m14c19 | 7857 | Unused | StandAlone(storage10-f181.dc1.wedos.org) | UpToDate |
| one-vm-7430-disk-0 | m15c30 | 7857 | InUse  | Ok                                       | Diskless |
+---------------------------------------------------------------------------------------------------+
root@m14c19 # drbdadm status one-vm-7430-disk-0
one-vm-7430-disk-0 role:Secondary
  disk:UpToDate
  m15c30 role:Primary
    peer-disk:Diskless
# linstor r d m10c32 one-vm-7430-disk-0
SUCCESS:
Description:
    Node: m10c32, Resource: one-vm-7430-disk-0 marked for deletion.
Details:
    Node: m10c32, Resource: one-vm-7430-disk-0 UUID is: ebad62ca-e845-4495-bcee-88f126c1dea6
SUCCESS:
    Notified 'm15c30' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
SUCCESS:
    Deleted 'one-vm-7430-disk-0' on 'm10c32'
SUCCESS:
    Notified 'm10c33' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
ERROR:
    (Node: 'm14c19') Failed to adjust DRBD resource one-vm-7430-disk-0
Show reports:
    linstor error-reports show 5E850A32-51DD2-000034
SUCCESS:
    Notified 'm13c41' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]

5E850A32-51DD2-000034.log

And after a satellite restart:

root@linstor-controller-0:/# linstor r l -r one-vm-7430-disk-0
+----------------------------------------------------------------+
| ResourceName       | Node   | Port | Usage  | Conns |    State |
|================================================================|
| one-vm-7430-disk-0 | m10c32 | 7857 |        | Ok    | DELETING |
| one-vm-7430-disk-0 | m10c33 | 7857 |        | Ok    | DELETING |
| one-vm-7430-disk-0 | m13c41 | 7857 |        | Ok    | DELETING |
| one-vm-7430-disk-0 | m14c19 | 7857 | Unused | Ok    | UpToDate |
| one-vm-7430-disk-0 | m15c30 | 7857 | InUse  | Ok    | Diskless |
+----------------------------------------------------------------+
# linstor r d m10c32 one-vm-7430-disk-0
SUCCESS:
Description:
    Node: m10c32, Resource: one-vm-7430-disk-0 marked for deletion.
Details:
    Node: m10c32, Resource: one-vm-7430-disk-0 UUID is: ebad62ca-e845-4495-bcee-88f126c1dea6
SUCCESS:
    Notified 'm15c30' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
SUCCESS:
    Deleted 'one-vm-7430-disk-0' on 'm10c32'
SUCCESS:
    Notified 'm10c33' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]
ERROR:
    (Node: 'm14c19') Failed to adjust DRBD resource one-vm-7430-disk-0
Show reports:
    linstor error-reports show 5E984D52-51DD2-000002
SUCCESS:
    Notified 'm13c41' that 'one-vm-7430-disk-0' is being deleted on Node(s): [m10c32]

5E984D52-51DD2-000002.log

kvaps commented on Apr 16 '20

This behavior also blocks the Kubernetes integration:

# linstor r l -r pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa
+--------------------------------------------------------------------------------------+
| ResourceName                             | Node  | Port  | Usage  | Conns |    State |
|======================================================================================|
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m6c22 | 55004 |        | Ok    | DELETING |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m6c30 | 55004 |        | Ok    | DELETING |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m9c2  | 55004 | Unused | Ok    | UpToDate |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m9c8  | 55004 | Unused | Ok    | Outdated |
+--------------------------------------------------------------------------------------+

# linstor n l -n m6c22 m6c30 m9c2 m9c8
+----------------------------------------------------------+
| Node  | NodeType  | Addresses                  | State   |
|==========================================================|
| m6c22 | SATELLITE | 10.36.129.22:3366 (PLAIN)  | Online  |
| m6c30 | SATELLITE | 10.36.129.30:3366 (PLAIN)  | OFFLINE |
| m9c2  | SATELLITE | 10.36.129.137:3366 (PLAIN) | Online  |
| m9c8  | SATELLITE | 10.36.129.143:3366 (PLAIN) | Online  |
+----------------------------------------------------------+
# kubectl get volumeattachments.storage.k8s.io| grep pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa
csi-f2e21eb24cf3f139f1ee784e409542754ff793f36a2937706cc225e6092a02ca   linstor.csi.linbit.com   pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa   m6c22    true       6m17s

The VolumeAttachment can't be removed from the node with the DELETING resource, so the workload can't be moved anywhere else until you manually remove the VolumeAttachment.
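
For reference, the manual removal mentioned above boils down to deleting the stale VolumeAttachment object by name (the name is taken from the kubectl output above; this should only be forced once you are sure the old node no longer writes to the volume):

# kubectl delete volumeattachment csi-f2e21eb24cf3f139f1ee784e409542754ff793f36a2937706cc225e6092a02ca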

And even if you do, the volume still fails to mount:

  Warning  FailedMount             3m5s (x7 over 3m50s)  kubelet, m16c33          MountVolume.SetUp failed for volume "pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = Internal desc = NodePublishVolume failed for pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa: checking for exclusive open failed: wrong medium type, check device health
root@m16c33:~# drbdadm status pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa
pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa role:Secondary
  disk:Diskless
  m9c2 connection:Connecting
  m9c8 role:Secondary
    peer-disk:Outdated

# linstor r l -r pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa
+--------------------------------------------------------------------------------------------------+
| ResourceName                             | Node   | Port  | Usage  | Conns            |    State |
|==================================================================================================|
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m16c33 | 55004 | Unused | Connecting(m9c2) | Diskless |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m6c22  | 55004 |        | Ok               | DELETING |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m6c30  | 55004 |        | Ok               | DELETING |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m9c2   | 55004 | Unused | Ok               | UpToDate |
| pvc-af0fc78a-b716-4958-9032-50e7d1ddb2fa | m9c8   | 55004 | Unused | Ok               | Outdated |
+--------------------------------------------------------------------------------------------------+

kvaps commented on May 22 '20

Today we faced this problem again.

Node m7c29 went offline and Kubernetes tried to restart the pod on another node, but did not succeed because the device was blocked on detaching:

  Warning  FailedAttachVolume  13m        attachdetach-controller  Multi-Attach error for volume "pvc-3c485a50-c357-4185-9fef-56980fb0726e" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedAttachVolume  13m        attachdetach-controller  Multi-Attach error for volume "pvc-71d7f0aa-2bc5-4410-a94f-17843ecad0eb" Volume is already exclusively attached to one node and can't be attached to another
╭─────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-71d7f0aa-2bc5-4410-a94f-17843ecad0eb ┊ m12c2 ┊ 8970 ┊ Unused ┊ Ok    ┊ UpToDate ┊           ┊
┊ pvc-71d7f0aa-2bc5-4410-a94f-17843ecad0eb ┊ m7c1  ┊ 8970 ┊ Unused ┊ Ok    ┊ UpToDate ┊           ┊
┊ pvc-71d7f0aa-2bc5-4410-a94f-17843ecad0eb ┊ m7c29 ┊ 8970 ┊        ┊ Ok    ┊ DELETING ┊           ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-3c485a50-c357-4185-9fef-56980fb0726e ┊ m5c18 ┊ 8971 ┊ Unused ┊ Ok    ┊ UpToDate ┊                     ┊
┊ pvc-3c485a50-c357-4185-9fef-56980fb0726e ┊ m7c29 ┊ 8971 ┊        ┊ Ok    ┊ DELETING ┊                     ┊
┊ pvc-3c485a50-c357-4185-9fef-56980fb0726e ┊ m7c36 ┊ 8971 ┊ Unused ┊ Ok    ┊ Diskless ┊ 2020-11-16 09:28:25 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯

kvaps commented on Nov 16 '20

The main problem is that every satellite has to confirm the deletion of a known resource. The "local" satellite obviously has to confirm the deletion (as it is the one that is actually deleting the resource...). Although it may sound strange, "remote" satellites also have to confirm the deletion of a resource, as layers like DRBD need to perform cleanup operations like deleting and forgetting about the peer (resetting bitmap-tracking of changes to "day-zero").

Allowing a resource that is still in DELETING state to be recreated is also not really a good idea / anything but trivial, as basically everything besides the node-name and resource-name could change for that recreated resource, including the layer-list, storage pool(s), properties and so on...

One possibility would be to ignore all new settings of the resource in DELETING state and just "undo deletion", without creating a new resource from scratch with the same node-name and resource-name (here I am only talking about the LINSTOR-internal objects and data structures. The already deleted LVs, DRBD resource, etc... obviously have to be recreated, but using the same data as before).

ghernadi commented on Nov 18 '20

I would like to briefly summarize the issue. If one node is not online, there are actually two problems:

  1. Kubernetes migration will not work, since the linstor-csi driver, which actually just implements the external-attacher calls, waits for successful detachment of the resource:

    • https://github.com/piraeusdatastore/linstor-csi/blob/d2f23d6df7cefe22b3cfe450cc5a2df9e37efa10/pkg/driver/driver.go#L618
    • https://github.com/piraeusdatastore/linstor-csi/blob/d2f23d6df7cefe22b3cfe450cc5a2df9e37efa10/pkg/client/linstor.go#L460
    • https://github.com/LINBIT/golinstor/blob/d56e7f7cfa7d9a823640606f85a3d7345ce9eb4a/client/resource.go#L462

    Not sure why it was working previously. In OpenNebula I solved this by ignoring all errors on detaching.

  2. Reusing a resource while it is stuck in DELETING (see the command sketch after this list). E.g. if:

  • you have two storage nodes, node1 and node2, and two compute nodes, node3 and node4
  • a resource is deployed on node1 and node2 (diskful replicas) and on node3 (diskless replica)

then there are two cases in which you hit this problem:

  • you undeploy the workload from node3 and then deploy it back on node3
  • you migrate the workload from node3 to node4 and then migrate it back from node4 to node3
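
A minimal command-level sketch of the first case, assuming the hypothetical resource name myres and that node2 is offline while node3's diskless replica is being removed:

# linstor r c node3 myres --diskless
(the workload is scheduled on node3, so a diskless replica is created there)
# linstor r d node3 myres
(undeploy; the replica on node3 goes to DELETING and stays there while node2 is offline)
# linstor r c node3 myres --diskless
(redeploying back to node3 fails, because the resource still exists there in DELETING state)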

kvaps commented on Nov 18 '20

Although it may sound strange, "remote" satellites also have to confirm the deletion of a resource

@ghernadi, could you please clarify why we need the confirmation from the other "remote" satellites, and what could happen if we removed the resource from the local node immediately, without this confirmation?

layers like DRBD need to perform cleanup operations like deleting and forgetting about the peer (resetting bitmap-tracking of changes to "day-zero").

But they can do the cleanup later, when the broken node comes back online, can't they?

kvaps commented on Nov 18 '20

could you please clarify why we need the confirmation from the other "remote" satellites, and what could happen if we removed the resource from the local node immediately, without this confirmation?

If you have diskful thinly-provisioned resources on A and B, A goes offline and the resource on B gets deleted. If A does not perform the cleanup and the controller just releases the reserved data of the resource on node B, that data will be reused on the next resource creation. Right now I am talking about node-ids in particular, but this could also be true for other data, maybe even from future features. So let's say you create a new resource on C (although B would also be free, we just choose not to use it for this scenario). C gets the same node-id as the already deleted B. When A comes back online, it obviously updates its .res file, connecting to C instead of B, but as the node-ids are the same, the entire initial sync is skipped - not just the full sync but also the partial sync from day-zero. In this case C would read garbage from its local disk until it finally gets updated from A (through a write operation on A). But the actual worst-case scenario would be if not A but C became primary and only updated small parts, whereas the "untouched" blocks would get replicated to A, effectively destroying your good data on A as well.
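
To make the node-id part more concrete: the DRBD configuration that LINSTOR generates assigns every participating node a node-id, roughly as in the simplified, hand-written fragment below (it is not valid configuration as-is; it only illustrates why handing B's freed node-id to C while A is offline is dangerous):

resource r0 {
    on A { node-id 0; ... }
    on B { node-id 1; ... }   # B's resource is deleted while A is offline
    on C { node-id 1; ... }   # if C reuses B's node-id, A will later treat C as an already-synced peer
}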

But they can do the cleanup later, when the broken node comes back online, can't they?

Yes, they can clean up later, but Linstor needs some data to describe what needs to be updated. That data is the resource object, which stays in DELETING state :)

Don't get me wrong, I'm not trying to convince anyone here that the way it is now is the only possible solution and should not be touched. Solving something like this is just not that trivial, as I'm afraid it could require changes to parts of the core design of Linstor... But I still have to think about this some more (when I find some time, at least).

ghernadi commented on Nov 19 '20

Very clear, thank you! And now I am learning the amazing details of the DRBD implementation.

OK, now I understand the original problem. I was just thinking about how we can solve this with no changes to the core design, and I think I have some ideas:

  • Can't we just exclude diskless resources from this check? If I understand it correctly, they do not participate in the synchronization process, so we would not need to wait until the node with the diskless resource comes back online and confirms the deletion. However, you mentioned some future features, so I'm not sure whether this solution is acceptable.

  • Another option would be to implement a check: does the offline node actually need to confirm the deletion, or did it never have any connection to this resource in the first place, e.g. when the resource being deleted was created after the node went offline?

Anyway, I like your idea of un-deleting the resources; I think that is a working solution as well.

@ghernadi I'm not pressing you to solve this problem at all, it was just an interesting topic for me to dig into. Thank you for your feedback!

kvaps commented on Nov 19 '20