[REFACTOR] volume controller refactoring/split up, to simplify the control flow
The behavior is changed so that attachment is handled solely based on v.Spec.NodeID, instead of handling attachment internally via setting v.Status.CurrentNodeID or v.Status.PendingNodeID.
This requires dedicated controllers for operations that previously relied on the internal attachment handling; the required controllers are:
- SalvageController for processing replica failures
- ExpansionController for processing volume/engine/replica size changes
- BackupController for processing DR volumes as well as restore operations
This also changes the behavior so that:
- v.Status.CurrentNodeID is only set once we are fully attached
- v.Status.CurrentNodeID is only unset once we are fully detached
This way the status represents a resting state (i.e., attached/detached), since the transition states are driven by v.Spec.NodeID.
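A minimal sketch of this resting-state rule, assuming simplified stand-in types for the Volume and Engine CRs (the function name, the string states, and the field layout here are illustrative, not the actual longhorn-manager code):

```go
package sketch

// Stand-in types for the Volume and Engine CRs; the real ones live in
// longhorn-manager's API package.
type Volume struct {
	Spec   struct{ NodeID string }
	Status struct{ CurrentNodeID string }
}

type Engine struct {
	Status struct{ CurrentState string }
}

// syncAttachmentState illustrates the rule above: v.Status.CurrentNodeID is
// only set once the volume is fully attached and only cleared once it is
// fully detached; in-flight transitions are driven purely by v.Spec.NodeID.
func syncAttachmentState(v *Volume, e *Engine) {
	switch {
	case v.Spec.NodeID != "" && e.Status.CurrentState == "running":
		v.Status.CurrentNodeID = v.Spec.NodeID // fully attached
	case v.Spec.NodeID == "" && e.Status.CurrentState == "stopped":
		v.Status.CurrentNodeID = "" // fully detached
	default:
		// attach/detach still in progress: leave the resting-state field alone
	}
}
```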
One alternative in my mind is that we can launch a new CRD similar to VolumeAttachment. Then all attachment/detachment-related info (e.g., whether the volume is attached by the system) would be stored in the CR status.
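For illustration only, a rough sketch of what the Go types for such a CRD could look like, with names loosely modeled on the VolumeAttachment YAML shown later in this thread (the struct and field names are assumptions, not a final API):

```go
package sketch

// AttachmentTicket is one attach request from a ticket owner (UI, CSI, or a
// system controller such as salvage/expansion/backup).
type AttachmentTicket struct {
	ID         string            `json:"id"`
	NodeID     string            `json:"nodeID"`
	Type       string            `json:"type"` // who requested the attachment, e.g. "longhorn-api"
	Parameters map[string]string `json:"parameters,omitempty"`
}

type AttachmentTicketStatus struct {
	ID        string `json:"id"`
	Satisfied bool   `json:"satisfied"`
}

// The spec collects all outstanding tickets for one volume; the status records
// whether each ticket is currently satisfied, so attach/detach bookkeeping no
// longer has to live in volume.Status fields.
type VolumeAttachmentSpec struct {
	Volume            string                      `json:"volume"`
	AttachmentTickets map[string]AttachmentTicket `json:"attachmentTickets"`
}

type VolumeAttachmentStatus struct {
	AttachmentTicketStatuses map[string]AttachmentTicketStatus `json:"attachmentTicketStatuses"`
}
```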
Leaving a note here about a race condition that I think should be solved after doing this refactoring:
- Initially, the problematic volume was on node us1sxlxk80262 in the user's cluster.
- At 2021-11-09T23:30:40.164647072Z, all replicas of the volume failed (possibly due to a network issue):
2021-11-09T23:30:40.164647072Z time="2021-11-09T23:30:40Z" level=info msg="All replicas are failed, set engine salvageRequested to true" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=us1sxlxk80262 owner=us1sxlxk80262 state=detached volume=pvc-a40ce347-397d-4e4a-8247-17b9c34733d4
- Longhorn set remountRequested:
2021-11-09T23:30:40.164678769Z time="2021-11-09T23:30:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-a40ce347-397d-4e4a-8247-17b9c34733d4\", UID:\"8a1d05ac-0a70-4665-8591-a23ae599f21a\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"361269172\", FieldPath:\"\"}): type: 'Normal' reason: 'Remount' Volume pvc-a40ce347-397d-4e4a-8247-17b9c34733d4 requested remount at 2021-11-09T23:30:40Z"
Longhorn then tries to delete the workload pod so Kubernetes can restart it.
- In a normal pod deletion process, the CSI plugin clears volume.Spec.NodeID. Longhorn tries to reattach the volume because pendingNodeID was set, then detaches the volume because volume.Spec.NodeID is cleared.
- I am guessing that in the middle of this process, the CSI plugin sent an attach call:
2021-11-09T23:30:58.673536017Z time="2021-11-09T23:30:58Z" level=info msg="Volume pvc-a40ce347-397d-4e4a-8247-17b9c34733d4 attachment to us1sxlxk80255 with disableFrontend false requested"
- The end result is volume.Spec.NodeID=us1sxlxk80255 but volume.Status.CurrentNodeID=us1sxlxk80262. This is an error state from which the volume controller can never recover:
2021-11-10T18:32:13.493947054Z time="2021-11-10T18:32:13Z" level=warning msg="Dropping Longhorn volume longhorn-system/pvc-a40ce347-397d-4e4a-8247-17b9c34733d4 out of the queue" controller=longhorn-volume error="fail to sync longhorn-system/pvc-a40ce347-397d-4e4a-8247-17b9c34733d4: fail to reconcile volume state for pvc-a40ce347-397d-4e4a-8247-17b9c34733d4: volume pvc-a40ce347-397d-4e4a-8247-17b9c34733d4 has already attached to node us1sxlxk80262, but asked to attach to node us1sxlxk80255" node=us1sxlxk80255
See the code https://github.com/longhorn/longhorn-manager/blob/ac065a8deb96193ffec142bcc7e7345615b57b8d/controller/volume_controller.go#L869
Moving to the next sprint because @joshimoo is working on the mTLS implementation instead.
@keithalucas has some ideas about whether we can come up with a workaround for this as a short-term solution.
I don't know if there is a scenario in which we generate the error here while we are actually trying to start the Longhorn engine on the new node and the volume is genuinely still attached to the node in v.Status.CurrentNodeID. We are seeing it after every node in a cluster is restarted, where CurrentNodeID was set before the reboot.
Given the problem that Phan described above, and an issue with the same "already attached to node" log message from a customer that reboots all the nodes in a cluster, I am curious to see whether we can continue with the ReconcileVolumeState function instead of returning an error. This could fix some of the issues before we refactor things.
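To make the stuck state and the proposed short-term change concrete, here is a rough sketch using simplified stand-in types; it is not the actual ReconcileVolumeState code, only the shape of the guard that produces the "already attached" error above and the more lenient handling being discussed:

```go
package sketch

import "fmt"

type volume struct {
	Name   string
	Spec   struct{ NodeID string }
	Status struct{ CurrentNodeID string }
}

// Current behavior (roughly): the mismatch is treated as a hard error, the
// item is eventually dropped from the work queue, and the volume never
// recovers on its own.
func reconcileStrict(v *volume) error {
	if v.Status.CurrentNodeID != "" && v.Spec.NodeID != "" && v.Spec.NodeID != v.Status.CurrentNodeID {
		return fmt.Errorf("volume %v has already attached to node %v, but asked to attach to node %v",
			v.Name, v.Status.CurrentNodeID, v.Spec.NodeID)
	}
	// ... proceed with attaching to v.Spec.NodeID ...
	return nil
}

// Possible short-term workaround discussed above (an assumption, not merged
// code): treat the mismatch as "detach from the stale node first" and keep
// reconciling instead of returning an error.
func reconcileLenient(v *volume) error {
	if v.Status.CurrentNodeID != "" && v.Spec.NodeID != "" && v.Spec.NodeID != v.Status.CurrentNodeID {
		// request detachment from v.Status.CurrentNodeID here, then fall
		// through so a later reconcile can attach to v.Spec.NodeID
	}
	// ... continue with the rest of ReconcileVolumeState ...
	return nil
}
```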
Pre Ready-For-Testing Checklist
- [x] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at: https://github.com/longhorn/longhorn/issues/2527#issuecomment-1568934161
- [x] ~~Is there a workaround for the issue? If so, where is it documented? The workaround is at:~~
- [x] Does the PR include the explanation for the fix or the feature?
- [x] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at: https://github.com/longhorn/longhorn/pull/5940
- [x] Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)? The PR is at https://github.com/longhorn/longhorn-manager/pull/1541
- [x] Which areas/issues this PR might have potential impacts on? Area: Attach/Detach operations. Issues:
- [x] ~~If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted? The LEP PR is at~~
- [x] ~~If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)? The UI issue/PR is at~~
- [x] ~~If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)? The documentation issue/PR is at~~
- [x] ~~If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)? The automation skeleton PR is at The automation test case PR is at The issue of automation test case implementation is at (please create by the template)~~
- [x] ~~If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)? The engine automation PR is at~~
- [x] ~~If labeled: require/manual-test-plan Has the manual test plan been documented? The updated manual test plan is at~~
- [x] ~~If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at~~
Test Plan:
Crash volumes
- Create a deployment using a Longhorn volume
- Write some data to the Longhorn volume's mount point and calculate the checksum
- Crash all replicas of the volume
- Observe that the volume is auto-salvaged and auto-reattached
- Observe that Longhorn kills the workload pod to remount the volume
- Verify that eventually the workload pod becomes running, read/write to the Longhorn volume is OK, and the checksum matches
Crash all nodes
- Create a deployment and a StatefulSet using Longhorn volumes (around 10 Longhorn volumes in total)
- Reboot all nodes in the cluster at around the same time
- Observe that the volumes are auto-salvaged and auto-reattached
- Observe that Longhorn kills the workload pods to remount the volumes
- Verify that eventually the workload pods become running
Hacky test that verifies that the volume reconciliation loop is not stuck even if volume.spec.nodeID != volume.status.currentNodeID
- Create a Longhorn volume
- Use the Longhorn UI to attach the volume to node-1. Wait for the volume to become healthy
- Edit the corresponding VolumeAttachment object to add one more attachment ticket with attachmentID: manual-ticket that selects node-2
- Observe that the manual-ticket ticket cannot be satisfied because the longhorn-ui ticket already requests that the volume be attached to node-1
- kubectl edit <VOLUME-NAME> -n longhorn-system to change volume.spec.nodeID from node-1 to node-2
- Verify that the volume is detached from node-1 and attached to node-2. This means the volume is not stuck even if volume.spec.nodeID != volume.status.currentNodeID
Verified pass on longhorn master (longhorn c81ddd, longhorn-manager 7fa846) with the test steps:
- [x] Crash volumes: after all replicas crashed, the volume was auto-salvaged and auto-reattached, and the pod data was intact.
- [x] Crash all nodes: after all rebooted nodes became ready, the volumes were auto-salvaged and auto-reattached, and all pods' data was intact.
- [x] Hacky test that verifies that the volume reconciliation loop is not stuck even if volume.spec.nodeID != volume.status.currentNodeID: following the steps to add an attachmentTickets entry with attachmentID: manual-ticket, the Satisfied status can be seen to be false.
VolumeAttachment before modifying volume.spec.nodeID:
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
  creationTimestamp: "2023-06-02T09:36:14Z"
  finalizers:
  - longhorn.io
  generation: 3
  labels:
    longhornvolume: vol1
  name: vol1
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: vol1
    uid: 81c91ab3-8f25-4f45-9bd9-3244fb8e7cf1
  resourceVersion: "11665"
  uid: d9e1bc35-cfb8-4208-a3f7-9fd3f127e0ed
spec:
  attachmentTickets:
    "":
      generation: 0
      id: ""
      nodeID: ip-172-31-18-66
      parameters:
        disableFrontend: "false"
        lastAttachedBy: ""
      type: longhorn-api
    manual-ticket:
      generation: 1
      id: manual-ticket
      nodeID: ip-172-31-19-6
      parameters:
        disableFrontend: "false"
        lastAttachedBy: ""
      type: longhorn-api
  volume: vol1
status:
  attachmentTicketStatuses:
    "":
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2023-06-02T09:36:25Z"
        message: ""
        reason: ""
        status: "True"
        type: Satisfied
      generation: 0
      id: ""
      satisfied: true
    manual-ticket:
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2023-06-02T09:37:27Z"
        message: 'the volume is currently attached to different node ip-172-31-18-66 '
        reason: ""
        status: "False"
        type: Satisfied
      generation: 1
      id: manual-ticket
      satisfied: false
After modifying volume.spec.nodeID to node 2 (ip-172-31-19-6), the Satisfied status of manual-ticket becomes true and the volume is attached to node 2 and healthy.
VolumeAttachment after modifying volume.spec.nodeID:
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
  creationTimestamp: "2023-06-02T09:36:14Z"
  finalizers:
  - longhorn.io
  generation: 3
  labels:
    longhornvolume: vol1
  name: vol1
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: vol1
    uid: 81c91ab3-8f25-4f45-9bd9-3244fb8e7cf1
  resourceVersion: "12439"
  uid: d9e1bc35-cfb8-4208-a3f7-9fd3f127e0ed
spec:
  attachmentTickets:
    "":
      generation: 0
      id: ""
      nodeID: ip-172-31-18-66
      parameters:
        disableFrontend: "false"
        lastAttachedBy: ""
      type: longhorn-api
    manual-ticket:
      generation: 1
      id: manual-ticket
      nodeID: ip-172-31-19-6
      parameters:
        disableFrontend: "false"
        lastAttachedBy: ""
      type: longhorn-api
  volume: vol1
status:
  attachmentTicketStatuses:
    "":
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2023-06-02T09:42:43Z"
        message: 'the volume is currently attached to different node ip-172-31-19-6 '
        reason: ""
        status: "False"
        type: Satisfied
      generation: 0
      id: ""
      satisfied: false
    manual-ticket:
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2023-06-02T09:42:49Z"
        message: ""
        reason: ""
        status: "True"
        type: Satisfied
      generation: 1
      id: manual-ticket
      satisfied: true