[REFACTOR] volume controller refactoring/split up, to simplify the control flow
The behavior is changed so that attachment is handled solely based on v.Spec.NodeID, instead of handling attachment internally via setting v.Status.CurrentNodeID or v.Status.PendingNodeID.
This requires dedicated controllers for operations that previously relied on the internal attachment handling; the required controllers are:
- SalvageController for processing replica failures
- ExpansionController for processing volume/engine/replica size changes
- BackupController for processing DR volumes as well as restore operations
This also changes the behavior so that:
- v.Status.CurrentNodeID is only set once we are fully attached
- v.Status.CurrentNodeID is only unset once we are fully detached
This way the status represents a resting state (i.e., attached/detached), since the transition states are driven by v.Spec.NodeID.
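A minimal sketch of this resting-state rule, assuming simplified stand-in types for the Volume and Engine CRs (the function name, the string states, and the field layout here are illustrative, not the actual longhorn-manager code):

```go
package sketch

// Stand-in types for the Volume and Engine CRs; the real ones live in
// longhorn-manager's API package.
type Volume struct {
	Spec   struct{ NodeID string }
	Status struct{ CurrentNodeID string }
}

type Engine struct {
	Status struct{ CurrentState string }
}

// syncAttachmentState illustrates the rule above: v.Status.CurrentNodeID is
// only set once the volume is fully attached and only cleared once it is
// fully detached; in-flight transitions are driven purely by v.Spec.NodeID.
func syncAttachmentState(v *Volume, e *Engine) {
	switch {
	case v.Spec.NodeID != "" && e.Status.CurrentState == "running":
		v.Status.CurrentNodeID = v.Spec.NodeID // fully attached
	case v.Spec.NodeID == "" && e.Status.CurrentState == "stopped":
		v.Status.CurrentNodeID = "" // fully detached
	default:
		// attach/detach still in progress: leave the resting-state field alone
	}
}
```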
One alternative in my mind is that we can launch a new CRD similar to VolumeAttachment. Then all attachment/detachment-related info (e.g., whether the volume is attached by the system) would be stored in the CR status.
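For illustration only, a rough sketch of what the Go types for such a CRD could look like, with names loosely modeled on the VolumeAttachment YAML shown later in this thread (the struct and field names are assumptions, not a final API):

```go
package sketch

// AttachmentTicket is one attach request from a ticket owner (UI, CSI, or a
// system controller such as salvage/expansion/backup).
type AttachmentTicket struct {
	ID         string            `json:"id"`
	NodeID     string            `json:"nodeID"`
	Type       string            `json:"type"` // who requested the attachment, e.g. "longhorn-api"
	Parameters map[string]string `json:"parameters,omitempty"`
}

type AttachmentTicketStatus struct {
	ID        string `json:"id"`
	Satisfied bool   `json:"satisfied"`
}

// The spec collects all outstanding tickets for one volume; the status records
// whether each ticket is currently satisfied, so attach/detach bookkeeping no
// longer has to live in volume.Status fields.
type VolumeAttachmentSpec struct {
	Volume            string                      `json:"volume"`
	AttachmentTickets map[string]AttachmentTicket `json:"attachmentTickets"`
}

type VolumeAttachmentStatus struct {
	AttachmentTicketStatuses map[string]AttachmentTicketStatus `json:"attachmentTicketStatuses"`
}
```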
Leaving a note here about a race condition that I think should be solved after doing this refactoring:
- Initially, the problematic volume was on node us1sxlxk80262 in the user's cluster.
- At 2021-11-09T23:30:40.164647072Z, all replicas of the volume failed (possibly due to a network issue):
2021-11-09T23:30:40.164647072Z time="2021-11-09T23:30:40Z" level=info msg="All replicas are failed, set engine salvageRequested to true" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=us1sxlxk80262 owner=us1sxlxk80262 state=detached volume=pvc-a40ce347-397d-4e4a-8247-17b9c34733d4
- Longhorn set remountRequested:
2021-11-09T23:30:40.164678769Z time="2021-11-09T23:30:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-a40ce347-397d-4e4a-8247-17b9c34733d4\", UID:\"8a1d05ac-0a70-4665-8591-a23ae599f21a\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"361269172\", FieldPath:\"\"}): type: 'Normal' reason: 'Remount' Volume pvc-a40ce347-397d-4e4a-8247-17b9c34733d4 requested remount at 2021-11-09T23:30:40Z"
Longhorn then tries to delete the workload pod so Kubernetes can restart it.
- In a normal pod deletion process, the CSI plugin clears volume.Spec.NodeID. Longhorn tries to reattach the volume because pendingNodeID was set, then detaches the volume because volume.Spec.NodeID is cleared.
- I am guessing that in the middle of this process, the CSI plugin sent an attach call:
2021-11-09T23:30:58.673536017Z time="2021-11-09T23:30:58Z" level=info msg="Volume pvc-a40ce347-397d-4e4a-8247-17b9c34733d4 attachment to us1sxlxk80255 with disableFrontend false requested"
- The end result is volume.Spec.NodeID=us1sxlxk80255 but volume.Status.CurrentNodeID=us1sxlxk80262. This is an error state from which the volume controller can never recover:
2021-11-10T18:32:13.493947054Z time="2021-11-10T18:32:13Z" level=warning msg="Dropping Longhorn volume longhorn-system/pvc-a40ce347-397d-4e4a-8247-17b9c34733d4 out of the queue" controller=longhorn-volume error="fail to sync longhorn-system/pvc-a40ce347-397d-4e4a-8247-17b9c34733d4: fail to reconcile volume state for pvc-a40ce347-397d-4e4a-8247-17b9c34733d4: volume pvc-a40ce347-397d-4e4a-8247-17b9c34733d4 has already attached to node us1sxlxk80262, but asked to attach to node us1sxlxk80255" node=us1sxlxk80255
See the code https://github.com/longhorn/longhorn-manager/blob/ac065a8deb96193ffec142bcc7e7345615b57b8d/controller/volume_controller.go#L869
Moving to the next sprint because @joshimoo is working on the mTLS implementation instead.
@keithalucas has some ideas about whether we can come up with a workaround for this as a short-term solution.
I don't know if there is a scenario in which we generate the error here while we are actually trying to start the Longhorn engine on the new node and the volume is genuinely still attached to the node in v.Status.CurrentNodeID. We are seeing it after every node in a cluster is restarted, where CurrentNodeID was set before the reboot.
Given the problem that Phan described above, and an issue with the same "already attached to node" log message from a customer that reboots all the nodes in a cluster, I am curious to see whether we can continue with the ReconcileVolumeState function instead of returning an error. This could fix some of the issues before we refactor things.
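To make the stuck state and the proposed short-term change concrete, here is a rough sketch using simplified stand-in types; it is not the actual ReconcileVolumeState code, only the shape of the guard that produces the "already attached" error above and the more lenient handling being discussed:

```go
package sketch

import "fmt"

type volume struct {
	Name   string
	Spec   struct{ NodeID string }
	Status struct{ CurrentNodeID string }
}

// Current behavior (roughly): the mismatch is treated as a hard error, the
// item is eventually dropped from the work queue, and the volume never
// recovers on its own.
func reconcileStrict(v *volume) error {
	if v.Status.CurrentNodeID != "" && v.Spec.NodeID != "" && v.Spec.NodeID != v.Status.CurrentNodeID {
		return fmt.Errorf("volume %v has already attached to node %v, but asked to attach to node %v",
			v.Name, v.Status.CurrentNodeID, v.Spec.NodeID)
	}
	// ... proceed with attaching to v.Spec.NodeID ...
	return nil
}

// Possible short-term workaround discussed above (an assumption, not merged
// code): treat the mismatch as "detach from the stale node first" and keep
// reconciling instead of returning an error.
func reconcileLenient(v *volume) error {
	if v.Status.CurrentNodeID != "" && v.Spec.NodeID != "" && v.Spec.NodeID != v.Status.CurrentNodeID {
		// request detachment from v.Status.CurrentNodeID here, then fall
		// through so a later reconcile can attach to v.Spec.NodeID
	}
	// ... continue with the rest of ReconcileVolumeState ...
	return nil
}
```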
Pre Ready-For-Testing Checklist
- [x] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at: https://github.com/longhorn/longhorn/issues/2527#issuecomment-1568934161
- [x] ~~Is there a workaround for the issue? If so, where is it documented? The workaround is at:~~
- [x] Does the PR include the explanation for the fix or the feature?
- [x] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at: https://github.com/longhorn/longhorn/pull/5940
- [x] Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)? The PR is at https://github.com/longhorn/longhorn-manager/pull/1541
- [x] Which areas/issues this PR might have potential impacts on? Area: Attach/Detach operations. Issues:
- [x] ~~If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted? The LEP PR is at~~
- [x] ~~If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)? The UI issue/PR is at~~
- [x] ~~If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)? The documentation issue/PR is at~~
- [x] ~~If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)? The automation skeleton PR is at The automation test case PR is at The issue of automation test case implementation is at (please create by the template)~~
- [x] ~~If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)? The engine automation PR is at~~
- [x] ~~If labeled: require/manual-test-plan Has the manual test plan been documented? The updated manual test plan is at~~
- [x] ~~If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at~~
Test Plan:
Crash volumes
- Create a deployment using a Longhorn volume
- Write some data to the Longhorn volume's mount point and calculate the checksum
- Crash all replicas of the volume
- Observe that the volume is auto-salvaged and auto-reattached
- Observe that Longhorn kills the workload pod to remount the volume
- Verify that eventually the workload pod becomes running, read/write to the Longhorn volume is OK, and the checksum matches
Crash all nodes
- Create a deployment and a StatefulSet using Longhorn volumes (around 10 Longhorn volumes in total)
- Reboot all nodes in the cluster at around the same time
- Observe that the volumes are auto-salvaged and auto-reattached
- Observe that Longhorn kills the workload pods to remount the volumes
- Verify that eventually the workload pods become running
Hacky test that verifies that the volume reconciliation loop is not stuck even if volume.spec.nodeID != volume.status.currentNodeID
- Create a Longhorn volume
- Use the Longhorn UI to attach the volume to node-1. Wait for the volume to become healthy
- Edit the corresponding VolumeAttachment object to add one more attachment ticket with attachmentID: manual-ticket that selects node-2
- Observe that the manual-ticket ticket cannot be satisfied because the longhorn-ui ticket already requests that the volume be attached to node-1
- kubectl edit <VOLUME-NAME> -n longhorn-system to change volume.spec.nodeID from node-1 to node-2
- Verify that the volume is detached from node-1 and attached to node-2. This means the volume is not stuck even if volume.spec.nodeID != volume.status.currentNodeID
Verified pass on longhorn master (longhorn c81ddd, longhorn-manager 7fa846) with the test steps:
- [x] Crash volumes: after all replicas crashed, the volume was auto-salvaged and auto-reattached, and the pod data was intact.
- [x] Crash all nodes: after all rebooted nodes became ready, the volumes were auto-salvaged and auto-reattached, and all pods' data was intact.
- [x] Hacky test that verifies that the volume reconciliation loop is not stuck even if volume.spec.nodeID != volume.status.currentNodeID: following the steps to add an attachmentTickets entry with attachmentID: manual-ticket, the Satisfied status can be seen to be false.
VolumeAttachment before modifying volume.spec.nodeID:
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
  creationTimestamp: "2023-06-02T09:36:14Z"
  finalizers:
  - longhorn.io
  generation: 3
  labels:
    longhornvolume: vol1
  name: vol1
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: vol1
    uid: 81c91ab3-8f25-4f45-9bd9-3244fb8e7cf1
  resourceVersion: "11665"
  uid: d9e1bc35-cfb8-4208-a3f7-9fd3f127e0ed
spec:
  attachmentTickets:
    "":
      generation: 0
      id: ""
      nodeID: ip-172-31-18-66
      parameters:
        disableFrontend: "false"
        lastAttachedBy: ""
      type: longhorn-api
    manual-ticket:
      generation: 1
      id: manual-ticket
      nodeID: ip-172-31-19-6
      parameters:
        disableFrontend: "false"
        lastAttachedBy: ""
      type: longhorn-api
  volume: vol1
status:
  attachmentTicketStatuses:
    "":
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2023-06-02T09:36:25Z"
        message: ""
        reason: ""
        status: "True"
        type: Satisfied
      generation: 0
      id: ""
      satisfied: true
    manual-ticket:
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2023-06-02T09:37:27Z"
        message: 'the volume is currently attached to different node ip-172-31-18-66 '
        reason: ""
        status: "False"
        type: Satisfied
      generation: 1
      id: manual-ticket
      satisfied: false
After modifying volume.spec.nodeID to node 2 (ip-172-31-19-6), the Satisfied status of manual-ticket becomes true and the volume is attached to node 2 and healthy.
VolumeAttachment after modifying volume.spec.nodeID:
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
  creationTimestamp: "2023-06-02T09:36:14Z"
  finalizers:
  - longhorn.io
  generation: 3
  labels:
    longhornvolume: vol1
  name: vol1
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: vol1
    uid: 81c91ab3-8f25-4f45-9bd9-3244fb8e7cf1
  resourceVersion: "12439"
  uid: d9e1bc35-cfb8-4208-a3f7-9fd3f127e0ed
spec:
  attachmentTickets:
    "":
      generation: 0
      id: ""
      nodeID: ip-172-31-18-66
      parameters:
        disableFrontend: "false"
        lastAttachedBy: ""
      type: longhorn-api
    manual-ticket:
      generation: 1
      id: manual-ticket
      nodeID: ip-172-31-19-6
      parameters:
        disableFrontend: "false"
        lastAttachedBy: ""
      type: longhorn-api
  volume: vol1
status:
  attachmentTicketStatuses:
    "":
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2023-06-02T09:42:43Z"
        message: 'the volume is currently attached to different node ip-172-31-19-6 '
        reason: ""
        status: "False"
        type: Satisfied
      generation: 0
      id: ""
      satisfied: false
    manual-ticket:
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2023-06-02T09:42:49Z"
        message: ""
        reason: ""
        status: "True"
        type: Satisfied
      generation: 1
      id: manual-ticket
      satisfied: true