
CSI: volume has exhausted its available writer claims


Nomad version

1.4.1-ent

Issue

We're still running into the issue of jobs getting stuck pending while waiting for a volume claim to be released, as described here: https://github.com/hashicorp/nomad/issues/12346#issuecomment-1135059394. The trigger seems to be related to https://github.com/hashicorp/nomad/issues/8336: multiple jobs were stopped and started at the same time, and the EBS controller threw the error "Attachment point /dev/xvdba is already in use". The allocations ended up in a failed state, and subsequent job stops and starts now incorrectly report that there's already a claim.

I've checked the AWS side, but the volumes are not actually attached to any instances, nor does nomad volume status report any associated allocs (possibly due to GC). After stopping and purging the respective job, the volumes that failed to be claimed look like this:

ID                 Name               Plugin ID  Schedulable  Access Mode
zookeeper[0]       zookeeper[0]       aws-ebs    true         single-node-writer
zookeeper[1]       zookeeper[1]       aws-ebs    true         single-node-writer
zookeeper[2]       zookeeper[2]       aws-ebs    true         <none>

The first two volumes throw claim errors and still have "Access Mode" set to single-node-writer, yet a volume status reports "No allocations placed". At some point, after multiple job stops and starts, running the job fails to be placed with:

Constraint "CSI volume zookeeper[0] has exhausted its available writer claims and is claimed by a garbage collected allocation 138a0fe3-8f9a-6176-014a-d46b7ee37249; waiting for claim to be released

Reproduction steps

Concurrently spin up multiple allocations to trigger issue #8336, then stop and start the jobs that failed.

ygersie avatar Nov 29 '22 12:11 ygersie

Thanks for the report @ygersie.

We will need some more time to investigate this as it's not immediately clear to me what's going on. I've added this to our backlog so we can triage it more.

lgfa29 avatar Dec 01 '22 01:12 lgfa29

@lgfa29 thank you. For now I've reduced the controller count from 2 to 1, since the NewDevice method has a mutex that seemingly prevents this issue from occurring. Either way, Nomad should release the volume after the ControllerPublishVolume call fails with an error, which it doesn't seem to do.
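
For reference, pinning the controller to a single instance is just the count on the plugin job; a rough sketch, with the image tag and plugin arguments assumed rather than copied from our actual job:

    job "plugin-aws-ebs-controller" {
      group "controller" {
        # A single controller keeps device-name assignment behind the
        # plugin's own mutex.
        count = 1

        task "plugin" {
          driver = "docker"

          config {
            image = "public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.19.0" # assumed tag
            args = [
              "controller",
              "--endpoint=unix:///csi/csi.sock",
              "--logtostderr",
              "--v=5",
            ]
          }

          csi_plugin {
            id        = "aws-ebs"
            type      = "controller"
            mount_dir = "/csi"
          }
        }
      }
    }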

ygersie avatar Dec 01 '22 14:12 ygersie

Hey, I observed the same issue on Nomad OSS 1.4.3 with the Linode CSI driver.

One workaround that lets me recover from this without re-registering the volume:

I lowered csi_volume_claim_gc_threshold to 3 minutes, paired with a restart block of:

    restart {
      delay    = "3m"
      mode     = "delay"
      attempts = 1
      interval = "6m"
    }
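
The GC threshold itself is a Nomad server agent setting; a minimal sketch of where it lives (only the value comes from my setup, the surrounding block is the standard server stanza):

    server {
      enabled = true

      # Reap stale CSI volume claims after 3 minutes instead of the default 5m.
      csi_volume_claim_gc_threshold = "3m"
    }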

Otherwise, fully stopping the task and running "nomad system gc" also seems to do the trick.

Xuxe avatar Jan 03 '23 09:01 Xuxe

Still running into this issue, now after the 1.5.1 upgrade. In one of our clusters I have an alloc that fails to be scheduled due to:

Constraint "CSI volume vmstorage-a[9] has exhausted its available writer claims and is claimed by a garbage collected allocation dfc38c1f-2555-0fcd-cb89-fe3c8be7a63e; waiting for claim to be released": 17 nodes excluded by filter

I checked the code, and the volumes_watcher routine should handle releasing the active claim, but this isn't happening. Looking at the current Nomad leader's logs, nothing related to this showed up at all, so I decided to trigger a leader failover by shutting down the active leader. The new leader is now spamming the logs hard with the following:

    2023-03-20T11:42:24.564Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.569Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.574Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.579Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.584Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c

Trying a force detach never works in any of these scenarios either:

$ nomad volume detach vmstorage-a[9] ba33df3d-a1cc-6db3-9986-ede682b6c94c
Error detaching volume: Unexpected response code: 500 (rpc error: Unknown allocation)

So I'm again stuck with many allocations failing to be scheduled, and the only way out of this mess is to force re-register the volume. One of our clusters has a couple dozen failed allocs, and we have to go through each job to figure out which volumes are stuck.
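
For completeness, the forced re-registration is a "nomad volume deregister -force" followed by a "nomad volume register" against a volume spec file; a rough sketch of such a spec, with the external ID and capability values as placeholders:

    # volume.hcl -- external_id and capability values are illustrative
    id          = "vmstorage-a[9]"
    name        = "vmstorage-a[9]"
    type        = "csi"
    plugin_id   = "aws-ebs"
    external_id = "vol-0123456789abcdef0"

    capability {
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }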

Just for additional info, my upgrade procedure for a cluster is as follows:

  • Upgrade the servers one by one.
  • Drain each client one by one, leaving the system jobs running, and wait for each node to be successfully replaced. Only when the plugins on the replacement node are healthy do we sleep for another minute before moving on to the next instance. We have to drain the nodes to prevent zombie containers being created by another issue related to CSI plugins (a race on startup), which makes another gigantic mess of the volume and other resource claims.

We only run a single instance of the CSI controller, since it looks like the EBS controller was never built to run in HA mode; see my previous comment above.

ygersie avatar Mar 20 '23 12:03 ygersie

I wonder if the problems here are caused by the CSI driver being unavailable the moment a drain occurs. I run only a single instance of the EBS CSI controller, as mentioned previously. When I drain the node that runs this controller there's a window of unavailability, meaning volumes for jobs running on that box can't be released until the controller is back (it should be back within a minute on another node, though).

Just to be sure, I scripted the draining: the script checks whether the CSI controller is running on the node about to be drained, marks the node ineligible, and stops the controller allocation so it moves somewhere else before proceeding with the drain. The last test I did, albeit on a smaller cluster, seems to have gone fine using this sequence.

ygersie avatar Mar 24 '23 11:03 ygersie

Sorry, it's been a bit since I've had a chance to review this and I'm ramping up to help @gulducat work on the other issue you have open around restarts.

I run only a single instance of the EBS CSI controller, as mentioned previously. When I drain the node that runs this controller there's a window of unavailability, meaning volumes for jobs running on that box can't be released until the controller is back (it should be back within a minute on another node, though).

That's expected behavior and there's very little Nomad can do about this. The EBS CSI driver is frankly buggy if it can't run more than one instance at a time. We handle this as gracefully as we can by having the claim GC process. We call this out in the CSI Plugins concept docs:

Controller plugins can create and attach volumes anywhere they can communicate with the storage provider's API, so they can usually be run as service jobs. You should always run more than one controller plugin allocation for high availability.

tgross avatar Mar 24 '23 12:03 tgross

@tgross the EBS controller can't run multiple instances because it assigns a device name. If you run a single controller that's not an issue, as the device-name assignment is guarded by a mutex, but when you run multiple controllers you end up with conflicting device names when starting multiple tasks at once for the same node.

I understand there's a GC process as a trade-off, but the issue is that it wasn't working. See the logs I posted above:

    2023-03-20T11:42:24.564Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

The claims were never released. At first the Nomad leader wasn't logging anything, so I forced a leader failover by shutting down the active leader, and then the new leader started spamming the logs with the above message. The volumes_watcher routine wasn't able to release the claim.

ygersie avatar Mar 24 '23 15:03 ygersie

the EBS controller can't run multiple instances because it assigns a device name. If you run a single controller that's not an issue, as the device-name assignment is guarded by a mutex, but when you run multiple controllers you end up with conflicting device names when starting multiple tasks at once for the same node.

That makes sense. Apparently the EBS controller supports some kind of leader election (ref Client Restarts) but I can't find any documentation for it.

I understand there's now a GC happening as a trade-off but the issue is, it wasn't working.

Understood. I suspect what we've got here is a case where Nomad has multiple RPCs in flight to the controller for the same volume. We're not supposed to be doing that either (it's a "should" in the spec, not a "must"), but it's clear that at least a few plugins don't tolerate it (even though it's a "must" for them). I've got some thoughts I'm putting together around running volume claim updates through the eval broker, which might help with this whole class of problem.

tgross avatar Mar 28 '23 17:03 tgross

Ok, thanks for your patience all! I'm picking this back up and I did some digging into the EBS plugin's behavior and design. From their design docs:

Chooses the right device name for the volume on the node (more on that below) and issues AttachVolume. TODO: this has complicated idempotency expectations. It cancels previously called ControllerUnpublishVolume that may be still in progress (i.e. AWS is still detaching the volume and Kubernetes now wants the volume to be attached back).

"It's complicated" isn't exactly encouraging. On further investigation I discovered that the EBS plugin does require leader election, and that it leans on k8s-specific sidecar containers to do so (ref controller.yaml#L137). The underlying problem is challenging within the boundaries of the CSI spec, but now we have to compensate for the unfortunate choice of the plugins to lean on k8s-specific APIs.

My next pass at this will be twofold:

  • Serialize the controller publish/unpublish RPCs on a per-plugin basis. This is frankly not ideal, as it's going to impact allocation startup performance quite a bit for jobs that spin up a lot of allocations using the same plugin but different volumes. We might be able to get away with making the serialization key the plugin ID + client ID, which would be more fine-grained, but I need to do more testing.
  • Use a simple bully selection (lowest client ID) so that we generally send these RPCs to the same controller instance each time, unless it's unavailable.

I have a working branch at csi-concurrent-requests, but haven't done more than some simple smoke testing of that yet. I'm not entirely convinced that concurrent requests are the only problem we're facing here. There may also be some interaction with https://github.com/hashicorp/nomad/issues/17756 and I want to pull on that thread as well. Will update here (and potentially in #17756) as I know more.

tgross avatar Jul 17 '23 21:07 tgross

Draft PR is up here: https://github.com/hashicorp/nomad/pull/17996. I've tested it out with AWS EBS and it seems to make things much more stable.

tgross avatar Jul 19 '23 21:07 tgross

I've just merged #17996 with the fix. It looks like we're going to have a 1.6.1 fairly soon because of another open issue, along with backports, so expect to see this roll out shortly.

tgross avatar Jul 20 '23 18:07 tgross

Awesome stuff @tgross, really appreciate you taking another stab at this!

ygersie avatar Jul 20 '23 22:07 ygersie