
Provisioner fails with "error syncing claim: node not found" after "final error received, removing pvc"

Open judemars opened this issue 1 year ago • 20 comments

This is a follow-on from https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/issues/121.

We are still seeing "error syncing claim: node not found" together with "final error received, removing pvc x from claims in progress". @songsunny made a fix for this; however, the fix does not distinguish between final and non-final errors, so it can cause the PD to leak in this scenario: https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/pull/139#discussion_r1131778146

judemars avatar Aug 15 '23 04:08 judemars

/sig storage
/kind bug

sunnylovestiramisu avatar Aug 16 '23 21:08 sunnylovestiramisu

Let's continue the conversation here. @msau42 @jsafrane

If the provisioning has started, shouldn't the state have changed to ProvisioningInBackground?

If it is ProvisioningInBackground and the node is not found, we do not go down this code path: return ctrl.provisionVolumeErrorHandling(ctx, ProvisioningReschedule, err, claim, operation)

Instead we continue to the code:

err = fmt.Errorf("failed to get target node: %v", err)
ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningFailed", err.Error())
return ProvisioningNoChange, err
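
For reference, a minimal standalone sketch of the branch being described, with simplified stand-ins for the ProvisioningState values and without the controller's actual plumbing (illustrative only, not the real code):

```go
package sketch

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// Simplified stand-ins for the states mentioned in this thread.
type ProvisioningState string

const (
	ProvisioningNoChange     ProvisioningState = "NoChange"
	ProvisioningReschedule   ProvisioningState = "Reschedule"
	ProvisioningInBackground ProvisioningState = "Background"
)

// decideOnNodeGetError sketches the branch above: a "node not found" error
// only leads to rescheduling when provisioning has not started yet; once the
// claim is ProvisioningInBackground we fall through to ProvisioningNoChange.
func decideOnNodeGetError(getNodeErr error, current ProvisioningState) (ProvisioningState, error) {
	if apierrors.IsNotFound(getNodeErr) && current != ProvisioningInBackground {
		// Nothing has been created on the backend yet; it is safe to clear
		// the selected node and let the scheduler pick a new one.
		return ProvisioningReschedule, getNodeErr
	}
	// Provisioning may already be running on the backend: keep the claim and
	// retry without rescheduling.
	return ProvisioningNoChange, fmt.Errorf("failed to get target node: %v", getNodeErr)
}
```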

sunnylovestiramisu avatar Aug 22 '23 22:08 sunnylovestiramisu

@sunnylovestiramisu that will work only when we expect the missing node to come back online. Is this the case we're trying to solve here? Because if the node was permanently deleted, then the provisioner will end up in an endless loop.

To fix the issue for all possible library users, the library should save the node somewhere (a PVC annotation?!) and give it to the provisioner, so it can retry until it gets a final success / error without getting the Node from the API server.

But storing a whole node in PVC annotations is really ugly. The CSI provisioner would need just the node name from it, to get the CSINode and the driver's topology labels*. So... should there be an extra call, say ExtractNodeInfo, where the provisioner would get whatever it wants from the SelectedNode (and perhaps even the CSINode) and return a shorter string that the library would save on the PVC? Then the library could give that string to every Provision() call until a final success / error. This is a change of the library API. Is it worth the effort?

*) There is another question: what should happen when the CSINode is missing here: https://github.com/kubernetes-csi/external-provisioner/blob/3739170578f68aaf0594f631cae6d270bbfdc83e/pkg/controller/topology.go#L273
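
A rough sketch of what such an optional hook could look like (NodeInfoExtractor, ExtractNodeInfo, and the annotation key are hypothetical names, not part of the library today):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// Hypothetical annotation key where the library would persist the distilled
// node info on the claim.
const annSelectedNodeInfo = "volume.kubernetes.io/selected-node-info"

// NodeInfoExtractor is a hypothetical optional extension of the provisioner API.
type NodeInfoExtractor interface {
	// ExtractNodeInfo distills whatever the provisioner needs from the
	// SelectedNode (for CSI, likely just the node name used to look up the
	// CSINode and the driver's topology labels) into a short string.
	ExtractNodeInfo(node *v1.Node) (string, error)
}

// rememberNodeInfo shows how the library could call the hook on the first
// Provision attempt and stash the result on the PVC, so later retries can
// proceed even after the Node object disappears.
func rememberNodeInfo(claim *v1.PersistentVolumeClaim, e NodeInfoExtractor, node *v1.Node) error {
	info, err := e.ExtractNodeInfo(node)
	if err != nil {
		return err
	}
	if claim.Annotations == nil {
		claim.Annotations = map[string]string{}
	}
	claim.Annotations[annSelectedNodeInfo] = info
	// The library would then persist this claim update via the API server and
	// hand the string to every subsequent Provision() call.
	return nil
}
```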

jsafrane avatar Aug 23 '23 11:08 jsafrane

@jsafrane we are trying to solve the case where provisioning already started on the backend but the node got deleted afterwards.

sunnylovestiramisu avatar Aug 23 '23 15:08 sunnylovestiramisu

The question is: do you expect the node to come back, and should we optimize for that case? That's manageable.

jsafrane avatar Aug 24 '23 10:08 jsafrane

If a node gets deleted, the exact same node will not come back, but a node with the same node name and a different UID may. Or there may be other cases that I'm missing.

sunnylovestiramisu avatar Aug 24 '23 18:08 sunnylovestiramisu

There are 2 main reasons why a Node object may disappear:

  1. The node got autoscaled down. In this case the node is not expected to come back.
  2. The node got preempted and recreated. The node may come back, and often it comes back quickly.

The original motivation for https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/issues/121 was for the autoscaling case where the node never comes back.

msau42 avatar Aug 24 '23 19:08 msau42

The original motivation for https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/issues/121 was for the autoscaling case where the node never comes back

That's the hard case. The provisioner lib then needs to store the whole node somewhere, so it can reconstruct it and give it to the provisioner in the next Provision() calls, until it succeeds / fails with a final error.

Since storing the whole node is ugly, we could extend the provisioner API, so the provisioner can distill a shorter string from the Node and then the library would store it (in PVC annotations?) and give it to the provisioner in all subsequent Provision calls.

jsafrane avatar Aug 25 '23 08:08 jsafrane

What happens if a node with the exact name but different uid comes back? Does this shorter string provide enough information for the provisioner to distinguish the node and decide what action to take?

sunnylovestiramisu avatar Aug 25 '23 17:08 sunnylovestiramisu

The question is whether the selectedNode annotation is still valid when someone deletes a node and creates a new one with the same name. Assuming yes, i.e. the new node is still the right one, then I think the shorter string could be used only as a fallback, for when the Node object is not in the API server (or in the informer).
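
A minimal sketch of that fallback, reusing the hypothetical annotation key from the earlier sketch and with simplified error handling:

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeInfoForRetry prefers the live Node object and only falls back to the
// info stored on the claim when the Node is no longer in the API server.
func nodeInfoForRetry(ctx context.Context, c kubernetes.Interface, claim *v1.PersistentVolumeClaim, nodeName string) (string, error) {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err == nil {
		// The node (possibly recreated with the same name) still exists;
		// treat it as the right one and use it directly.
		return node.Name, nil
	}
	if apierrors.IsNotFound(err) {
		// Node is gone: fall back to whatever was distilled and stored on the
		// PVC when provisioning started, so Provision() can keep retrying
		// until a final success or error.
		if info, ok := claim.Annotations["volume.kubernetes.io/selected-node-info"]; ok {
			return info, nil
		}
	}
	return "", err
}
```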

jsafrane avatar Aug 28 '23 11:08 jsafrane

But how do we know whether a node will come back or not? How can we tell from the symptoms whether it will be recreated? Let's say we store the nodeName in annotations.

  1. The node got autoscaled down. In this case the node is not expected to come back. <-- We will always see apierrs.IsNotFound(err) return true, and we still have the nodeName in annotations. What do we do next?

  2. The node got preempted and recreated. The node may come back, and often it comes back quickly. <-- We will see apierrs.IsNotFound(err) return true for a short period of time, and we still have the nodeName in annotations. Do we use that nodeName to skip ctrl.provisionVolumeErrorHandling(ctx, ProvisioningReschedule, err, claim, operation)?

sunnylovestiramisu avatar Aug 28 '23 18:08 sunnylovestiramisu

I don't think it matters whether the node will come back or not. What we return depends on whether the operation returned a final or non-final error. Whether or not the node exists determines how we retry in the non-final case.

And so the question for the non-final case is: should we retry the CreateVolume() call with the old node (Jan's proposal), or get a new node from the scheduler (what the fix in #139 was trying to do)?
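
To make the options concrete, a tiny standalone sketch of the decision being weighed (the names and the isFinal classification are illustrative, not the provisioner's actual code):

```go
package sketch

// retryAction enumerates the outcomes discussed in this thread.
type retryAction int

const (
	removeClaimAndReschedule retryAction = iota // final error: nothing will appear on the backend
	retryWithSameNode                           // non-final error, node still exists
	retryWithOldNode                            // non-final error, node gone (Jan's proposal)
	retryWithNewNode                            // non-final error, node gone (the #139 approach)
)

// nextAction captures the decision described above: error finality decides
// whether we clean up, node existence decides how we retry, and preferOldNode
// is the open design question.
func nextAction(isFinal, nodeExists, preferOldNode bool) retryAction {
	if isFinal {
		return removeClaimAndReschedule
	}
	if nodeExists {
		return retryWithSameNode
	}
	if preferOldNode {
		return retryWithOldNode
	}
	return retryWithNewNode
}
```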

I think the challenge with getting a new node is what happens if the topology also changes? Then a fix would involve:

  • The driver being able to handle a change in requested topology for an existing operation
  • provisioner being able to detect that the returned topology doesn't match the request and coordinating a deletion + retry

msau42 avatar Aug 28 '23 21:08 msau42

And so the question is for the non-final case, should we retry the CreateVolume() call with the old node (Jan's proposal), or get a new node from the scheduler (What the fix in https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/pull/139 was trying to do).

What if there is no node? Someone has deleted the Pod and PVC and provisioning is in progress. The provisioning should continue and eventually succeed or fail, but it needs some node to continue.

jsafrane avatar Sep 06 '23 13:09 jsafrane

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 27 '24 18:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 26 '24 19:02 k8s-triage-robot

/remove-lifecycle rotten

msau42 avatar Feb 27 '24 00:02 msau42