
provisioner doesn't like when nodes go away, VolumeFailedDelete

tedder opened this issue 6 years ago • 5 comments

I'm running on some bare metal servers, and if one of them goes away (effectively permanently), PVs and PVCs don't get reaped, so the pods (created as a statefulset) can't recover.

This may be okay. Let me know if you'd like a reproducible example, or if it's a conceptual thing.

Here's an example I just ran across. It's a StatefulSet of Elasticsearch pods. Having persistent data is great, but if a server goes away, the pod just sits in purgatory.

$ kubectl describe pod esdata-4 -n kube-logging
...
Status:               Pending
...
    Mounts:
      /var/lib/elasticsearch from esdata-data (rw,path="esdata_data")
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  esdata-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  esdata-data-esdata-4
    ReadOnly:   false
...
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  3m43s (x45 over 39m)  default-scheduler  0/5 nodes are available: 5 node(s) had volume node affinity conflict.


$ kubectl describe pv pvc-3628fa90-9e11-11e9-83ca-d4bed9ad776a
...
Annotations:       pv.kubernetes.io/provisioned-by: rancher.io/local-path
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      local-path
Status:            Released
Claim:             kube-logging/esdata-data-esdata-4
Reclaim Policy:    Delete
Node Affinity:     
  Required Terms:  
    Term 0:        kubernetes.io/hostname in [DEADHOST]
...
Events:
  Type     Reason              Age                  From                                                                                               Message
  ----     ------              ----                 ----                                                                                               -------
  Warning  VolumeFailedDelete  37s (x4 over 2m57s)  rancher.io/local-path_local-path-provisioner-f7986dc46-cg8nl_7ff46af3-9e4f-11e9-a883-fa15f9dfdfe0  failed to delete volume pvc-3628fa90-9e11-11e9-83ca-d4bed9ad776a: failed to delete volume pvc-3628fa90-9e11-11e9-83ca-d4bed9ad776a: pods "delete-pvc-3628fa90-9e11-11e9-83ca-d4bed9ad776a" not found

I can delete it manually with kubectl delete pv [pvid]. I then have to recreate the PV and PVC, also manually, before the pod is happy. I assumed there'd be a timeout for reaping PVCs from dead nodes.
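For the record, the cleanup boils down to something like this (the PV name is the one from the describe output above, and the jsonpath line is just one way to spot any other PVs still pinned to the dead node):

$ # list each PV together with the node its affinity points at
$ kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}{end}' | grep DEADHOST

$ # the Released PV can be deleted directly; the provisioner's own cleanup
$ # job will never succeed because the node it targets no longer exists
$ kubectl delete pv pvc-3628fa90-9e11-11e9-83ca-d4bed9ad776a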

cc @tamsky in case he's come across this, as I see he's been around this repo.

tedder · Jul 06 '19 01:07

Hi, we are struggling with the same problem.

How can I remove a specific node when I deploy StatefulSets with this provisioner? Also, scaling our databases down from 3 replicas to 1 is not possible: I can only scale down to the node where the pod named *-0 runs. Is there a way to manually move PV(C)s to other nodes? How do you deal with node removal?

landorg · Sep 10 '20 14:09

Moving a PVC is not possible with this provisioner, since the whole point of the local-path provisioner is that the storage always stays on one node.

After the node is removed, you need to scale down the workload, delete the related PVC and PV, then scale it back up to create a replacement PVC/PV. That assumes your application can rebuild the replica that was lost with the node.
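Roughly, using the StatefulSet and volume names from this issue as stand-ins (an esdata StatefulSet with 5 replicas in kube-logging; substitute your own):

$ # stop the workload so nothing references the claim anymore
$ kubectl scale statefulset esdata -n kube-logging --replicas=0

$ # drop the claim and the volume that are pinned to the lost node
$ kubectl delete pvc esdata-data-esdata-4 -n kube-logging
$ kubectl delete pv pvc-3628fa90-9e11-11e9-83ca-d4bed9ad776a

$ # scale back up; the StatefulSet recreates the missing PVC, the provisioner
$ # provisions a fresh volume on a live node, and the application rebuilds
$ # the replica from its peers
$ kubectl scale statefulset esdata -n kube-logging --replicas=5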

If you want to move the data manually, you can copy the data out, recreate the PVC and PV with the same names, use another pod to copy the data back, and finally start the workload. But this provisioner is not designed to be used that way: it mostly relies on the application's own replication to keep the data distributed across nodes, and doesn't provide that function itself.
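If you do go that route, the hand-made PV/PVC pair might look roughly like the sketch below. Treat it purely as an illustration: the PV name, the 50Gi size, the target node, and the /opt/local-path-provisioner/... path are assumptions that have to match what is actually on your node.

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: esdata-data-esdata-4-manual          # hypothetical PV name
spec:
  capacity:
    storage: 50Gi                            # match the size of the original claim
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain      # safer default for a hand-made volume
  storageClassName: local-path
  hostPath:
    path: /opt/local-path-provisioner/esdata_data   # assumed data directory on the node
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - NEWHOST                          # the node that now holds the data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: esdata-data-esdata-4                 # must match <claim-template>-<statefulset>-<ordinal>
  namespace: kube-logging
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 50Gi
  volumeName: esdata-data-esdata-4-manual    # pre-bind the claim to the PV above
EOF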

If you want a solution that provides highly available persistent storage for Kubernetes, you can take a look at Longhorn.

yasker · Sep 10 '20 22:09

Thanks for the clarification @yasker. I understand that this relies on the application to replicate the data, and that makes sense for our use case. We're using it for our distributed databases. It would just be nice to have a little more flexibility.

If I understood everything correctly, to scale down to a specific node I could (rough commands for the label/selector step after this list):

  • scale down to 0
  • copy the data out of the volume(s)
  • delete the pvcs
  • set node labels & node selector so that the only statefulset is scheduled to my preferred node
  • scale up to 1 again
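Something like this for the label/selector step (the node name and label are made up, and the StatefulSet name is the one from this issue):

$ # label the node that should keep the data
$ kubectl label node preferred-node app-data=esdata

$ # pin the StatefulSet's pods to that node
$ kubectl patch statefulset esdata -n kube-logging --type merge \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"app-data":"esdata"}}}}}'

$ # then scale as described above
$ kubectl scale statefulset esdata -n kube-logging --replicas=1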

landorg · Sep 11 '20 08:09

And you need to copy the data back once you scale up to 1. In fact, at the scale-up step you also need to:

  1. (following your steps) scale up to 1 again
    1. This step will create a PVC for us to copy data to.
  2. scale down to 0
  3. Create another pod that uses the same PVC (see the sketch after this list)
  4. Copy data into it
  5. Delete the pod but retain PVC
  6. Scale the StatefulSet back up to 1
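The pod in steps 3 to 5 can be a simple throwaway like the sketch below (the pod name, image, mount path, and local backup directory are placeholders; the PVC name follows the esdata example from this issue):

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-copy-helper            # hypothetical one-off pod
  namespace: kube-logging
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]     # keep the pod alive while copying
    volumeMounts:
    - name: data
      mountPath: /restore
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: esdata-data-esdata-0   # the PVC created at the scale-up step
EOF

$ # copy the previously saved data into the fresh volume
$ # (adjust paths so the data lands where the application expects it)
$ kubectl cp ./esdata-backup kube-logging/pvc-copy-helper:/restore -c shell

$ # remove the helper; the PVC and its data stay behind
$ kubectl delete pod pvc-copy-helper -n kube-logging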

yasker · Sep 11 '20 21:09

Hi,

Can we auto-assign each cluster node to the pods when we scale up a StatefulSet, instead of setting up a node selector?

nishit93-hub · Jun 20 '22 06:06