
Volumes stuck in Released state

zc-devs opened this issue 10 months ago · 4 comments

Environmental Info: K3s Version: v1.29.3+k3s1 Local path provisioner: v0.0.26

Node(s) CPU architecture, OS, and Version: Linux 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 20 04:52:13 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 servers, embedded etcd

Describe the bug: After deploying a new 1.29.3 cluster, I noticed that PVs are not deleted and get stuck in the Released state, even though they have persistentVolumeReclaimPolicy: Delete.

Steps To Reproduce:

  1. Install a K3s cluster.
  2. Check the Local path provisioner version:
# kubectl get deployment -n kube-system local-path-provisioner -o=jsonpath='{$.spec.template.spec.containers[:1].image}'
rancher/local-path-provisioner:v0.0.26
  3. Create test pvc.yaml.
kubectl apply -f pvc.yaml
  4. Create test pod.yaml.
kubectl apply -f pod.yaml
  5. Delete the Pod.
kubectl delete -f pod.yaml
  6. Delete the PVC.
kubectl delete -f pvc.yaml
  7. Check the volume.
# kubectl get persistentvolume | grep test-pvc
pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4   10Mi       RWO            Delete           Released   kube-system/test-pvc             local-ssd      <unset>                          17m
  8. Check the logs of the Local path provisioner (after applying a workaround from #9834): local-path-provisioner.log

Expected behavior: The persistent volume is deleted; there are no errors in the Local path provisioner and no failed helper-pod-delete-pvc-* Pods.

Actual behavior: The persistent volume is not deleted and stays stuck in the Released state; there are errors in the Local path provisioner's logs, and failed helper-pod-delete-pvc-* pods appear.

Additional context / logs: helper-pod-delete-pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4.yml

Workaround: If the Local path provisioner is downgraded to v0.0.24, PVs that were previously stuck in the Released state are automatically deleted.
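One way to apply that downgrade in place is a sketch like the following; the deployment and container names match the k3s defaults quoted above, but this is not an official procedure:

```shell
#!/bin/sh
# Sketch: pin the local-path provisioner back to v0.0.24 as a workaround.
# Deployment/container names match the k3s defaults shown in this issue.
IMAGE="rancher/local-path-provisioner:v0.0.24"
if command -v kubectl >/dev/null 2>&1; then
    kubectl -n kube-system set image deployment/local-path-provisioner \
        "local-path-provisioner=${IMAGE}" || true
fi
echo "pinned image: ${IMAGE}"
```

Note that k3s manages this deployment from a packaged manifest, so an in-place edit like this may be reverted when the server restarts.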

zc-devs · Mar 29 '24 21:03

A first change might be to introduce a new option:

  setup: |-
    #!/bin/sh
    while getopts "m:s:p:a:" opt
    do
        case $opt in
            p)
            absolutePath=$OPTARG
            ;;
            s)
            sizeInBytes=$OPTARG
            ;;
            m)
            volMode=$OPTARG
            ;;
            a)
            action=$OPTARG
            ;;
        esac
    done
    if [ "$action" = "create" ]
    then
      mkdir -m 0777 -p ${absolutePath}
      chmod 700 ${absolutePath}/..
    fi
  teardown: |-
    #!/bin/sh
    set -x
    while getopts "m:s:p:a:" opt
    do
        case $opt in
            p)
            absolutePath=$OPTARG
            ;;
            s)
            sizeInBytes=$OPTARG
            ;;
            m)
            volMode=$OPTARG
            ;;
            a)
            action=$OPTARG
            ;;
        esac
    done
    if [ "$action" = "delete" ]
    then
      rm -rf ${absolutePath}
    fi
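The option parsing above can be exercised outside a helper pod to confirm that `-a` is now accepted. A minimal standalone sketch (the sample argument values are invented):

```shell
#!/bin/sh
# Standalone sketch of the getopts loop from the setup/teardown scripts,
# so the new -a (action) option can be verified locally.
parse_args() {
    OPTIND=1   # reset getopts state between calls
    action="" absolutePath="" sizeInBytes="" volMode=""
    while getopts "m:s:p:a:" opt; do
        case $opt in
            p) absolutePath=$OPTARG ;;
            s) sizeInBytes=$OPTARG ;;
            m) volMode=$OPTARG ;;
            a) action=$OPTARG ;;
        esac
    done
    echo "action=${action} path=${absolutePath}"
}

parse_args -a delete -p /tmp/example-pv -s 10485760 -m Filesystem
# prints: action=delete path=/tmp/example-pv
```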

With that change I no longer get the Illegal option -a error, but the helper pod fails anyway. Then I tried to debug:

  teardown: |-
    #!/bin/sh
    sleep infinity
/ # ls -lah /var/lib/rancher/k3s/storage/local/ssd/pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc
ls: can't open '/var/lib/rancher/k3s/storage/local/ssd/pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc': Permission denied
total 0

/ # ls -lah /var/lib/rancher/k3s/storage/local/ssd
total 3G
drwx------   14 root     root        4.0K Mar 29 22:12 .
drwxr-xr-x    3 root     root        4.0K Mar 29 22:33 ..
drwxrwxrwx    2 root     root        4.0K Mar 29 22:13 pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc

/ # id
uid=0(root) gid=0(root) groups=0(root),10(wheel)
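The `/ #` session above can be obtained by exec'ing into the helper pod, which stays up once teardown is replaced by `sleep infinity`. A sketch, with the pod name taken from the PVC UID shown above as a placeholder:

```shell
#!/bin/sh
# Sketch: enter the sleeping helper pod for debugging. HELPER_POD is a
# placeholder name derived from the PVC UID in the listing above.
HELPER_POD="helper-pod-delete-pvc-a9809075-743a-4a86-ba28-38ca3d8256cd"
if command -v kubectl >/dev/null 2>&1; then
    kubectl -n kube-system exec -it "${HELPER_POD}" -- sh || true
fi
echo "tried exec into ${HELPER_POD}"
```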

Then I compared Pod definitions between v0.24 and v0.26: Screenshot 2024-03-30

zc-devs · Mar 29 '24 22:03

TL;DR from the above link, where I encountered this issue while testing the latest COMMIT_ID for the v1.28 branch:

$ kg pv -A

NAME            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM              STORAGECLASS   REASON   AGE
checking-path   5Gi        RWO            Recycle          Failed   default/test-pvc   local-path              50m

VestigeJ · Apr 15 '24 22:04

is this resolved already?

0xMALVEE · May 15 '24 19:05

Nope, the issue is still open and the PR is not merged. Waiting for the end of the code freeze.

brandond · May 16 '24 19:05

Hi, I've just tested #9964 with Local path config: local-storage.yaml.

While the PV is created, it cannot be deleted; the helper pod fails.

helper-pod-create-pvc-v0.24.yaml helper-pod-create-pvc-v0.26.yaml helper-pod-delete-pvc-v0.24.yaml helper-pod-delete-pvc-v0.26.yaml

I think the main difference is that v0.26 doesn't use the privileged security context flag, as I noted in https://github.com/k3s-io/k3s/issues/9833#issuecomment-2027802148. I should also mention that I use Oracle Linux 9 with SELinux enabled. Directory permissions are:

# ls -lanZ /var/lib/rancher/k3s/storage/
total 48
drwx------. 12    0    0 system_u:object_r:container_file_t:s0           4096 May 21 19:24 .
drwxr-xr-x.  6    0    0 system_u:object_r:container_var_lib_t:s0        4096 Jan 18 18:32 ..

Volume permissions (v0.24):

# ls -lanZ pvc-9391085e-05dc-42f4-9c0d-25a1b4e6fe4a_kube-system_test-pvc/
total 12
drwxrwxrwx.  2 0 0 system_u:object_r:container_file_t:s0:c131,c199 4096 May 21 19:26 .
drwx------. 12 0 0 system_u:object_r:container_file_t:s0           4096 May 21 19:24 ..
-rw-r--r--.  1 0 0 system_u:object_r:container_file_t:s0:c131,c199    5 May 21 19:29 test.txt

Volume permissions (v0.26):

# ls -lanZ pvc-1fdbed66-a718-463d-908e-21bd731934c4_kube-system_test-pvc/
total 12
drwxrwxrwx.  2 0 0 system_u:object_r:container_file_t:s0:c143,c741 4096 May 21 19:34 .
drwx------. 11 0 0 system_u:object_r:container_file_t:s0           4096 May 21 19:33 ..
-rw-r--r--.  1 0 0 system_u:object_r:container_file_t:s0:c143,c741    5 May 21 19:34 test.txt
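Since SELinux is enabled here, one way to confirm it is the blocker is to look for AVC denials and dry-run a relabel of the storage tree. A hedged sketch; ausearch (auditd) and restorecon (policycoreutils) are assumed to be installed:

```shell
#!/bin/sh
# Debugging sketch, assuming SELinux denials cause the Permission denied
# seen in the helper pod above.
STORAGE_DIR="/var/lib/rancher/k3s/storage"
if command -v ausearch >/dev/null 2>&1; then
    ausearch -m avc -ts recent || true   # recent SELinux denials, if any
fi
if command -v restorecon >/dev/null 2>&1; then
    restorecon -Rnv "${STORAGE_DIR}" || true   # -n: dry-run relabel only
fi
echo "inspected ${STORAGE_DIR}"
```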

Should I file a separate issue?

zc-devs · May 21 '24 16:05

Yes, that sounds like a separate issue. I don't see any difference in the volume permissions or contexts between the two versions, though; it is expected that the numeric portion at the end will differ.

brandond · May 21 '24 17:05

Environment Details

Attempted to reproduce, but didn't hit it this time using VERSION=v1.29.5+k3s1 nor VERSION=v1.30.1+k3s1. Validated using COMMIT=cff6f7aa1d7987a658b030b2bc69df7c25f515c8.

Infrastructure

  • [X] Cloud

Node(s) CPU architecture, OS, and version:

Linux 5.14.21-150500.53-default x86_64 GNU/Linux PRETTY_NAME="SUSE Linux Enterprise Server 15 SP5"

Cluster Configuration:

NAME               STATUS   ROLES                       AGE   VERSION
ip-3-3-8-8         Ready    control-plane,etcd,master   23m   v1.30.1+k3s-cff6f7aa

Config.yaml:

node-external-ip: 3.3.8.8
token: YOUR_TOKEN_HERE
write-kubeconfig-mode: 644
debug: true
cluster-init: true
embedded-registry: true

Reproduction

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0 \nvm.overcommit_memory=1 \nkernel.panic=10 \nkernel.panic_on_oops=1 \n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ COMMIT=cff6f7aa1d7987a658b030b2bc69df7c25f515c8
$ sudo INSTALL_K3S_COMMIT=$COMMIT INSTALL_K3S_EXEC=server ./install-k3s.sh
$ kg no,po -A
$ vim pvc.yaml
$ vim pv-pod.yaml
$ k apply -f pvc.yaml
$ kg pvc -A
$ k apply -f pv-pod.yaml
$ kg po,pv,pvc -A
$ k delete -f pv-pod.yaml
$ k delete -f pvc.yaml
$ kg pv -A
$ kubectl get deployment -n kube-system local-path-provisioner -o=jsonpath='{$.spec.template.spec.containers[:1].image}'
$ k apply -f pvc.yaml
$ k apply -f pv-pod.yaml
$ kg pv,pod,pvc -A
$ k delete -f pvc.yaml; sleep 40; k delete -f pv-pod.yaml
$ kg pv,pvc -A
$ k apply -f pvc.yaml
$ k apply -f pv-pod.yaml
$ k delete -f pvc.yaml
$ k delete -f pv-pod.yaml
$ kg pvc,pv -A

Results:

No dangling leftover volume claims; they're reclaimed appropriately.
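Rather than re-running `kg pv` by hand, the reclamation check can be made deterministic by waiting for the PV object to disappear. A sketch, using the PV name from the original report as a placeholder:

```shell
#!/bin/sh
# Sketch: block until the released PV is actually removed. PV_NAME is the
# pvc-... name printed by `kubectl get pv` (example value from this issue).
PV_NAME="pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4"
if command -v kubectl >/dev/null 2>&1; then
    kubectl wait --for=delete "pv/${PV_NAME}" --timeout=60s || true
fi
echo "done waiting for pv/${PV_NAME}"
```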

$ cat pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi

$ cat pv-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: kube-system
spec:
  containers:
    - name: debian
      image: digitalocean/doks-debug
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc

VestigeJ · Jun 12 '24 23:06