Volumes stuck in Released state
Environmental Info:
K3s Version: v1.29.3+k3s1
Local path provisioner: v0.0.26
Node(s) CPU architecture, OS, and Version: Linux 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 20 04:52:13 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: 3 servers, embedded etcd
Describe the bug:
After deploying a new 1.29.3 cluster I noticed that PVs are not deleted and get stuck in the Released
state, even though they have persistentVolumeReclaimPolicy: Delete.
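The reclaim policy of a stuck PV can be double-checked directly; a minimal sketch, using the PV name from the output further below:
# kubectl get pv pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4 -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
Delete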
Steps To Reproduce:
- Install K3s cluster.
- Check Local path provisioner version:
# kubectl get deployment -n kube-system local-path-provisioner -o=jsonpath='{$.spec.template.spec.containers[:1].image}'
rancher/local-path-provisioner:v0.0.26
- Create test pvc.yaml.
kubectl apply -f pvc.yaml
- Create test pod.yaml.
kubectl apply -f pod.yaml
- Delete Pod.
kubectl delete -f pod.yaml
- Delete PVC.
kubectl delete -f pvc.yaml
- Check volume.
# kubectl get persistentvolume | grep test-pvc
pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4 10Mi RWO Delete Released kube-system/test-pvc local-ssd <unset> 17m
- Check logs of Local path provisioner (after applying a workaround from #9834). local-path-provisioner.log
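The failed helper pod itself can also be inspected; assuming it follows the helper-pod-delete-pvc-<PV name> naming seen in the attachment below, something like:
# kubectl -n kube-system get pods | grep helper-pod-delete-pvc
# kubectl -n kube-system logs helper-pod-delete-pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4
# kubectl -n kube-system describe pod helper-pod-delete-pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4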
Expected behavior:
The persistent volume is deleted, there are no errors in the Local path provisioner logs, and no failed helper-pod-delete-pvc-* pods appear.
Actual behavior:
The persistent volume is not deleted and stays stuck in the Released state; there are errors in the Local path provisioner's logs, and failed helper-pod-delete-pvc-* pods appear.
Additional context / logs: helper-pod-delete-pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4.yml
Workaround:
If the Local path provisioner is downgraded to v0.0.24, the PVs previously stuck in the Released state are automatically deleted.
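For reference, one way to perform that downgrade (a sketch, assuming the default deployment and container names shipped with K3s):
# kubectl -n kube-system set image deployment/local-path-provisioner local-path-provisioner=rancher/local-path-provisioner:v0.0.24
Note that K3s may reapply its bundled manifest, so the same change might instead have to be made in /var/lib/rancher/k3s/server/manifests/local-storage.yaml on a server node.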
My first change was to add the new option to the custom setup and teardown scripts:
setup: |-
  #!/bin/sh
  while getopts "m:s:p:a:" opt
  do
    case $opt in
      p)
        absolutePath=$OPTARG
        ;;
      s)
        sizeInBytes=$OPTARG
        ;;
      m)
        volMode=$OPTARG
        ;;
      a)
        action=$OPTARG
        ;;
    esac
  done
  if [ "$action" = "create" ]
  then
    mkdir -m 0777 -p ${absolutePath}
    chmod 700 ${absolutePath}/..
  fi
teardown: |-
  #!/bin/sh
  set -x
  while getopts "m:s:p:a:" opt
  do
    case $opt in
      p)
        absolutePath=$OPTARG
        ;;
      s)
        sizeInBytes=$OPTARG
        ;;
      m)
        volMode=$OPTARG
        ;;
      a)
        action=$OPTARG
        ;;
    esac
  done
  if [ "$action" = "delete" ]
  then
    rm -rf ${absolutePath}
  fi
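For illustration, the teardown script above can be exercised by hand with the flags it parses (a sketch, assuming the script is saved locally as teardown.sh; the path is the one from the debug session below, and the provisioner's real invocation of the helper pod may differ):
# sh teardown.sh -a delete -m Filesystem -s 10485760 -p /var/lib/rancher/k3s/storage/local/ssd/pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc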
With those scripts in place I no longer get the Illegal option -a error, but the helper pod still fails.
Then I tried to debug:
teardown: |-
  #!/bin/sh
  sleep infinity
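With the teardown script replaced by sleep infinity, a shell can be opened in the now-idle helper pod; a sketch, assuming the pod name follows the usual helper-pod-delete-pvc-<PV name> pattern:
# kubectl -n kube-system exec -it helper-pod-delete-pvc-a9809075-743a-4a86-ba28-38ca3d8256cd -- sh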
/ # ls -lah /var/lib/rancher/k3s/storage/local/ssd/pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc
ls: can't open '/var/lib/rancher/k3s/storage/local/ssd/pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc': Permission denied
total 0
/ # ls -lah /var/lib/rancher/k3s/storage/local/ssd
total 3G
drwx------ 14 root root 4.0K Mar 29 22:12 .
drwxr-xr-x 3 root root 4.0K Mar 29 22:33 ..
drwxrwxrwx 2 root root 4.0K Mar 29 22:13 pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc
/ # id
uid=0(root) gid=0(root) groups=0(root),10(wheel)
Then I compared the Pod definitions between v0.24 and v0.26 (helper pod definitions attached below).
TL;DR from the above link, where I encountered this issue while testing the latest COMMIT_ID for the v1.28 branch:
$ kg pv -A
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
checking-path 5Gi RWO Recycle Failed default/test-pvc local-path 50m
is this resolved already?
nope. issue is still open and PR is not merged. Waiting for end of code freeze.
Hi, I've just tested #9964 with the Local path config local-storage.yaml.
While the PV is created, it cannot be deleted; the helper pod fails.
helper-pod-create-pvc-v0.24.yaml helper-pod-create-pvc-v0.26.yaml helper-pod-delete-pvc-v0.24.yaml helper-pod-delete-pvc-v0.26.yaml
I think the main difference is that v0.26 doesn't use the privileged
security context flag, as I noted in https://github.com/k3s-io/k3s/issues/9833#issuecomment-2027802148 (see the sketch after the listings below). I should also mention that I use Oracle Linux 9 with SELinux enabled. Directory permissions are:
# ls -lanZ /var/lib/rancher/k3s/storage/
total 48
drwx------. 12 0 0 system_u:object_r:container_file_t:s0 4096 May 21 19:24 .
drwxr-xr-x. 6 0 0 system_u:object_r:container_var_lib_t:s0 4096 Jan 18 18:32 ..
Volume permissions (v0.24):
# ls -lanZ pvc-9391085e-05dc-42f4-9c0d-25a1b4e6fe4a_kube-system_test-pvc/
total 12
drwxrwxrwx. 2 0 0 system_u:object_r:container_file_t:s0:c131,c199 4096 May 21 19:26 .
drwx------. 12 0 0 system_u:object_r:container_file_t:s0 4096 May 21 19:24 ..
-rw-r--r--. 1 0 0 system_u:object_r:container_file_t:s0:c131,c199 5 May 21 19:29 test.txt
Volume permissions (v0.26):
# ls -lanZ pvc-1fdbed66-a718-463d-908e-21bd731934c4_kube-system_test-pvc/
total 12
drwxrwxrwx. 2 0 0 system_u:object_r:container_file_t:s0:c143,c741 4096 May 21 19:34 .
drwx------. 11 0 0 system_u:object_r:container_file_t:s0 4096 May 21 19:33 ..
-rw-r--r--. 1 0 0 system_u:object_r:container_file_t:s0:c143,c741 5 May 21 19:34 test.txt
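As an experiment related to the privileged-flag theory above, one option is to give the helper pod a privileged security context via the helperPod.yaml key of the local-path config; a minimal sketch, assuming that key exists in the bundled ConfigMap and keeping whatever image it already uses:
helperPod.yaml: |-
  apiVersion: v1
  kind: Pod
  metadata:
    name: helper-pod
  spec:
    containers:
      - name: helper-pod
        image: busybox  # assumption: reuse the image from the bundled config
        securityContext:
          privileged: true
If SELinux involvement is suspected, recent denials can be checked with sudo ausearch -m avc -ts recent (assuming auditd is running).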
Should I file a separate issue?
yes, that sounds like a separate issue. I don't see any difference in the volume permissions or contexts between the two versions though? It is expected that the numeric portion at the end will differ.
Environment Details
Attempted to reproduce but didn't hit it this time using either VERSION=v1.29.5+k3s1 or VERSION=v1.30.1+k3s1. Validated using COMMIT=cff6f7aa1d7987a658b030b2bc69df7c25f515c8.
Infrastructure
- [X] Cloud
Node(s) CPU architecture, OS, and version:
Linux 5.14.21-150500.53-default x86_64 GNU/Linux PRETTY_NAME="SUSE Linux Enterprise Server 15 SP5"
Cluster Configuration:
NAME STATUS ROLES AGE VERSION
ip-3-3-8-8 Ready control-plane,etcd,master 23m v1.30.1+k3s-cff6f7aa
Config.yaml:
node-external-ip: 3.3.8.8
token: YOUR_TOKEN_HERE
write-kubeconfig-mode: 644
debug: true
cluster-init: true
embedded-registry: true
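This config is assumed to be placed at the standard K3s location before running the installer, e.g.:
$ sudo mkdir -p /etc/rancher/k3s
$ sudo cp config.yaml /etc/rancher/k3s/config.yaml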
Reproduction
$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0 \nvm.overcommit_memory=1 \nkernel.panic=10 \nkernel.panic_on_oops=1 \n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ COMMIT=cff6f7aa1d7987a658b030b2bc69df7c25f515c8
$ sudo INSTALL_K3S_COMMIT=$COMMIT INSTALL_K3S_EXEC=server ./install-k3s.sh
$ kg no,po -A
$ vim pvc.yaml
$ vim pv-pod.yaml
$ k apply -f pvc.yaml
$ kg pvc -A
$ k apply -f pv-pod.yaml
$ kg po,pv,pvc -A
$ k delete -f pv-pod.yaml
$ k delete -f pvc.yaml
$ kg pv -A
$ kubectl get deployment -n kube-system local-path-provisioner -o=jsonpath='{$.spec.template.spec.containers[:1].image}'
$ k apply -f pvc.yaml
$ k apply -f pv-pod.yaml
$ kg pv,pod,pvc -A
$ k delete -f pvc.yaml; sleep 40; k delete -f pv-pod.yaml
$ kg pv,pvc -A
$ k apply -f pvc.yaml
$ k apply -f pv-pod.yaml
$ k delete -f pvc.yaml
$ k delete -f pv-pod.yaml
$ kg pvc,pv -A
Results:
No dangling leftover volume claims, they're reclaimed appropriately.
$ cat pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi
$ cat pv-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: kube-system
spec:
  containers:
    - name: debian
      image: digitalocean/doks-debug
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc