csi-driver-nfs stops working and can't provision after microk8s node restart
What happened: I have a 3-node microk8s cluster that works fine with the v4.1.0 Helm 3 chart of csi-driver-nfs. I installed csi-driver-nfs following the instructions at https://microk8s.io/docs/nfs. After I restarted two worker nodes of microk8s, csi-driver-nfs stopped working. There are no errors in the logs of the csi-driver-nfs controller and node pods.
What you expected to happen: After the microk8s worker nodes restart, csi-driver-nfs should continue to function.
How to reproduce it:
- Create a microk8s cluster with 3 nodes
- Deploy csi-driver-nfs with the Helm 3 chart v4.1.0 and connect successfully to an NFS export
- Restart two nodes of the microk8s cluster
- Wait until the two microk8s nodes have restarted successfully and the csi-driver-nfs controller and node pods are running again
- Try to use a PVC to provision a new PV (see the sketch after this list)
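A minimal test PVC along these lines is enough to trigger the provisioning step (a sketch: the test-volume name and the kubeflow-nfs-csi StorageClass match the provisioner logs further down; the namespace and requested size are placeholders):

```bash
# Sketch of a test PVC; the StorageClass "kubeflow-nfs-csi" is the one seen in the
# provisioner logs below, namespace and requested size are placeholders.
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-volume
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: kubeflow-nfs-csi
  resources:
    requests:
      storage: 1Gi
EOF

# Watch whether the claim gets bound to a new PV
microk8s kubectl get pvc test-volume -n kubeflow -w
```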
Anything else we need to know?: I found that the workaround is to delete the Helm 3 csi-driver-nfs release and redeploy the chart following the instructions at https://microk8s.io/docs/nfs (a sketch of the commands follows below).
I also tried restarting all the pods with kubectl delete pod --selector app.kubernetes.io/name=csi-driver-nfs --namespace kube-system, but that doesn't work.
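For reference, the redeploy workaround looks roughly like this (a sketch, assuming the release name csi-driver-nfs in kube-system, the upstream chart repo, and the kubeletDir value from the microk8s guide; adjust to your installation):

```bash
# Delete the existing release, then redeploy per https://microk8s.io/docs/nfs
microk8s helm3 uninstall csi-driver-nfs -n kube-system

microk8s helm3 repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
microk8s helm3 repo update
microk8s helm3 install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
  --namespace kube-system \
  --set kubeletDir=/var/snap/microk8s/common/var/lib/kubelet

# Wait until the controller and node pods report ready again
microk8s kubectl wait pod --selector app.kubernetes.io/name=csi-driver-nfs \
  --for condition=ready --namespace kube-system
```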
There were no errors in the logs from:
microk8s kubectl logs --selector app=csi-nfs-controller -n kube-system -c nfs
microk8s kubectl logs --selector app=csi-nfs-node -n kube-system -c nfs
Environment:
- CSI Driver version: 4.1.0
- Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"f66044f4361b9f1f96f0053dd46cb7dce5e990a8", GitTreeState:"clean", BuildDate:"2022-06-17T22:28:26Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"} Kustomize Version: v4.5.4 Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.12-3+6937f71915b56b", GitCommit:"6937f71915b56b6004162b7c7b3f11f196100d44", GitTreeState:"clean", BuildDate:"2022-04-28T11:11:24Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="20.04.3 LTS (Focal Fossa)"
- Kernel (e.g. uname -a): Linux 5.4.0-120-generic Ubuntu SMP Fri Jun 10 13:40:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: snap, microk8s
- Others:
If the csi-nfs-controller is running on a worker node that is restarting, the csi-nfs-controller will not accept any requests while that node is down; that's by design.
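To check which node the controller is currently scheduled on (and therefore whether a given node restart affects it), something like this works:

```bash
# List the csi-nfs-controller pod together with the node it is running on
microk8s kubectl get pod -n kube-system -l app=csi-nfs-controller -o wide
```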
@andyzhangx Thanks for your feedback.
If the csi-nfs-controller is running on a worker node that is restarting, the csi-nfs-controller will not accept any requests while that node is down; that's by design.
But what I mean is: even after the csi-nfs-controller has restarted on one of the microk8s worker nodes, it should be able to accept requests again, right? In my case, even after a successful restart of all csi-nfs-controller and csi-nfs-node containers, the csi-nfs-controller just doesn't function any more, and there is no error in the logs. I check pod readiness with kubectl wait pod --selector app.kubernetes.io/name=csi-driver-nfs --for condition=ready --namespace kube-system.
I need to delete the csi-driver-nfs Helm chart and redeploy it every time I restart the microk8s worker nodes where the csi-nfs-controller resides; restarting all csi-driver-nfs pods doesn't help. The csi-driver-nfs pods/containers must be recreated every time after the microk8s worker nodes restart. Could there be some caching effect?
@yingding not sure, I think you should run kubectl logs csi-nfs-controller-56bfddd689-dh5tk -n kube-system -c csi-provisioner to take a look at whether any PVC provisioning is happening.
@andyzhangx Thanks, the next time I run into this issue I will try kubectl logs csi-nfs-controller-56bfddd689-dh5tk -n kube-system -c csi-provisioner to see if the -c csi-provisioner option gives me more log information.
@andyzhangx The same issue happened again. From kubectl logs csi-nfs-controller-656d5f9d5b-9xg9p -n kube-system -c csi-provisioner
I got the following error output:
...
I0628 10:24:41.136751 1 controller.go:1337] provision "kubeflow/test-volume" class "kubeflow-nfs-csi": started
I0628 10:24:41.136963 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "kubeflow/test-volume"
I0628 10:24:41.137741 1 controller.go:1075] Final error received, removing PVC f5630f38-b6e2-4df3-b1a1-57f23b7844fc from claims in progress
W0628 10:24:41.137762 1 controller.go:934] Retrying syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc", failure 6
E0628 10:24:41.137787 1 controller.go:957] error syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc": failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/name" in storage class
I0628 10:24:41.137800 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/name" in storage class
I0628 10:25:45.138886 1 controller.go:1337] provision "kubeflow/test-volume" class "kubeflow-nfs-csi": started
I0628 10:25:45.139083 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "kubeflow/test-volume"
I0628 10:25:45.139829 1 controller.go:1075] Final error received, removing PVC f5630f38-b6e2-4df3-b1a1-57f23b7844fc from claims in progress
W0628 10:25:45.139849 1 controller.go:934] Retrying syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc", failure 7
E0628 10:25:45.139873 1 controller.go:957] error syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc": failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/namespace" in storage class
I0628 10:25:45.139890 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/namespace" in storage class
I0628 10:27:53.141015 1 controller.go:1337] provision "kubeflow/test-volume" class "kubeflow-nfs-csi": started
I0628 10:27:53.141197 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "kubeflow/test-volume"
I0628 10:27:53.142196 1 controller.go:1075] Final error received, removing PVC f5630f38-b6e2-4df3-b1a1-57f23b7844fc from claims in progress
W0628 10:27:53.142218 1 controller.go:934] Retrying syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc", failure 8
E0628 10:27:53.142242 1 controller.go:957] error syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc": failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/name" in storage class
I0628 10:27:53.142284 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/name" in storage class
Can you give me a hint?
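For completeness, the kubeflow-nfs-csi StorageClass was created along the lines of the microk8s guide; a sketch (server and share are placeholders, not the real values). The csi.storage.k8s.io/pvc/* keys from the error are not set in the class itself; as far as I understand, they are injected by the external-provisioner sidecar as extra create metadata, and the controller in use rejects them as unknown parameters.

```bash
# Sketch of the StorageClass, modeled on https://microk8s.io/docs/nfs;
# server and share are placeholders, not the actual cluster values.
microk8s kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kubeflow-nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.0.0.10   # placeholder NFS server address
  share: /srv/nfs     # placeholder export path
reclaimPolicy: Delete
volumeBindingMode: Immediate
mountOptions:
  - hard
  - nfsvers=4.1
EOF
```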
OK, I fell back to the Helm 3 chart v4.0.0 and everything works now. It is an issue with v4.1.0.
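Roughly what I ran for the fallback (a sketch; release name, namespace, and kubeletDir follow the microk8s guide, with the chart version pinned explicitly):

```bash
# Remove the release that pulled the unreleased v4.1.0 chart, then pin v4.0.0
microk8s helm3 uninstall csi-driver-nfs -n kube-system
microk8s helm3 install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
  --namespace kube-system \
  --version v4.0.0 \
  --set kubeletDir=/var/snap/microk8s/common/var/lib/kubelet
```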
v4.1.0 is the master branch; it's not released yet.
I will try to release v4.1.0 later this week.
Thanks for clarifying. I initially thought I had to use v4.1.0 since it supports 1.20+ and my Kubernetes version is 1.21.12.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.