
csi-driver-nfs stops working and can't provision after microk8s node restart

Open yingding opened this issue 3 years ago • 10 comments

What happened: I have a 3-node microk8s cluster, which works fine with the v4.1.0 helm3 chart of csi-driver-nfs. I installed csi-driver-nfs following the instructions from https://microk8s.io/docs/nfs. After I restarted two worker nodes of microk8s, csi-driver-nfs stopped working. There is no error in the logs of the csi-driver-nfs controller and node pods.

What you expected to happen: After the microk8s worker nodes restart, csi-driver-nfs should keep working.

How to reproduce it:

  1. Create a microk8s cluster with 3 nodes
  2. Deploy csi-driver-nfs with the helm3 chart v4.1.0 and connect it successfully to an NFS export
  3. Restart two nodes of the microk8s cluster
  4. Wait for the two microk8s nodes to come back up and for the csi-driver-nfs controller and node pods to restart successfully
  5. Try to use a PVC to provision a new PV (see the example manifest below)
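
For step 5 I create a small test PVC. A minimal sketch, with ReadWriteMany and 1Gi as placeholders; only the claim name, namespace, and storage class match my setup (they also show up in the provisioner logs further down):

microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-volume
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: kubeflow-nfs-csi
EOF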

Anything else we need to know?: The workaround I found is to delete the helm3 csi-driver-nfs chart and redeploy it following the instructions from https://microk8s.io/docs/nfs.

I also tried restarting all the pods with kubectl delete pod --selector app.kubernetes.io/name=csi-driver-nfs --namespace kube-system, but that doesn't help.
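
For reference, the redeploy workaround I use, roughly following the microk8s docs (the chart repo URL and the kubeletDir value are the ones suggested there for a snap-based microk8s; adjust them to your setup):

microk8s helm3 uninstall csi-driver-nfs -n kube-system
microk8s helm3 repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
microk8s helm3 repo update
microk8s helm3 install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
  --namespace kube-system \
  --set kubeletDir=/var/snap/microk8s/common/var/lib/kubelet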

There was no error in the logs:

microk8s kubectl logs --selector app=csi-nfs-controller -n kube-system -c nfs
microk8s kubectl logs --selector app=csi-nfs-node -n kube-system -c nfs

Environment:

  • CSI Driver version: 4.1.0
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"f66044f4361b9f1f96f0053dd46cb7dce5e990a8", GitTreeState:"clean", BuildDate:"2022-06-17T22:28:26Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"} Kustomize Version: v4.5.4 Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.12-3+6937f71915b56b", GitCommit:"6937f71915b56b6004162b7c7b3f11f196100d44", GitTreeState:"clean", BuildDate:"2022-04-28T11:11:24Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="20.04.3 LTS (Focal Fossa)"
  • Kernel (e.g. uname -a): Linux 5.4.0-120-generic Ubuntu SMP Fri Jun 10 13:40:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: snap, microk8s
  • Others:

yingding avatar Jun 22 '22 18:06 yingding

If the csi-nfs-controller is running on a worker node which is restarting, the csi-nfs-controller will not accept any requests, so that's by design.

andyzhangx avatar Jun 23 '22 01:06 andyzhangx

@andyzhangx Thanks for your feedback.

If the csi-nfs-controller is running on a worker node which is restarting, the csi-nfs-controller will not accept any requests, so that's by design.

But what I mean is: even after the csi-nfs-controller has restarted on one of the microk8s worker nodes, shouldn't it be able to accept requests again? In my case, even after a successful restart of all csi-nfs-controller and csi-nfs-node containers, the csi-nfs-controller just doesn't function any more, and there is no error in the logs. I check the pod readiness with kubectl wait pod --selector app.kubernetes.io/name=csi-driver-nfs --for condition=ready --namespace kube-system.

I need to delete the csi-driver-nfs helm chart and redeploy it every time I restart the microk8s worker nodes where the csi-nfs-controller resides; restarting all csi-driver-nfs pods doesn't help. The csi-driver-nfs pods/containers must be recreated every time after the microk8s worker nodes restart. Could there be some caching effect?

yingding avatar Jun 23 '22 09:06 yingding

@yingding not sure, I think you should run kubectl logs csi-nfs-controller-56bfddd689-dh5tk -n kube-system -c csi-provisioner and take a look at whether any PVC provisioning is happening.

andyzhangx avatar Jun 23 '22 09:06 andyzhangx

@andyzhangx Thanks, next time when I run into this issue I will try kubectl logs csi-nfs-controller-56bfddd689-dh5tk -n kube-system -c csi-provisioner to see if the -c csi-provisioner option gives me further log info.

yingding avatar Jun 23 '22 09:06 yingding

@andyzhangx The same issue has happened again. From kubectl logs csi-nfs-controller-656d5f9d5b-9xg9p -n kube-system -c csi-provisioner I got the following error output.

...
I0628 10:24:41.136751       1 controller.go:1337] provision "kubeflow/test-volume" class "kubeflow-nfs-csi": started
I0628 10:24:41.136963       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "kubeflow/test-volume"
I0628 10:24:41.137741       1 controller.go:1075] Final error received, removing PVC f5630f38-b6e2-4df3-b1a1-57f23b7844fc from claims in progress
W0628 10:24:41.137762       1 controller.go:934] Retrying syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc", failure 6
E0628 10:24:41.137787       1 controller.go:957] error syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc": failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/name" in storage class
I0628 10:24:41.137800       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/name" in storage class
I0628 10:25:45.138886       1 controller.go:1337] provision "kubeflow/test-volume" class "kubeflow-nfs-csi": started
I0628 10:25:45.139083       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "kubeflow/test-volume"
I0628 10:25:45.139829       1 controller.go:1075] Final error received, removing PVC f5630f38-b6e2-4df3-b1a1-57f23b7844fc from claims in progress
W0628 10:25:45.139849       1 controller.go:934] Retrying syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc", failure 7
E0628 10:25:45.139873       1 controller.go:957] error syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc": failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/namespace" in storage class
I0628 10:25:45.139890       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/namespace" in storage class
I0628 10:27:53.141015       1 controller.go:1337] provision "kubeflow/test-volume" class "kubeflow-nfs-csi": started
I0628 10:27:53.141197       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "kubeflow/test-volume"
I0628 10:27:53.142196       1 controller.go:1075] Final error received, removing PVC f5630f38-b6e2-4df3-b1a1-57f23b7844fc from claims in progress
W0628 10:27:53.142218       1 controller.go:934] Retrying syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc", failure 8
E0628 10:27:53.142242       1 controller.go:957] error syncing claim "f5630f38-b6e2-4df3-b1a1-57f23b7844fc": failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/name" in storage class
I0628 10:27:53.142284       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"kubeflow", Name:"test-volume", UID:"f5630f38-b6e2-4df3-b1a1-57f23b7844fc", APIVersion:"v1", ResourceVersion:"68226858", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "kubeflow-nfs-csi": rpc error: code = InvalidArgument desc = invalid parameter "csi.storage.k8s.io/pvc/name" in storage class

Can you give me a hint?
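
In case it helps, this is how I dump the StorageClass the error refers to and the state of the failing claim (names taken from the log above):

microk8s kubectl get storageclass kubeflow-nfs-csi -o yaml
microk8s kubectl describe pvc test-volume -n kubeflow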

yingding avatar Jun 28 '22 10:06 yingding

OK, I fell back to the helm3 chart v4.0.0 and everything works now. It seems to be an issue with v4.1.0.
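
A pinned install along these lines is what I mean by falling back; the --version flag is standard helm, and the kubeletDir value is again an assumption for a snap-based microk8s:

microk8s helm3 install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
  --namespace kube-system \
  --version v4.0.0 \
  --set kubeletDir=/var/snap/microk8s/common/var/lib/kubelet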

yingding avatar Jun 28 '22 11:06 yingding

v4.1.0 is the master branch; it's not released yet.

andyzhangx avatar Jun 28 '22 14:06 andyzhangx

I will try to release the v4.1.0 version later this week.

andyzhangx avatar Jun 28 '22 14:06 andyzhangx

Thanks for clarifying. I initially thought I had to use v4.1.0 since it supports 1.20+ and my k8s is 1.21.12.

yingding avatar Jun 29 '22 07:06 yingding

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 27 '22 07:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 27 '22 07:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 26 '22 08:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 26 '22 08:11 k8s-ci-robot