
pso-csi-node pods restart frequently

Open rpatil72 opened this issue 3 years ago • 17 comments

$ kubectl get pods -o wide -n storage
NAME                                         READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
pso-csi-controller-0                         6/6     Running   6          6h1m    172.20.2.53    init22b-8       <none>           <none>
pso-csi-node-24qd8                           2/3     Running   53         6h1m    10.64.112.52   k8s-worker-02   <none>           <none>
pso-csi-node-mfjg7                           3/3     Running   0          6h1m    10.64.230.41   init22b-8       <none>           <none>
pso-csi-node-ncqmn                           3/3     Running   0          6h1m    10.64.230.40   init22b-7       <none>           <none>
pso-csi-node-p6b97                           2/3     Running   56         6h1m    10.64.112.51   k8s-worker-01   <none>           <none>
pso-db-0-0                                   1/1     Running   0          6h1m    172.20.1.47    init22b-7       <none>           <none>
pso-db-1-0                                   1/1     Running   0          6h1m    172.20.2.54    init22b-8       <none>           <none>
pso-db-2-0                                   1/1     Running   0          6h1m    172.20.3.4     k8s-worker-01   <none>           <none>
pso-db-3-0                                   1/1     Running   0          6h1m    172.20.4.5     k8s-worker-02   <none>           <none>
pso-db-4-0                                   1/1     Running   0          6h1m    172.20.1.48    init22b-7       <none>           <none>
pso-db-cockroach-operator-5f494b6b5b-wpskw   1/1     Running   0          6h1m    172.20.1.45    init22b-7       <none>           <none>
pso-db-deployer-5df48fb79b-5hj82             1/1     Running   0          6h1m    172.20.1.46    init22b-7       <none>           <none>
pso-explorer-84d6cf69f9-lgqjc                2/2     Running   0          5h35m   172.20.2.57    init22b-8       <none>           <none>
psox-agent-2bj4j                             1/1     Running   0          5h51m   172.20.3.5     k8s-worker-01   <none>           <none>
psox-agent-2vx4x                             1/1     Running   0          5h51m   172.20.4.7     k8s-worker-02   <none>           <none>
psox-agent-pc5bh                             1/1     Running   0          5h51m   172.20.2.56    init22b-8       <none>           <none>
psox-agent-z2hnz                             1/1     Running   0          5h51m   172.20.1.49    init22b-7       <none>           <none>
psox-agent-zqlnl                             1/1     Running   1          5h51m   172.20.0.9     k8s-master-01   <none>           <none>
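
For reference, the per-container logs below can be pulled with commands along these lines (the namespace and pod names come from the listing above; --previous is only needed once a container has already restarted):

$ kubectl -n storage logs pso-csi-node-24qd8 -c node-driver-registrar
$ kubectl -n storage logs pso-csi-node-24qd8 -c pso-csi-container --previous
$ kubectl -n storage describe pod pso-csi-node-24qd8     # restart counts and probe failures show up under Events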

node-driver-registrar

W0508 00:42:08.409027   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:18.409174   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:28.409061   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:38.409045   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:48.409081   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:58.408966   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:08.408997   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:18.409026   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:28.409013   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:38.409098   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
I0508 00:43:42.144113   56628 node_register.go:51] Starting Registration Server at: /registration/pure-csi-reg.sock
I0508 00:43:42.144325   56628 node_register.go:60] Registration Server started at: /registration/pure-csi-reg.sock
I0508 00:43:43.563343   56628 main.go:77] Received GetInfo call: &InfoRequest{}
I0508 00:43:43.584165   56628 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0508 00:49:40.845200   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:00:54.861780   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:05:12.763918   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:14:36.757668   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:33:18.765391   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:28:59.840505   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:46:42.755703   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:57:45.872492   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.

pso-csi-container

Original NodeID: k8s-worker-01
processIPHost
time="2021-05-08T06:50:10Z" level=error msg="Failed to retrieve configmap from cluster." configmap="&ConfigMap{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Data:map[string]string{},BinaryData:map[string][]byte{},}" configmap_name=pure-csi-container-configmap error="an empty namespace may not be set when a resource name is provided"
time="2021-05-08T06:50:11Z" level=info msg="Starting server" debug=false endpoint="unix:///csi/csi.sock" nodeID=k8s-worker-01 span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Created listener" addr=//csi/csi.sock protocol=unix span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Listening for connections" address=//csi/csi.sock network=unix span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Parsing ISCSI target ranges for CIDR filter." ISCSITargetRanges=
time="2021-05-08T06:50:11Z" level=info msg="CIDR Filters constructed." CIDRs="[]"
(/go/src/pso.purestorage.com/k8s/vendor/pso.purestorage.com/nsm/pkg/datastores/services/cockroach/collection.go:166)
[2021-05-08 06:50:12]  sql: no rows in result set 
time="2021-05-08T06:51:00Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:00Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:00Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:02Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:02Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:02Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:04Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:04Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:04Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:06Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:06Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:06Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:08Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:08Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:08Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:10Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:10Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:10Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:12Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1

rpatil72 avatar May 08 '21 06:05 rpatil72

Hi, @rpatil72. Can you turn on the debug flag in values.yaml, run the script here to collect the logs, and send them to [email protected]?
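
(For reference, a minimal sketch of flipping the flag and rolling it out; this assumes the chart exposes it under the app section of values.yaml, and the release name, repo alias and namespace are placeholders for whatever was used at install time:)

$ # set the debug flag to true in values.yaml (assumed here to live under the app section), then upgrade in place
$ helm upgrade <release> <repo>/pure-pso -n <pso-namespace> -f values.yaml
$ kubectl -n <pso-namespace> get pods -w    # wait for the PSO pods to come back before collecting logs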

Thanks.

dsupure avatar May 09 '21 04:05 dsupure

Thanks @dsupure for the quick response. I have enabled debug in values.yaml and sent the collected logs to that email.

rpatil72 avatar May 09 '21 05:05 rpatil72

Thank you @rpatil72. We received the logs and our support team is analyzing them. We will contact you shortly once we find the issue. Thanks for your patience.

dsupure avatar May 12 '21 22:05 dsupure

Hi, @dsupure @rpatil72. Was a solution ever found for this problem, or do we need to contact support directly?

zatricky avatar Jun 10 '21 09:06 zatricky

Hi @zatricky: my team is investigating it. We will give you an update tomorrow. Thanks for your patience.

dsupure avatar Jun 11 '21 01:06 dsupure

Hello @rpatil72 @zatricky, after investigation we need some more information from you:

  • Did you start with PSO 5 or 6?
  • Did a fresh install of PSO 6 work at first? If yes, what is the environment setup, e.g. k8s and OS versions?
  • Though it restarts frequently, is PSO functioning or not?
  • Did those pods restart before or after they became ready?

From the latest log I received via email:

NAME                                         READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
pso-csi-controller-0                         5/6     Running   0          6m26s   172.20.2.58    init22b-8       <none>           <none>
pso-csi-node-dndw2                           2/3     Running   1          6m35s   10.64.112.52   k8s-worker-02   <none>           <none>
pso-csi-node-hvg85                           2/3     Running   0          6m26s   10.64.230.41   init22b-8       <none>           <none>
pso-csi-node-p2xjb                           3/3     Running   0          6m35s   10.64.230.40   init22b-7       <none>           <none>
pso-csi-node-xx56s                           2/3     Running   1          6m35s   10.64.112.51   k8s-worker-01   <none>           <none>
pso-db-0-0                                   1/1     Running   0          2d17h   172.20.1.47    init22b-7       <none>           <none>
pso-db-1-0                                   1/1     Running   0          2d17h   172.20.2.54    init22b-8       <none>           <none>
pso-db-2-0                                   1/1     Running   0          2d17h   172.20.3.4     k8s-worker-01   <none>           <none>
pso-db-3-0                                   1/1     Running   0          2d17h   172.20.4.5     k8s-worker-02   <none>           <none>
pso-db-4-0                                   1/1     Running   0          2d17h   172.20.1.48    init22b-7       <none>           <none>
  • The PSO pods had been up for only 6 minutes while the db pods had been up for 2 days. Did you just delete those pods, or was something else done?
  • We didn't find anything useful in this log. If you can reach the state shown in the first post on this page (controller ready, only 2 node pods restarting), could you run the log collection script again and send us the logs? Six minutes is probably too short to capture them; sometimes it can take longer.
  • Meanwhile, did you try a helm uninstall followed by a reinstall of PSO 6 (see the sketch below)?
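
A rough sketch of that uninstall/reinstall cycle, with the release name, repo alias and namespace as placeholders for whatever was used at install time:

$ helm uninstall <release> -n <pso-namespace>
$ kubectl -n <pso-namespace> get pods          # wait until the PSO pods are actually gone
$ helm install <release> <repo>/pure-pso -n <pso-namespace> -f values.yaml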

pure-jliao avatar Jun 11 '21 19:06 pure-jliao

Hi, @pure-jliao. Our org has sent in a support request. Note that I have similar errors to those @rpatil72 submitted. In our case the original cause was a hypervisor cluster failure affecting many unrelated hosts, including an entire kubernetes cluster making use of the storage. We had already recovered all core services. I was hoping that a resolution was found for @rpatil72 so that we could at least know where to look next before submitting that support request (which has of course already been sent by now).

zatricky avatar Jun 11 '21 21:06 zatricky

Thanks @zatricky. FYI, we did find something wrong on our side related to the error log line "Failed to retrieve configmap from cluster", but it shouldn't be the root cause of the problem stated here and is somewhat misleading, which is why we asked for more information in my previous post.
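
(A quick way to sanity-check that the configmap itself exists, if anyone wants to rule that log line out locally; the namespace is whatever PSO was installed into:)

$ kubectl -n <pso-namespace> get configmap pure-csi-container-configmap -o yaml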

pure-jliao avatar Jun 11 '21 22:06 pure-jliao

@rpatil72: can you help answer the questions from @pure-jliao? Thanks.

dsupure avatar Jun 15 '21 23:06 dsupure

@pure-jliao please find the replies below:

  • Did you start with PSO 5 or 6? Started with PSO 6.

  • Did a fresh install of PSO 6 work at first? If yes, what is the environment setup, e.g. k8s and OS versions? This was a new setup with PSO 6: k8s 1.21 on Ubuntu 20.04.2 LTS / 5.4.0-72-generic, CRI is docker://20.10.2.

  • Though it restarts frequently, is PSO functioning or not? It was not functioning.

  • Did those pods restart before or after they became ready? They never became ready.

  • The PSO pods had been up for only 6 minutes while the db pods had been up for 2 days. Did you just delete those pods, or was something else done? I tried to redeploy a couple of times.

  • We didn't find anything useful in this log; could you run the log collection script again and send us the logs? I can try this, but I switched to https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner/blob/master/charts/nfs-subdir-external-provisioner/README.md and it works fine in the same environment (installed roughly as sketched after this list).

  • Meanwhile, did you try a helm uninstall and then reinstall PSO 6 again? Yes, but it did not help.
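
For completeness, the provisioner I switched to was installed roughly as its README describes; the NFS server address and export path below are placeholders:

$ helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
$ helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set nfs.server=<nfs-server-ip> \
    --set nfs.path=<exported-path>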

rpatil72 avatar Jun 16 '21 16:06 rpatil72

@rpatil72 thanks for the replies. One thing to note: we haven't tested PSO 6 on k8s 1.21 and it's not officially supported yet. We're not 100% sure the problem is caused by the newer k8s version, so if possible, could you try a fresh install on k8s 1.20? Also, in your first post only 2 node pods were restarting and all other PSO pods were healthy; if you can reach that state again, could you try to grab the logs again? It might be a bit hard to collect logs if the pods keep crashing. Could you also try running PSO without the explorer to rule out other effects?

If it still cannot be resolved, we could schedule a call. cc: @dsupure @dingyin

pure-jliao avatar Jun 17 '21 18:06 pure-jliao

Facing the same issue.

bash-3.2$ helm list -n pure-pso
NAME    	NAMESPACE	REVISION	UPDATED                              	STATUS  	CHART         	APP VERSION
pure-pso	pure-pso 	1       	2021-06-30 01:00:21.279272 +0200 CEST	deployed	pure-pso-6.1.1	6.1.1
bash-3.2$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
bash-3.2$

Any updates?

smrutimandal avatar Jun 29 '21 23:06 smrutimandal

@smrutimandal sorry, we haven't narrowed down the problem here; there could be many things causing PSO pod failures. Please collect the logs and contact the support team for further investigation.

pure-jliao avatar Jun 30 '21 02:06 pure-jliao

Hi,

Thanks for getting back to me. I am in the process of getting my support contract in order. In the meantime, I found that the following feature-gate defaults changed between 1.20 and 1.21.

Kube API Server v1.20: CSIServiceAccountToken=true|false (ALPHA - default=false) CSIStorageCapacity=true|false (ALPHA - default=false)

Kube API Server v1.21: CSIServiceAccountToken=true|false (BETA - default=true) CSIStorageCapacity=true|false (BETA - default=true)
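
(One way to confirm which feature gates the API server is actually running with; this is a sketch assuming a kubeadm-style cluster with a static-pod manifest on the control-plane node:)

$ grep feature-gates /etc/kubernetes/manifests/kube-apiserver.yaml      # on a control-plane node
$ kubectl -n kube-system get pods -l component=kube-apiserver -o yaml | grep feature-gates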

Can these changes have any impact on the liveness-probes?

Any help is greatly appreciated.

Thanks

smrutimandal avatar Jul 14 '21 13:07 smrutimandal

@smrutimandal: As @pure-jliao mentioned, there are many possibilities and it's hard to troubleshoot your problem via GitHub. Please contact Pure support and we will create a support case for you; we will follow up there. Thank you.

dsupure avatar Jul 14 '21 16:07 dsupure

Hi @dsupure ,

I understand that. If compatibility really is the issue, though, the https://github.com/purestorage/pso-csi#environments-supported section, which assures support for v1.17+, is a bit confusing.

I will get in touch with support.

Thanks

smrutimandal avatar Jul 15 '21 06:07 smrutimandal

1.21 is supposed to be supported now; do you have any update? I have been getting the same liveness probe errors since the k8s upgrade (1.19 is EoL), though the setup seems to work OK (no restarts so far, to be monitored, and volumes get attached):

time="2022-01-26T23:08:16Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2022-01-26T23:08:16Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2022-01-26T23:08:16Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"

mscbpi avatar Jan 26 '22 18:01 mscbpi