pso-csi-node pods restart frequently
$ kubectl get pods -o wide -n storage
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pso-csi-controller-0 6/6 Running 6 6h1m 172.20.2.53 init22b-8 <none> <none>
pso-csi-node-24qd8 2/3 Running 53 6h1m 10.64.112.52 k8s-worker-02 <none> <none>
pso-csi-node-mfjg7 3/3 Running 0 6h1m 10.64.230.41 init22b-8 <none> <none>
pso-csi-node-ncqmn 3/3 Running 0 6h1m 10.64.230.40 init22b-7 <none> <none>
pso-csi-node-p6b97 2/3 Running 56 6h1m 10.64.112.51 k8s-worker-01 <none> <none>
pso-db-0-0 1/1 Running 0 6h1m 172.20.1.47 init22b-7 <none> <none>
pso-db-1-0 1/1 Running 0 6h1m 172.20.2.54 init22b-8 <none> <none>
pso-db-2-0 1/1 Running 0 6h1m 172.20.3.4 k8s-worker-01 <none> <none>
pso-db-3-0 1/1 Running 0 6h1m 172.20.4.5 k8s-worker-02 <none> <none>
pso-db-4-0 1/1 Running 0 6h1m 172.20.1.48 init22b-7 <none> <none>
pso-db-cockroach-operator-5f494b6b5b-wpskw 1/1 Running 0 6h1m 172.20.1.45 init22b-7 <none> <none>
pso-db-deployer-5df48fb79b-5hj82 1/1 Running 0 6h1m 172.20.1.46 init22b-7 <none> <none>
pso-explorer-84d6cf69f9-lgqjc 2/2 Running 0 5h35m 172.20.2.57 init22b-8 <none> <none>
psox-agent-2bj4j 1/1 Running 0 5h51m 172.20.3.5 k8s-worker-01 <none> <none>
psox-agent-2vx4x 1/1 Running 0 5h51m 172.20.4.7 k8s-worker-02 <none> <none>
psox-agent-pc5bh 1/1 Running 0 5h51m 172.20.2.56 init22b-8 <none> <none>
psox-agent-z2hnz 1/1 Running 0 5h51m 172.20.1.49 init22b-7 <none> <none>
psox-agent-zqlnl 1/1 Running 1 5h51m 172.20.0.9 k8s-master-01 <none> <none>
node-driver-registrar
W0508 00:42:08.409027 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:18.409174 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:28.409061 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:38.409045 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:48.409081 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:58.408966 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:08.408997 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:18.409026 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:28.409013 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:38.409098 56628 connection.go:170] Still connecting to unix:///csi/csi.sock
I0508 00:43:42.144113 56628 node_register.go:51] Starting Registration Server at: /registration/pure-csi-reg.sock
I0508 00:43:42.144325 56628 node_register.go:60] Registration Server started at: /registration/pure-csi-reg.sock
I0508 00:43:43.563343 56628 main.go:77] Received GetInfo call: &InfoRequest{}
I0508 00:43:43.584165 56628 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0508 00:49:40.845200 56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:00:54.861780 56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:05:12.763918 56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:14:36.757668 56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:33:18.765391 56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:28:59.840505 56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:46:42.755703 56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:57:45.872492 56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
pso-csi-container
Original NodeID: k8s-worker-01
processIPHost
time="2021-05-08T06:50:10Z" level=error msg="Failed to retrieve configmap from cluster." configmap="&ConfigMap{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Data:map[string]string{},BinaryData:map[string][]byte{},}" configmap_name=pure-csi-container-configmap error="an empty namespace may not be set when a resource name is provided"
time="2021-05-08T06:50:11Z" level=info msg="Starting server" debug=false endpoint="unix:///csi/csi.sock" nodeID=k8s-worker-01 span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Created listener" addr=//csi/csi.sock protocol=unix span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Listening for connections" address=//csi/csi.sock network=unix span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Parsing ISCSI target ranges for CIDR filter." ISCSITargetRanges=
time="2021-05-08T06:50:11Z" level=info msg="CIDR Filters constructed." CIDRs="[]"
(/go/src/pso.purestorage.com/k8s/vendor/pso.purestorage.com/nsm/pkg/datastores/services/cockroach/collection.go:166)
[2021-05-08 06:50:12] sql: no rows in result set
time="2021-05-08T06:51:00Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:00Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:00Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:02Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:02Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:02Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:04Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:04Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:04Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:06Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:06Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:06Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:08Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:08Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:08Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:10Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:10Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:10Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:12Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
Hi, @rpatil72. Can you turn on the debug flag in values.yaml, run the script here to collect the logs, and send them to [email protected]?
Thanks.
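For reference, a rough sketch of toggling that debug flag via helm instead of editing values.yaml by hand; the release name, repo alias, and the app.debug key are assumptions here, so verify the exact key against your chart's values.yaml:
$ # Assumes the release is named "pso" in the "storage" namespace shown above and
$ # that the chart exposes a debug toggle named app.debug -- both are assumptions.
$ helm upgrade pso pure/pure-pso -n storage --reuse-values --set app.debug=true
The node pods pick up the new log level once their containers restart.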
Thanks @dsupure for quick response. Have enabled debug in values.yaml and have sent the collected logs to that email
Thank you @rpatil72 . We got the logs. Our support team is debugging the logs. We will contact you shortly once we find the issue. Thanks for your patience.
Hi, @dsupure @rpatil72. Was a solution found for this problem, or do we need to contact support directly?
Hi @zatricky: My team is investigating it. We will give you an update tomorrow. Thanks for your patience.
Hello @rpatil72, @zatricky. After investigation we need some more info from you:
- Did you start with PSO 5 or 6?
- Did a fresh install of PSO 6 ever work fine? If yes, what is the environment setup (e.g. k8s and OS versions)?
- Although it restarts frequently, is it functioning or not?
- Did those pods restart before or after they became ready?
From the latest log I received via email:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pso-csi-controller-0 5/6 Running 0 6m26s 172.20.2.58 init22b-8 <none> <none>
pso-csi-node-dndw2 2/3 Running 1 6m35s 10.64.112.52 k8s-worker-02 <none> <none>
pso-csi-node-hvg85 2/3 Running 0 6m26s 10.64.230.41 init22b-8 <none> <none>
pso-csi-node-p2xjb 3/3 Running 0 6m35s 10.64.230.40 init22b-7 <none> <none>
pso-csi-node-xx56s 2/3 Running 1 6m35s 10.64.112.51 k8s-worker-01 <none> <none>
pso-db-0-0 1/1 Running 0 2d17h 172.20.1.47 init22b-7 <none> <none>
pso-db-1-0 1/1 Running 0 2d17h 172.20.2.54 init22b-8 <none> <none>
pso-db-2-0 1/1 Running 0 2d17h 172.20.3.4 k8s-worker-01 <none> <none>
pso-db-3-0 1/1 Running 0 2d17h 172.20.4.5 k8s-worker-02 <none> <none>
pso-db-4-0 1/1 Running 0 2d17h 172.20.1.48 init22b-7 <none> <none>
- The PSO pods had been running for 6 minutes while the db pods had been running for 2 days. Did you just delete those pods, or was something else done?
- From this log we didn't find anything useful. If you can reach the state in the first post on this page (controller ready, only 2 node pods restarting), could you run the log collection script again and send us the logs? 6 minutes is probably too short to grab them; sometimes it can take longer.
- Meanwhile, did you try a helm uninstall and then a reinstall of PSO 6 (a sketch of both this and grabbing the logs follows below)?
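A rough sketch of both suggestions; the pod and container names come from the earlier output in this thread, while the release name, repo alias, and values file are illustrative placeholders:
$ # Capture logs from the previous (crashed) containers of one restarting node pod.
$ kubectl logs pso-csi-node-p6b97 -n storage -c pso-csi-container --previous
$ kubectl logs pso-csi-node-p6b97 -n storage -c node-driver-registrar --previous
$ # Clean reinstall of the chart (release and repo names are placeholders).
$ helm uninstall pso -n storage
$ helm install pso pure/pure-pso -n storage -f values.yaml
The dedicated log collection script mentioned above gathers more than these two containers, so it remains the preferred route when it can run to completion.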
Hi, @pure-jliao. Our org has sent in a support request. Note that I have similar errors to those @rpatil72 submitted. In our case the original cause was a hypervisor cluster failure affecting many unrelated hosts, including an entire kubernetes cluster making use of the storage. We had already recovered all core services. I was hoping that a resolution was found for @rpatil72 so that we could at least know where to look next before submitting that support request (which has of course already been sent by now).
Thanks @zatricky. FYI, we did find something wrong on our side related to this error log line: "Failed to retrieve configmap from cluster".
However, it shouldn't be the root cause of the problem stated here and is somewhat misleading, which is why we asked for more info in my previous post.
@rpatil72: can you help answer the questions from @pure-jliao? Thanks.
@pure-jliao please find the replies below:
- Did you start with PSO 5 or 6? Started with PSO 6.
- Did a fresh install of PSO 6 ever work fine? If yes, what is the environment setup (e.g. k8s and OS versions)? This was a new setup with PSO 6; k8s 1.21 on Ubuntu 20.04.2 LTS / 5.4.0-72-generic, CRI is docker://20.10.2.
- Although it restarts frequently, is it functioning or not? It was not functioning.
- Did those pods restart before or after they became ready? They never got ready.
- The PSO pods started 6 minutes ago while the db pods started 2 days ago; did you just delete those pods, or was something else done? Tried to redeploy a couple of times.
- From this log we didn't find anything useful; if you can reach the state in the first post (controller ready, only 2 node pods restarting), could you run the log collection script again and send us the logs? I can try this, but I switched to https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner/blob/master/charts/nfs-subdir-external-provisioner/README.md and this works fine in the same environment (install sketch after this list).
- Meanwhile, did you try a helm uninstall and then a reinstall of PSO 6? Yes, but it did not help.
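For completeness, the alternative provisioner mentioned in the list above installs roughly as its README describes; the NFS server address and export path below are placeholders for your own NFS export:
$ helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
$ # <nfs-server-ip> and /exported/path are placeholders.
$ helm install nfs-subdir-external-provisioner \
    nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set nfs.server=<nfs-server-ip> --set nfs.path=/exported/path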
@rpatil72 thanks for replying. One thing to note: we haven't tested PSO 6 on k8s 1.21 and it is not officially supported yet, though we are not 100% sure the problem is caused by the newer k8s version. If possible, could you try a fresh install on k8s 1.20? Also, in your first post only 2 node pods were restarting and all the other PSO pods were healthy; if you can reach that state again, could you try to grab the logs again? It may be a bit hard to collect logs while the pods keep crashing. Finally, could you try running PSO without the explorer to rule out other effects?
If it still cannot be resolved, we can schedule a call. cc: @dsupure @dingyin
Facing the same issue.
bash-3.2$ helm list -n pure-pso
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
pure-pso pure-pso 1 2021-06-30 01:00:21.279272 +0200 CEST deployed pure-pso-6.1.1 6.1.1
bash-3.2$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
bash-3.2$
Any updates?
@smrutimandal sorry, we haven't narrowed down the problem here; there could be many things causing PSO pod failures. Please collect the logs and contact the support team for further investigation.
Hi,
Thanks for getting back to me. I am in the process of getting my support contract in order. In the meantime, I found that the following feature-gate defaults changed between 1.20 and 1.21:
- Kube API Server v1.20: CSIServiceAccountToken=true|false (ALPHA - default=false), CSIStorageCapacity=true|false (ALPHA - default=false)
- Kube API Server v1.21: CSIServiceAccountToken=true|false (BETA - default=true), CSIStorageCapacity=true|false (BETA - default=true)
Can these changes have any impact on the liveness probes?
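As a quick sanity check, the feature gates the API server actually runs with can be inspected; this sketch assumes a kubeadm-style control plane with the default static-pod manifest path, which may differ on other distributions:
$ # Show any explicit --feature-gates overrides passed to the API server.
$ grep -- '--feature-gates' /etc/kubernetes/manifests/kube-apiserver.yaml
$ # No output means the cluster runs with the version defaults listed above.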
Any help is greatly appreciated.
Thanks
@smrutimandal: As @pure-jliao mentioned, there are many possibilities and it's hard to troubleshoot your problem via GitHub. Please contact Pure support and we will create a support case for you; we will follow up there. Thank you.
Hi @dsupure ,
I understand that. If compatibility is really the issue, though, the https://github.com/purestorage/pso-csi#environments-supported section, which assures support for v1.17+, is a bit confusing.
I will get in touch with support.
Thanks
1.21 is supposed to be supported now; do you have any update? I have been getting the same liveness probe errors since the K8s upgrade (1.19 is EoL), though the setup seems to work OK (no restarts so far, to be monitored, and volumes get attached). A monitoring sketch follows this log excerpt:
time="2022-01-26T23:08:16Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2022-01-26T23:08:16Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2022-01-26T23:08:16Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"