gloo
Latest 1.11.x causing nodes to crash on EKS
Gloo Edge Version
1.11.x (latest stable)
Kubernetes Version
1.22.x
Describe the bug
As soon as Gloo Edge Enterprise 1.11.20 is deployed on a fresh EKS cluster, some of the EKS nodes become unresponsive and go into a NotReady state.
Node details:
Kernel Version: 5.4.190-107.353.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.13
Kubelet Version: v1.22.6-eks-7d6806
At a quick glance, the cause appears to be high I/O load (47.4% iowait, load average 94.71):
top - 08:51:47 up 2:39, 2 users, load average: 94.71, 35.21, 12.90
Tasks: 156 total, 4 running, 92 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 50.1 sy, 0.0 ni, 1.2 id, 47.4 wa, 0.0 hi, 0.0 si, 0.5 st
KiB Mem : 3965408 total, 107772 free, 3805144 used, 52492 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 12940 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
553 root 20 0 0 0 0 R 100.0 0.0 1:39.41 kswapd0
5223 root 20 0 747964 9100 0 S 22.9 0.2 0:04.78 kube-proxy
20017 10101 20 0 2287760 12972 0 D 6.5 0.3 0:01.42 envoy
2923 root 20 0 2027528 55764 0 S 5.1 1.4 1:22.81 dockerd
21068 101 20 0 786432 15948 0 S 4.4 0.4 0:01.04 gloo-fed-apiser
5707 root 20 0 759972 23320 0 D 3.8 0.6 0:04.83 aws-k8s-agent
21167 472 20 0 783872 35588 0 S 3.2 0.9 0:05.97 grafana-server
21655 root 20 0 1100720 3060 0 D 3.1 0.1 0:00.64 runc:[2:INIT]
20428 10101 20 0 775252 43580 0 S 3.0 1.1 0:01.87 rate-limit
20530 10101 20 0 8320340 3.1g 0 R 3.0 82.0 0:11.84 discovery
3139 root 20 0 1821096 38004 0 S 2.9 1.0 2:25.62 kubelet
20608 10101 20 0 772016 40276 0 D 2.9 1.0 0:02.26 observability
2533 root 20 0 730992 17424 0 S 2.7 0.4 0:04.57 ssm-agent-worke
This was also confirmed after looking at iotop.
In particular, the discovery service was driving the high load: in the top output it holds 3.1g resident (82% of node memory). Re-testing the deployment without discovery resolved the issue.
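One way to double-check which process dominates memory is to rank the rows of the top output by %MEM. The following is a minimal Python sketch, not part of the original investigation; it assumes the default top column order shown above, and the function name `rank_by_mem` is hypothetical.

```python
def rank_by_mem(top_rows):
    """Return (command, %MEM, RES) tuples sorted by %MEM, highest first."""
    ranked = []
    for row in top_rows.strip().splitlines():
        fields = row.split()
        # Assumed columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip header and summary lines
        command = " ".join(fields[11:])
        ranked.append((command, float(fields[9]), fields[5]))
    return sorted(ranked, key=lambda r: r[1], reverse=True)


# A few rows copied from the top output in this report.
SAMPLE = """
553 root 20 0 0 0 0 R 100.0 0.0 1:39.41 kswapd0
20017 10101 20 0 2287760 12972 0 D 6.5 0.3 0:01.42 envoy
20530 10101 20 0 8320340 3.1g 0 R 3.0 82.0 0:11.84 discovery
3139 root 20 0 1821096 38004 0 S 2.9 1.0 2:25.62 kubelet
"""

for command, mem_pct, res in rank_by_mem(SAMPLE)[:3]:
    print(f"{command:12s} {mem_pct:5.1f}% RES={res}")
```

On these rows, discovery ranks first at 82.0% with 3.1g resident. Together with kswapd0 at 100% CPU, that would be consistent with memory pressure contributing to the I/O wait, although this has not been verified beyond the top and iotop observations.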
Consistently reproducible on EKS.
Steps to reproduce the bug
- Create a fresh 1.22.x EKS cluster (of 2-3 nodes)
- Deploy GEE 1.11.20 (with discovery enabled)
Expected Behavior
Expected GEE to run as normal and for EKS to be stable.
Additional Context
I managed to recover the nodes by rolling back to 1.11.16. I have not tested any of the versions between 1.11.16 and 1.11.20.
Discovery logs are below.
{"level":"info","ts":"2022-06-15T23:32:42.676Z","logger":"fds.v1.event_loop","caller":"v1/setup_event_loop.sk.go:57","msg":"event loop started","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:42.676Z","logger":"uds.v1.event_loop","caller":"v1/setup_event_loop.sk.go:57","msg":"event loop started","version":"undefined"}
I0615 23:32:44.038331 1 request.go:665] Waited for 1.188152658s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/api/v1/namespaces/apps-configuration/secrets?limit=500&resourceVersion=0
{"level":"info","ts":"2022-06-15T23:32:45.883Z","logger":"uds.v1.event_loop.uds.kube-uds","caller":"kubernetes/uds.go:32","msg":"started","version":"undefined","watchns":["petclinic","petstore","apps","apps-configuration","gloo-system","gloo-portal"],"writens":"gloo-system"}
{"level":"info","ts":"2022-06-15T23:32:45.883Z","logger":"uds.v1.event_loop.uds.v1.event_loop","caller":"v1/discovery_event_loop.sk.go:57","msg":"event loop started","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:45.992Z","logger":"fds.v1.event_loop.fds.v1.event_loop","caller":"v1/discovery_event_loop.sk.go:57","msg":"event loop started","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.297Z","logger":"uds.v1.event_loop.uds","caller":"discovery/discovery.go:154","msg":"reconciled upstreams","version":"undefined","discovered_by":"kubernetesplugin","upstreams":16}
{"level":"info","ts":"2022-06-15T23:32:46.321Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:33","msg":"begin sync 6885213428526893516 (18 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.498Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"discovery/discovery.go:154","msg":"reconciled upstreams","version":"undefined","discovered_by":"kubernetesplugin","upstreams":16}
{"level":"info","ts":"2022-06-15T23:32:46.498Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:43","msg":"end sync 6885213428526893516","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.499Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:33","msg":"begin sync 14695981039346656037 (0 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.586Z","logger":"fds.v1.event_loop.fds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:34","msg":"begin sync 6885213428526893516 (18 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.586Z","logger":"fds.v1.event_loop.fds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:44","msg":"end sync 6885213428526893516","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.599Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://rate-limit.gloo-system.svc.cluster.local:18081 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.600Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://extauth.gloo-system.svc.cluster.local:8083 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.607Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://gloo.gloo-system.svc.cluster.local:9977 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.617Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://rate-limit.gloo-system.svc.cluster.local:18081 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.644Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"discovery/discovery.go:154","msg":"reconciled upstreams","version":"undefined","discovered_by":"kubernetesplugin","upstreams":16}
{"level":"info","ts":"2022-06-15T23:32:46.645Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:43","msg":"end sync 14695981039346656037","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:47.357Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream rate-limit in namespace gloo-system: name:\"rate-limit\" namespace:\"gloo-system\" exists","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:47.396Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream gloo-system-rate-limit-18081 in namespace gloo-system: name:\"gloo-system-rate-limit-18081\" namespace:\"gloo-system\" exists","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:48.189Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://gloo.gloo-system.svc.cluster.local:9977 discovered as a gRPC service","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:48.905Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream rate-limit in namespace gloo-system: name:\"rate-limit\" namespace:\"gloo-system\" exists","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:48.923Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream gloo-system-rate-limit-18081 in namespace gloo-system: name:\"gloo-system-rate-limit-18081\" namespace:\"gloo-system\" exists","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.353Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:33","msg":"begin sync 2304029478127982007 (18 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.512Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"discovery/discovery.go:154","msg":"reconciled upstreams","version":"undefined","discovered_by":"kubernetesplugin","upstreams":16}
{"level":"info","ts":"2022-06-15T23:32:49.513Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:43","msg":"end sync 2304029478127982007","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.635Z","logger":"fds.v1.event_loop.fds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:34","msg":"begin sync 2304029478127982007 (18 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.639Z","logger":"fds.v1.event_loop.fds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:44","msg":"end sync 2304029478127982007","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.646Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream extauth in namespace gloo-system: error creating schema from gRPC reflection: listing services. are you sure upstream extauth.gloo-system implements reflection?: rpc error: code = Canceled desc = context canceled","version":"undefined"}
{"level":"error","ts":"2022-06-15T23:32:49.689Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.GraphqlSchemaDiscovery: 2 errors occurred:\n\t* context canceled\n\t* context canceled\n\n","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"error","ts":"2022-06-15T23:32:49.661Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.UpstreamFunctionDiscovery: 2 errors occurred:\n\t* context canceled\n\t* context canceled\n\n","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"warn","ts":"2022-06-15T23:32:49.682Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-extauth-8083 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.683Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gateway-proxy-443 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.683Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gateway-proxy-80 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.684Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-9976 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"error","ts":"2022-06-15T23:32:49.684Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.UpstreamFunctionDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"warn","ts":"2022-06-15T23:32:49.702Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-9979 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.705Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-9988 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.706Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-fed-console-10101 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.706Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-fed-console-8081 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.707Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-fed-console-8090 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.708Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-glooe-grafana-80 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.708Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-glooe-prometheus-kube-state-metrics-8080 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.709Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-glooe-prometheus-server-80 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"error","ts":"2022-06-15T23:32:49.709Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.UpstreamFunctionDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"error","ts":"2022-06-15T23:32:49.709Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.GraphqlSchemaDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"warn","ts":"2022-06-15T23:32:49.730Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-redis-6379 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"error","ts":"2022-06-15T23:32:49.730Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.UpstreamFunctionDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"error","ts":"2022-06-15T23:32:49.730Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.GraphqlSchemaDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"info","ts":"2022-06-15T23:32:49.730Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://rate-limit.gloo-system.svc.cluster.local:18081 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.731Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://extauth.gloo-system.svc.cluster.local:8083 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.750Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://gloo.gloo-system.svc.cluster.local:9977 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.771Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://rate-limit.gloo-system.svc.cluster.local:18081 discovered as a gRPC service","version":"undefined"}
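For a quick overview of logs like these, the JSON entries can be tallied by level and caller. The following is a minimal Python sketch using only the standard library; it assumes one JSON object per line and skips non-JSON lines such as the client-side throttling message. The function name `tally_levels` is hypothetical, and the sample entries are abridged from the log above.

```python
import json
from collections import Counter


def tally_levels(log_text):
    """Count JSON log entries per (level, caller); skip non-JSON lines."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # e.g. klog lines such as "I0615 23:32:44 ..."
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or malformed lines
        counts[(entry.get("level", "?"), entry.get("caller", "?"))] += 1
    return counts


# Abridged entries: only the level, caller, and a shortened msg are kept.
SAMPLE = "\n".join([
    '{"level":"info","caller":"grpc/grpc.go:131","msg":"discovered as a gRPC service"}',
    '{"level":"warn","caller":"fds/updater.go:134","msg":"unable to discover upstream, err: context canceled"}',
    '{"level":"warn","caller":"fds/updater.go:134","msg":"unable to discover upstream, err: context canceled"}',
    '{"level":"error","caller":"fds/updater.go:353","msg":"Error doing discovery: context canceled"}',
    "I0615 23:32:44.038331 1 request.go:665] Waited for 1.188152658s due to client-side throttling",
])

for (level, caller), n in tally_levels(SAMPLE).most_common():
    print(f"{n:3d} {level:5s} {caller}")
```

Run against the full discovery log, a tally like this makes it easier to see how many of the warnings and errors come from fds/updater.go with "context canceled".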
I was unable to reproduce this issue by following the steps described. After touching base with @pseudonator, it seems that the issue is not a pressing concern.
Kasun may update with additional detail if it recurs or if he has the opportunity to reproduce it again and add detail to the steps, but in the meantime the issue can be closed.