Latest 1.11.x causing nodes to crash on EKS

Open day0ops opened this issue 2 years ago • 1 comments

Gloo Edge Version

1.11.x (latest stable)

Kubernetes Version

1.22.x

Describe the bug

As soon as Gloo Edge Enterprise 1.11.20 is deployed on a fresh EKS cluster, some of the EKS nodes become unresponsive and go into a NotReady state.

Node details:

  Kernel Version:             5.4.190-107.353.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.13
  Kubelet Version:            v1.22.6-eks-7d6806

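At a quick glance it appears to be high IO load (47.4% iowait in the top output below). Note that kswapd0 is at 100% CPU and discovery is holding 82% of memory (3.1g resident).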

top - 08:51:47 up  2:39,  2 users,  load average: 94.71, 35.21, 12.90
Tasks: 156 total,   4 running,  92 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us, 50.1 sy,  0.0 ni,  1.2 id, 47.4 wa,  0.0 hi,  0.0 si,  0.5 st
KiB Mem :  3965408 total,   107772 free,  3805144 used,    52492 buff/cache
KiB Swap:        0 total,        0 free,        0 used.    12940 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  553 root      20   0       0      0      0 R 100.0  0.0   1:39.41 kswapd0
 5223 root      20   0  747964   9100      0 S  22.9  0.2   0:04.78 kube-proxy
20017 10101     20   0 2287760  12972      0 D   6.5  0.3   0:01.42 envoy
 2923 root      20   0 2027528  55764      0 S   5.1  1.4   1:22.81 dockerd
21068 101       20   0  786432  15948      0 S   4.4  0.4   0:01.04 gloo-fed-apiser
 5707 root      20   0  759972  23320      0 D   3.8  0.6   0:04.83 aws-k8s-agent
21167 472       20   0  783872  35588      0 S   3.2  0.9   0:05.97 grafana-server
21655 root      20   0 1100720   3060      0 D   3.1  0.1   0:00.64 runc:[2:INIT]
20428 10101     20   0  775252  43580      0 S   3.0  1.1   0:01.87 rate-limit
20530 10101     20   0 8320340   3.1g      0 R   3.0 82.0   0:11.84 discovery
 3139 root      20   0 1821096  38004      0 S   2.9  1.0   2:25.62 kubelet
20608 10101     20   0  772016  40276      0 D   2.9  1.0   0:02.26 observability
 2533 root      20   0  730992  17424      0 S   2.7  0.4   0:04.57 ssm-agent-worke

This was also confirmed after looking at iotop.

The discovery service in particular was driving the high load: re-testing the deployment without discovery resolved the issue.
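
To make the comparison easier to repeat, the process rows from the top output above can be ranked by memory share with a short script. This is a minimal sketch in Python, not part of any Gloo tooling. It assumes the rows are pasted as text in top's default column order, and the names `parse_res_kib` and `rank_by_mem` are hypothetical helpers made up for illustration.

```python
def parse_res_kib(field: str) -> float:
    """Convert top's RES field to KiB; top prints plain KiB or a suffix such as 'm' or 'g'."""
    units = {"m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}
    suffix = field[-1].lower()
    if suffix in units:
        return float(field[:-1]) * units[suffix]
    return float(field)


def rank_by_mem(top_rows: str):
    """Return (command, %MEM, RES in KiB) tuples sorted by %MEM, highest first."""
    procs = []
    for line in top_rows.strip().splitlines():
        cols = line.split()
        # Assumed default columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(cols) < 12 or not cols[0].isdigit():
            continue  # skip headers and anything that is not a process row
        procs.append((" ".join(cols[11:]), float(cols[9]), parse_res_kib(cols[5])))
    return sorted(procs, key=lambda p: p[1], reverse=True)


# A few rows copied from the top output in this issue.
SAMPLE = """
  553 root      20   0       0      0      0 R 100.0  0.0   1:39.41 kswapd0
 2923 root      20   0 2027528  55764      0 S   5.1  1.4   1:22.81 dockerd
20530 10101     20   0 8320340   3.1g      0 R   3.0 82.0   0:11.84 discovery
 3139 root      20   0 1821096  38004      0 S   2.9  1.0   2:25.62 kubelet
"""

if __name__ == "__main__":
    for command, mem_pct, res_kib in rank_by_mem(SAMPLE):
        print(f"{command:<12} {mem_pct:5.1f}% {res_kib / 1024:8.1f} MiB")
```

On the sample rows, discovery sorts first by a wide margin, which matches the observation that removing discovery relieves the node.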

Consistently reproducible on EKS.

Steps to reproduce the bug

  1. Create a fresh 1.22.x EKS cluster (of 2-3 nodes)
  2. Deploy GEE 1.11.20 (with discovery enabled)

Expected Behavior

GEE should run normally and the EKS nodes should remain stable.

Additional Context

I managed to recover the nodes by rolling back to 1.11.16. I have not tested any of the versions between 1.11.16 and 1.11.20.

day0ops avatar Jun 15 '22 23:06 day0ops

Discovery logs are below.

{"level":"info","ts":"2022-06-15T23:32:42.676Z","logger":"fds.v1.event_loop","caller":"v1/setup_event_loop.sk.go:57","msg":"event loop started","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:42.676Z","logger":"uds.v1.event_loop","caller":"v1/setup_event_loop.sk.go:57","msg":"event loop started","version":"undefined"}
I0615 23:32:44.038331       1 request.go:665] Waited for 1.188152658s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/api/v1/namespaces/apps-configuration/secrets?limit=500&resourceVersion=0
{"level":"info","ts":"2022-06-15T23:32:45.883Z","logger":"uds.v1.event_loop.uds.kube-uds","caller":"kubernetes/uds.go:32","msg":"started","version":"undefined","watchns":["petclinic","petstore","apps","apps-configuration","gloo-system","gloo-portal"],"writens":"gloo-system"}
{"level":"info","ts":"2022-06-15T23:32:45.883Z","logger":"uds.v1.event_loop.uds.v1.event_loop","caller":"v1/discovery_event_loop.sk.go:57","msg":"event loop started","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:45.992Z","logger":"fds.v1.event_loop.fds.v1.event_loop","caller":"v1/discovery_event_loop.sk.go:57","msg":"event loop started","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.297Z","logger":"uds.v1.event_loop.uds","caller":"discovery/discovery.go:154","msg":"reconciled upstreams","version":"undefined","discovered_by":"kubernetesplugin","upstreams":16}
{"level":"info","ts":"2022-06-15T23:32:46.321Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:33","msg":"begin sync 6885213428526893516 (18 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.498Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"discovery/discovery.go:154","msg":"reconciled upstreams","version":"undefined","discovered_by":"kubernetesplugin","upstreams":16}
{"level":"info","ts":"2022-06-15T23:32:46.498Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:43","msg":"end sync 6885213428526893516","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.499Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:33","msg":"begin sync 14695981039346656037 (0 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.586Z","logger":"fds.v1.event_loop.fds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:34","msg":"begin sync 6885213428526893516 (18 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.586Z","logger":"fds.v1.event_loop.fds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:44","msg":"end sync 6885213428526893516","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.599Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://rate-limit.gloo-system.svc.cluster.local:18081 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.600Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://extauth.gloo-system.svc.cluster.local:8083 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.607Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://gloo.gloo-system.svc.cluster.local:9977 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.617Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://rate-limit.gloo-system.svc.cluster.local:18081 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:46.644Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"discovery/discovery.go:154","msg":"reconciled upstreams","version":"undefined","discovered_by":"kubernetesplugin","upstreams":16}
{"level":"info","ts":"2022-06-15T23:32:46.645Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:43","msg":"end sync 14695981039346656037","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:47.357Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream rate-limit in namespace gloo-system: name:\"rate-limit\" namespace:\"gloo-system\" exists","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:47.396Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream gloo-system-rate-limit-18081 in namespace gloo-system: name:\"gloo-system-rate-limit-18081\" namespace:\"gloo-system\" exists","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:48.189Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://gloo.gloo-system.svc.cluster.local:9977 discovered as a gRPC service","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:48.905Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream rate-limit in namespace gloo-system: name:\"rate-limit\" namespace:\"gloo-system\" exists","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:48.923Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream gloo-system-rate-limit-18081 in namespace gloo-system: name:\"gloo-system-rate-limit-18081\" namespace:\"gloo-system\" exists","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.353Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:33","msg":"begin sync 2304029478127982007 (18 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.512Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"discovery/discovery.go:154","msg":"reconciled upstreams","version":"undefined","discovered_by":"kubernetesplugin","upstreams":16}
{"level":"info","ts":"2022-06-15T23:32:49.513Z","logger":"uds.v1.event_loop.uds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:43","msg":"end sync 2304029478127982007","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.635Z","logger":"fds.v1.event_loop.fds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:34","msg":"begin sync 2304029478127982007 (18 upstreams)","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.639Z","logger":"fds.v1.event_loop.fds.v1.event_loop.syncer","caller":"syncer/discovery_syncer.go:44","msg":"end sync 2304029478127982007","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.646Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc-graphql/grpc_reflection.go:128","msg":"Unable to create GraphQLApis from gRPC reflection for upstream extauth in namespace gloo-system: error creating schema from gRPC reflection: listing services. are you sure upstream extauth.gloo-system implements reflection?: rpc error: code = Canceled desc = context canceled","version":"undefined"}
{"level":"error","ts":"2022-06-15T23:32:49.689Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.GraphqlSchemaDiscovery: 2 errors occurred:\n\t* context canceled\n\t* context canceled\n\n","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"error","ts":"2022-06-15T23:32:49.661Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.UpstreamFunctionDiscovery: 2 errors occurred:\n\t* context canceled\n\t* context canceled\n\n","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"warn","ts":"2022-06-15T23:32:49.682Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-extauth-8083 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.683Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gateway-proxy-443 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.683Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gateway-proxy-80 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.684Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-9976 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"error","ts":"2022-06-15T23:32:49.684Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.UpstreamFunctionDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"warn","ts":"2022-06-15T23:32:49.702Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-9979 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.705Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-9988 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.706Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-fed-console-10101 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.706Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-fed-console-8081 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.707Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-gloo-fed-console-8090 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.708Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-glooe-grafana-80 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.708Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-glooe-prometheus-kube-state-metrics-8080 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"warn","ts":"2022-06-15T23:32:49.709Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-glooe-prometheus-server-80 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"error","ts":"2022-06-15T23:32:49.709Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.UpstreamFunctionDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"error","ts":"2022-06-15T23:32:49.709Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.GraphqlSchemaDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"warn","ts":"2022-06-15T23:32:49.730Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:134","msg":"unable to discover upstream gloo-system-redis-6379 in namespace gloo-system, err: context canceled","version":"undefined"}
{"level":"error","ts":"2022-06-15T23:32:49.730Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.UpstreamFunctionDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"error","ts":"2022-06-15T23:32:49.730Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"fds/updater.go:353","msg":"Error doing discovery *grpc.GraphqlSchemaDiscovery: context canceled","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/fds.(*updaterUpdater).Run.func3\n\t/go/pkg/mod/github.com/solo-io/[email protected]/projects/discovery/pkg/fds/updater.go:353"}
{"level":"info","ts":"2022-06-15T23:32:49.730Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://rate-limit.gloo-system.svc.cluster.local:18081 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.731Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://extauth.gloo-system.svc.cluster.local:8083 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.750Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://gloo.gloo-system.svc.cluster.local:9977 discovered as a gRPC service","version":"undefined"}
{"level":"info","ts":"2022-06-15T23:32:49.771Z","logger":"fds.v1.event_loop.fds.function-discovery-updater","caller":"grpc/grpc.go:131","msg":"tcp://rate-limit.gloo-system.svc.cluster.local:18081 discovered as a gRPC service","version":"undefined"}

day0ops avatar Jun 15 '22 23:06 day0ops

I was unable to reproduce this issue by following the steps described. After touching base with @pseudonator, it seems that the issue is not a pressing concern.

Kasun may add further detail if the issue recurs or if he has the opportunity to reproduce it again and refine the steps, but in the meantime the issue can be closed.

bewebi avatar Aug 18 '22 14:08 bewebi