anthos-service-mesh-packages

canonical-service-controller-manager is getting OOMKilled

Open · juancarrillo-ai opened this issue 3 years ago · 1 comment

My service was working well until last week. Now the canonical-service-controller-manager pod is crashing with an "OOMKilled" status. After investigating, I found that its manager container had the following resource configuration:

        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 20Mi

These were the log messages I got:

kubectl -n asm-system logs -f canonical-service-controller-manager-5d576ff694-hs7d5 manager
I0813 04:37:07.910464       1 request.go:621] Throttling request took 1.046974522s, request: GET:https://x.x.x.x:443/apis/rbac.authorization.k8s.io/v1beta1?timeout=32s
2021-08-13T04:37:09.015Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
2021-08-13T04:37:09.016Z	INFO	setup	starting manager
I0813 04:37:09.016231       1 leaderelection.go:242] attempting to acquire leader lease  asm-system/8f5e826b.cloud.google.com...


kubectl -n asm-system get pod -w
NAME                                                    READY   STATUS      RESTARTS   AGE
canonical-service-controller-manager-5d576ff694-hs7d5   1/2     OOMKilled   1          30s 

I increased the memory limit from 100Mi to 128Mi, and now it's working properly:

        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 100m
            memory: 20Mi
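If the manifests are managed with kustomize, one way to keep the higher limit from being lost the next time they are re-applied is a strategic-merge patch along these lines. This is a minimal sketch, not an official ASM mechanism: it assumes the Deployment is named `canonical-service-controller-manager` (inferred from the pod name) in the `asm-system` namespace, and the file name `manager-memory-patch.yaml` is hypothetical. The file would be referenced from a kustomization's `patches` list.

```yaml
# manager-memory-patch.yaml (hypothetical file name)
# Assumes a Deployment named canonical-service-controller-manager in asm-system.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: canonical-service-controller-manager
  namespace: asm-system
spec:
  template:
    spec:
      containers:
        - name: manager        # strategic merge matches this entry by container name
          resources:
            limits:
              memory: 128Mi    # raised from 100Mi; other resource fields are left as-is
```

Because strategic merge matches `containers` entries by `name` and merges the `limits` map key by key, only the manager container's memory limit changes; its CPU limit and requests keep their existing values.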

These were the logs after the change:

kubectl -n asm-system logs -f canonical-service-controller-manager-664f98b597-2n4l6 manager
I0813 04:39:38.437338       1 request.go:621] Throttling request took 1.000893297s, request: GET:https://x.x.x.x:443/apis/nodemanagement.gke.io/v1alpha1?timeout=32s
2021-08-13T04:39:39.442Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
2021-08-13T04:39:39.443Z	INFO	setup	starting manager
I0813 04:39:39.443236       1 leaderelection.go:242] attempting to acquire leader lease  asm-system/8f5e826b.cloud.google.com...
2021-08-13T04:39:39.443Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
I0813 04:39:56.650762       1 leaderelection.go:252] successfully acquired lease asm-system/8f5e826b.cloud.google.com

Any idea why this is happening?

juancarrillo-ai · Aug 13 '21

It took a while to sort out some miscellaneous issues, but we think the fix is in #882, and it will go out with the next 1.10 release. If you don't need it backported to an earlier version, we can close this issue.

zerobfd · Aug 20 '21