cloudflare-operator
Operator Deployment OOMKill.
The memory limits might be a little too low. I wonder if anyone else is seeing the same with this version. I'm not doing anything fancy.
Version: v0.10.0
I needed to patch them to 400Mi (not a precise number; I just picked one):
[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/1/resources/limits/memory",
    "value": "400Mi"
  }
]
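Rendered, that op just bumps the memory limit on the pod's second container; a rough sketch of the result, assuming containers/1 is the manager (the pod runs kube-rbac-proxy and manager, as the logs further down show):

containers:
  - name: kube-rbac-proxy
    # unchanged
  - name: manager              # containers/1 in the patch path above (assumed)
    resources:
      limits:
        memory: 400Mi          # raised from the shipped default by the patch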
Thanks!
Edit: Removed an incorrect code line reference.
https://github.com/adyanth/cloudflare-operator/blob/c38e0cc14dceef41729f8f9852c5e3743d392bff/controllers/reconciler.go#L491
I am assuming you mean the Cloudflare tunnel deployment and not the operator itself?
Mine seems to be running fine with the same limits, may I ask which version of cloudflared are you using?
I am assuming you mean the Cloudflare tunnel deployment and not the operator itself?
No, the operator.
This is the snippet from my kustomization.yaml:
patches:
- path: patches/cloudflare-operator-controller-manager-resources.json
target:
group: apps
version: v1
kind: Deployment
name: cloudflare-operator-controller-manager
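If you would rather not keep a separate patch file, newer kustomize versions also accept the JSON 6902 ops inline under patch; something like this should be equivalent (same target and op as the file above):

patches:
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/1/resources/limits/memory
        value: 400Mi
    target:
      group: apps
      version: v1
      kind: Deployment
      name: cloudflare-operator-controller-manager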
I'll get a log capture. Not at the system right now.
Okay, I see the confusion. I referenced the tunnel deployment code in the original post.
This is what I meant to reference: https://github.com/adyanth/cloudflare-operator/blob/c38e0cc14dceef41729f8f9852c5e3743d392bff/config/manager/manager.yaml#L51
Pod logs:
manager I0531 06:34:20.420863 1 request.go:601] Waited for 1.002272575s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/batch/v1?timeout=32s
manager 1.6855148613748305e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
kube-rbac-proxy I0531 06:31:02.064905 1 main.go:190] Valid token audiences:
kube-rbac-proxy I0531 06:31:02.065069 1 main.go:262] Generating self signed cert as no cert is provided
kube-rbac-proxy I0531 06:31:02.628100 1 main.go:311] Starting TCP socket on 0.0.0.0:8443
kube-rbac-proxy I0531 06:31:02.628691 1 main.go:318] Listening securely on 0.0.0.0:8443
manager 1.6855148613757348e+09 INFO setup starting manager
manager 1.685514861376296e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
manager 1.6855148613763669e+09 INFO Starting server {"kind": "health probe", "addr": ":8081"}
manager I0531 06:34:21.376443 1 leaderelection.go:248] attempting to acquire leader lease cloudflare-operator-system/9f193cf8.cfargotunnel.com...
manager I0531 06:34:37.877037 1 leaderelection.go:258] successfully acquired lease cloudflare-operator-system/9f193cf8.cfargotunnel.com
manager 1.6855148778770685e+09 DEBUG events cloudflare-operator-controller-manager-548fc568dc-cfs8c_033a82a3-9de3-4d13-90d8-123523d8bed3 became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"cloudflare-operator-system","name":"9f193cf8.cfargotunnel.com","uid":"b5e6b291-a2a9-4593-aac0-d5a4dc43119b","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1514017449"}, "reason": "LeaderElection"}
manager 1.6855148778774395e+09 INFO Starting EventSource {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1alpha1.Tunnel"}
manager 1.6855148778775022e+09 INFO Starting EventSource {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.ConfigMap"}
manager 1.685514877877516e+09 INFO Starting EventSource {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.Secret"}
manager 1.6855148778775249e+09 INFO Starting EventSource {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.Deployment"}
manager 1.685514877877531e+09 INFO Starting Controller {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel"}
manager 1.6855148778777483e+09 INFO Starting EventSource {"controller": "tunnelbinding", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "TunnelBinding", "source": "kind source: *v1alpha1.TunnelBinding"}
manager 1.6855148778777907e+09 INFO Starting Controller {"controller": "tunnelbinding", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "TunnelBinding"}
manager 1.6855148778777394e+09 INFO Starting EventSource {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1alpha1.ClusterTunnel"}
manager 1.6855148778778672e+09 INFO Starting EventSource {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.ConfigMap"}
manager 1.6855148778778884e+09 INFO Starting EventSource {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.Secret"}
manager 1.6855148778779e+09 INFO Starting EventSource {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.Deployment"}
manager 1.6855148778779075e+09 INFO Starting Controller {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel"}
Stream closed EOF for cloudflare-operator-system/cloudflare-operator-controller-manager-548fc568dc-cfs8c (manager)
It seems doubling the limit to 200Mi will get the container to start successfully.
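For anyone else hitting this, kustomize also accepts the patch file in YAML form; the single op that got mine starting again is just (same assumption as above that containers/1 is the manager):

- op: replace
  path: /spec/template/spec/containers/1/resources/limits/memory
  value: 200Mi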
Interesting, the manager hasn't been updated since the last release, v0.10.0. I'm surprised that container suddenly needs more memory.
It only takes about 26MiB in mine. Would you mind sharing a bit more detail about your setup (approximately how many tunnels and services it is handling)? Is it by any chance running on arm rather than x64? I have not validated arm myself.
I am interested to see whether this is a runtime thing that scales with usage, which I should probably call out in the README somewhere, since the deployments I have seen so far never go anywhere near 100MiB.
Well, this is interesting. Two EKS clusters, different versions, both AL2:
1.24.13 : 5.4.241-150.347.amzn2.x86_64 - lower mem
1.23.17 : 5.10.178-162.673.amzn2.x86_64 - higher mem
I wonder if the kube client discovery cache is bloating the memory.
I don't have an excessive number of CRDs in either. I cleaned up the 1.24 cluster before the image above.
I'll clean up the 1.23 cluster tomorrow and see what happens.
That does not seem right, and I cannot think of a way to debug why this one is taking more memory (other than profiling it, which I am not sure is worth the effort, haha), since the containers themselves do not have any tools for you to exec into. The 50 MB sounds about right. I do not think the kube discovery cache has anything to do with this, but sure, let me know. Mine used to be on k8s 1.22 and is now on 1.26, so the version should not be an issue.
I did get an alloc flame graph with the krew flame plugin. GitHub does a static rendering, so the 15-minute one is sort of useless when posted here.
1m: (flame graph screenshot)
15m: (flame graph screenshot)
I guess I have something wrong with that cluster. I'll roll this out to the rest and compare.
FWIW, there's just a single ClusterTunnel in my deployment. The overlays only change the name of the tunnel.
Is that an alloc count graph or a byte graph? Either way, all I see are the k8s libraries used by the controller, nothing from this project's own code. The widest call, x/net/http2 -> compress/gzip, looks like a lot of HTTP requests (or a large body of them, depending on which graph this is) to the manager pod. If health checks or something similar are misconfigured (sending either a lot of requests or requests with large content), that could be a reason too.
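If you want to rule the probes out, this is roughly what the usual kubebuilder-scaffolded probes look like for a manager exposing its health endpoint on :8081 (the port matches your log above; the paths and periods here are the common scaffold defaults, so treat them as assumptions and compare against the actual Deployment in your cluster):

livenessProbe:
  httpGet:
    path: /healthz    # standard controller-runtime health endpoint (assumed)
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz     # standard controller-runtime readiness endpoint (assumed)
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10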
FWIW, I'm seeing this behaviour on an OpenShift 4.14 (k8s 1.27) cluster:
After patching the limit, memory usage hovers around 150 MB: