cloudflare-operator
Operator Deployment OOMKill.
The memory limits might be a little too low. I wonder if anyone else is seeing the same with this version. I'm not doing anything fancy.
Version: v0.10.0
I needed to patch them to 400Mi (not a precise number; I just picked one):
[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/1/resources/limits/memory",
    "value": "400Mi"
  }
]
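Rendered, that op just bumps the memory limit on the pod's second container; a rough sketch of the result, assuming containers/1 is the manager (the pod runs kube-rbac-proxy and manager, as the logs further down show):

containers:
  - name: kube-rbac-proxy
    # unchanged
  - name: manager              # containers/1 in the patch path above (assumed)
    resources:
      limits:
        memory: 400Mi          # raised from the shipped default by the patch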
Thanks!
Edit: Removed an incorrect code line reference.
https://github.com/adyanth/cloudflare-operator/blob/c38e0cc14dceef41729f8f9852c5e3743d392bff/controllers/reconciler.go#L491
I am assuming you mean the Cloudflare tunnel deployment and not the operator itself?
Mine seems to be running fine with the same limits, may I ask which version of cloudflared are you using?
I am assuming you mean the Cloudflare tunnel deployment and not the operator itself?
No, the operator.
This is the snippet from my kustomization.yaml:
patches:
- path: patches/cloudflare-operator-controller-manager-resources.json
target:
group: apps
version: v1
kind: Deployment
name: cloudflare-operator-controller-manager
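If you would rather not keep a separate patch file, newer kustomize versions also accept the JSON 6902 ops inline under patch; something like this should be equivalent (same target and op as the file above):

patches:
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/1/resources/limits/memory
        value: 400Mi
    target:
      group: apps
      version: v1
      kind: Deployment
      name: cloudflare-operator-controller-manager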
I'll get a log capture. Not at the system right now.
Okay, I see the confusion. I referenced the tunnel deployment code in the original post.
This is what I meant to reference: https://github.com/adyanth/cloudflare-operator/blob/c38e0cc14dceef41729f8f9852c5e3743d392bff/config/manager/manager.yaml#L51
Pod logs:
manager I0531 06:34:20.420863 1 request.go:601] Waited for 1.002272575s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/batch/v1?timeout=32s
manager 1.6855148613748305e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
kube-rbac-proxy I0531 06:31:02.064905 1 main.go:190] Valid token audiences:
kube-rbac-proxy I0531 06:31:02.065069 1 main.go:262] Generating self signed cert as no cert is provided
kube-rbac-proxy I0531 06:31:02.628100 1 main.go:311] Starting TCP socket on 0.0.0.0:8443
kube-rbac-proxy I0531 06:31:02.628691 1 main.go:318] Listening securely on 0.0.0.0:8443
manager 1.6855148613757348e+09 INFO setup starting manager
manager 1.685514861376296e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
manager 1.6855148613763669e+09 INFO Starting server {"kind": "health probe", "addr": ":8081"}
manager I0531 06:34:21.376443 1 leaderelection.go:248] attempting to acquire leader lease cloudflare-operator-system/9f193cf8.cfargotunnel.com...
manager I0531 06:34:37.877037 1 leaderelection.go:258] successfully acquired lease cloudflare-operator-system/9f193cf8.cfargotunnel.com
manager 1.6855148778770685e+09 DEBUG events cloudflare-operator-controller-manager-548fc568dc-cfs8c_033a82a3-9de3-4d13-90d8-123523d8bed3 became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"cloudflare-operator-system","name":"9f193cf8.cfargotunnel.com","uid":"b5e6b291-a2a9-4593-aac0-d5a4dc43119b","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1514017449"}, "reason": "LeaderElection"}
manager 1.6855148778774395e+09 INFO Starting EventSource {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1alpha1.Tunnel"}
manager 1.6855148778775022e+09 INFO Starting EventSource {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.ConfigMap"}
manager 1.685514877877516e+09 INFO Starting EventSource {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.Secret"}
manager 1.6855148778775249e+09 INFO Starting EventSource {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel", "source": "kind source: *v1.Deployment"}
manager 1.685514877877531e+09 INFO Starting Controller {"controller": "tunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "Tunnel"}
manager 1.6855148778777483e+09 INFO Starting EventSource {"controller": "tunnelbinding", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "TunnelBinding", "source": "kind source: *v1alpha1.TunnelBinding"}
manager 1.6855148778777907e+09 INFO Starting Controller {"controller": "tunnelbinding", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "TunnelBinding"}
manager 1.6855148778777394e+09 INFO Starting EventSource {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1alpha1.ClusterTunnel"}
manager 1.6855148778778672e+09 INFO Starting EventSource {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.ConfigMap"}
manager 1.6855148778778884e+09 INFO Starting EventSource {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.Secret"}
manager 1.6855148778779e+09 INFO Starting EventSource {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel", "source": "kind source: *v1.Deployment"}
manager 1.6855148778779075e+09 INFO Starting Controller {"controller": "clustertunnel", "controllerGroup": "networking.cfargotunnel.com", "controllerKind": "ClusterTunnel"}
Stream closed EOF for cloudflare-operator-system/cloudflare-operator-controller-manager-548fc568dc-cfs8c (manager)
It seems doubling the limit to 200Mi will get the container to start successfully.
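For anyone else hitting this, kustomize also accepts the patch file in YAML form; the single op that got mine starting again is just (same assumption as above that containers/1 is the manager):

- op: replace
  path: /spec/template/spec/containers/1/resources/limits/memory
  value: 200Mi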
Interesting, the manager hasn't been updated since the last release, v0.10.0. I'm surprised that container suddenly needs more memory.
It only takes about 26MiB in mine. Would you mind sharing a bit more detail about your setup (approximately how many tunnels and services it is handling)? Is it by any chance running on arm rather than x64? I have not validated arm myself.
I am interested to see whether this is a runtime thing that scales with usage, which I should probably call out in the README somewhere, since the deployments I have seen so far never go anywhere near 100MiB.
Well, this is interesting. Two EKS clusters, different versions, both AL2:
1.24.13 : 5.4.241-150.347.amzn2.x86_64 - lower mem
1.23.17 : 5.10.178-162.673.amzn2.x86_64 - higher mem
I wonder if the kube client discovery cache is bloating the memory.
I don't have an excessive number of CRDs in either. I cleaned up the 1.24 cluster before the image above.
I'll clean up the 1.23 cluster tomorrow and see what happens.
That does not seem right, and I cannot think of a way to debug why this one is taking more memory (other than profiling it, which I am not sure is worth the effort, haha), since the containers themselves do not have any tools for you to exec into. The 50 MB sounds about right. I do not think the kube discovery cache has anything to do with this, but sure, let me know. Mine used to be on k8s 1.22 and is now on 1.26, so the version should not be an issue.
I did get an alloc flame graph with the krew flame plugin. GitHub does a static rendering, so the 15-minute one is sort of useless when posted here.
1m: (flame graph screenshot)
15m: (flame graph screenshot)
I guess I have something wrong with that cluster. I'll roll this out to the rest and compare.
FWIW, there's just a single ClusterTunnel in my deployment. The overlays only change the name of the tunnel.
Is that an alloc count graph or a byte graph? Either way, all I see are the k8s libraries used by the controller, nothing from this project's own code. The widest call, x/net/http2 -> compress/gzip, looks like a lot of HTTP requests (or a large body of them, depending on which graph this is) to the manager pod. If health checks or something similar are misconfigured (sending either a lot of requests or requests with large content), that could be a reason too.
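If you want to rule the probes out, this is roughly what the usual kubebuilder-scaffolded probes look like for a manager exposing its health endpoint on :8081 (the port matches your log above; the paths and periods here are the common scaffold defaults, so treat them as assumptions and compare against the actual Deployment in your cluster):

livenessProbe:
  httpGet:
    path: /healthz    # standard controller-runtime health endpoint (assumed)
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz     # standard controller-runtime readiness endpoint (assumed)
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10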
FWIW, I'm seeing this behaviour on an OpenShift 4.14 (k8s 1.27) cluster:
After patching the limit, memory usage hovers around 150 MB: