Autoscaler slow memory leak
What version of Knative?
1.16.0
Expected Behavior
The autoscaler is able to GC etc. and avoid OOMing
Actual Behavior
There is a visible leak in the autoscaler in our environment, occurring roughly every ~10 hours. This creates a constant upward trend in memory use.
Although there is some attempt to GC and reduce this as the memory limit is reached, it's never quite enough, and the autoscaler does eventually OOM and restart.
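For reference, one way to double-check that the restarts really are OOM kills (rather than, say, liveness failures) is to look at the container's last terminated state. A rough sketch, assuming the autoscaler pods carry the standard app=autoscaler label:

# Print each autoscaler pod and the reason its container last terminated (expect "OOMKilled")
kubectl -n knative-serving get pods -l app=autoscaler \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'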
About our environment:
- GKE Kubernetes 1.30.6
- knative 1.16, istio 1.23.3 and net-istio 1.16
- The autoscaler is given a request and limit of 2 CPU and 2Gi of memory (Guaranteed QoS)
- The autoscaler is configured in HA mode: we have it scaled so that 3 replicas are running at all times
- Notably, when the primary autoscaler OOMs, we experience a significant spike in request errors, because this seems to negatively affect the activator - which is the main reason we're interested in solving this
- The leak only seems to affect the primary/leader
- We typically have about 200-300 different knative services. Most of them have on average ~3 revisions at any one time
- The graph above is from when the cluster was almost entirely idle. For most of the period shown in that graph, there were no service pods running at all
- We've added GOMEMLIMIT set to 1.7GiB to see if this helps keep it under control (roughly as in the sketch after this list), but it has no effect: the autoscaler stays alive longer but still eventually OOMs
- Nothing in particular happens in our environment at a 10-hour frequency (we have jobs and new service creation+deletion occurring on 24-hour cycles, typically at 8AM and midnight)
- The same issue was observed in knative 1.9.2
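For completeness, the GOMEMLIMIT change is just an environment variable on the autoscaler Deployment. A minimal sketch of that kind of change, assuming a plain kubectl set env is acceptable (GOMEMLIMIT wants an integer with a unit suffix, so 1.7GiB becomes roughly 1740MiB):

# Set a Go runtime soft memory limit of ~1.7GiB on the autoscaler Deployment
kubectl -n knative-serving set env deployment/autoscaler GOMEMLIMIT=1740MiB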
Steps to Reproduce the Problem
What would be useful to repro/diagnose this? Is the minimum a debug-level log from the autoscaler over the ~10 hours in which the issue occurs?
Hi @DavidR91,
Could you show more about the pod status (kubectl describe pod ...)? What is the behavior of the istio sidecar?
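For example, something along these lines would do, assuming the pods carry the standard app=autoscaler label:

# Full pod status for the autoscaler replicas, including restart reasons and events
kubectl describe pod -n knative-serving -l app=autoscaler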
In the past there was a similar issue that was coming from the istio side.
What would be useful to repro/diagnose this?
Could you provide the logs of the autoscaler? Could you take a heap dump during the time that the issue occurs?
You can enable profiling as follows.
On one terminal:
cat <<EOF | oc apply -f -
apiVersion: v1
data:
  profiling.enable: "true"
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
EOF
kubectl port-forward <pod-name> -n knative-serving 8008:8008
On another terminal:
$ go tool pprof http://localhost:8008/debug/pprof/heap
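It also helps to save the raw profiles to disk so they can be diffed later. A small sketch, assuming the port-forward above is still running:

# Grab a timestamped heap snapshot over the forwarded port; repeat periodically so snapshots can be compared
curl -s -o heap-$(date +%Y%m%d-%H%M%S).pb.gz http://localhost:8008/debug/pprof/heap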
In the past there was https://github.com/knative/serving/issues/8761, which was coming from the istio side.
We are just using istio's gateways; we don't actually use the sidecar or any sidecar injection at all (we just have VirtualServices pointing at the knative gateway with a rewritten authority etc. for each service), so I don't think that issue is connected.
Getting debug logs is a bit more work, so I will follow up with those - but I have managed to enable profiling and get the pprof dumps.
I've attached two dumps, taken only a few minutes apart; by the latter, memory use had grown by 1-2%. These were taken while the system was under load and the autoscaler was already at ~93% of its memory limit, so it was very close to OOMing.
pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
and a PNG version of the first dump for convenience: profile001
I notice there is a lot of exporter and metric stuff here, and we do have knative configured to send to OTel via OpenCensus - is it enough of a presence in these dumps to suggest that OTel integration is the cause?
is it enough of a presence in these dumps to suggest that OTel integration is the cause?
It does not seem to be so, even if it uses a lot of the allocated memory. I did a diff (go tool pprof -base prof1 prof2) on the profiles you posted. Here is the output:
Same if you pass inuse_objects:
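Roughly, the commands look like this, using the file names you attached (-inuse_objects switches the sample index from bytes to object counts):

# Diff the second snapshot against the first by in-use bytes
go tool pprof -inuse_space -base pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz \
    pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz

# Same diff, but by number of live objects
go tool pprof -inuse_objects -base pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz \
    pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz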
The biggest increase is ~40MB, in the streamwatcher. Could you take multiple snapshots and check the diff during no-load periods as well? Maybe it is related to https://github.com/kubernetes/kubernetes/issues/103789#issuecomment-1867588547? Do you have a lot of pods coming up during load times (the autoscaler has a filtered informer for service pods)?
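If it helps, a quick way to get a feel for the churn the autoscaler's filtered pod informer sees is to count revision pods over time; a sketch assuming the standard serving.knative.dev/revision label on the data-plane pods:

# Count pods carrying the Knative revision label once a minute; compare busy vs. idle periods
watch -n 60 'kubectl get pods -A -l serving.knative.dev/revision --no-headers | wc -l'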
Btw, the default resync period is ~10h; see https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/controller/controller.go#L54. Is your cluster a large one? Is it slow?
@DavidR91 hi, any updates on this one?
@DavidR91 just following up again
Closing this out due to lack of input.
Also, we've migrated the OpenCensus libs to OTel in main, so if you think OpenCensus was the culprit, it might be worth trying again after the next release.