
Autoscaler slow memory leak

DavidR91 opened this issue on Nov 21, 2024 · 6 comments

What version of Knative?

1.16.0

Expected Behavior

The autoscaler is able to garbage-collect and stay within its memory limit without OOMing.

Actual Behavior

There is a visible memory leak in the autoscaler in our environment, recurring roughly every 10 hours. This creates a constant upward trend in memory use.

Although there is some attempt to GC and reduce usage as the memory limit is approached, it is never quite enough, and the autoscaler does eventually OOM and restart.

[Attached graph: autoscaler memory usage trending steadily upward until the OOM/restart]

About our environment:

  • GKE Kubernetes 1.30.6
  • knative 1.16, istio 1.23.3 and net-istio 1.16
  • The autoscaler is given request and limit of 2 CPU and 2Gi of memory (Guaranteed QoS)
  • The autoscaler is configured in HA mode: we have it scaled so there are 3 replicas running at all times
    • Notably, when the primary autoscaler OOMs, we experience a significant spike in request errors, because this seems to negatively affect the activator; this is why we're particularly interested in solving this
    • The leak only seems to affect the primary/leader
  • We typically have about 200-300 different knative services. Most of them have on average ~3 revisions at any one time
    • The graph above is from a time when the cluster is almost entirely idle; for most of the period shown, there are no service pods running at all
  • We've set GOMEMLIMIT to 1.7GiB to see if this helps keep memory under control, but it has little effect (the autoscaler stays alive longer but still eventually OOMs); one way this can be set is sketched after this list
  • Nothing in particular happens in our environment at a 10 hour frequency (we have jobs and new service creation+deletion occurring on 24hr cycles, typically 8AM and midnight)
  • The same issue was observed in knative 1.9.2
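For reference, a minimal sketch of one way to set the GOMEMLIMIT mentioned in the list above. The exact value and the kubectl set env approach are illustrative assumptions, not necessarily how the deployment is actually managed:

# Illustrative only: set a soft Go memory limit of roughly 1.7GiB on the autoscaler.
# In practice this would more likely be set in the deployment manifest or via the
# knative operator rather than with an ad-hoc command.
kubectl -n knative-serving set env deployment/autoscaler GOMEMLIMIT=1740MiB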

Steps to Reproduce the Problem

What would be useful to repro/diagnose this? Is the minimum a debug-level log from the autoscaler over the ~10 hours in which the issue occurs?

DavidR91 · Nov 21, 2024

Hi @DavidR91,

Could you show more about the pod status (kubectl describe pod ...)? What is the behavior of the istio sidecar? In the past there was a similar issue that was coming from the istio side.
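For example, something along these lines can confirm whether an istio-proxy sidecar is injected into the autoscaler pods (this assumes the default app=autoscaler label from the knative-serving install):

# Check pod status/events and list container names; an injected sidecar would
# show up as an extra "istio-proxy" container next to "autoscaler".
kubectl -n knative-serving describe pod -l app=autoscaler
kubectl -n knative-serving get pod -l app=autoscaler \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'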

What would be useful to repro/diagnose this?

Could you provide the logs of the autoscaler? Could you take a heap dump during the time that the issue occurs?

You can enable profiling as follows.

On one terminal:

cat <<EOF | kubectl apply -f -
apiVersion: v1
data:
  profiling.enable: "true"
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
EOF

kubectl port-forward <pod-name> -n knative-serving 8008:8008

On another terminal:

go tool pprof http://localhost:8008/debug/pprof/heap
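A possible workflow for capturing comparable snapshots, assuming the port-forward above is running (file names here are arbitrary):

# Save two heap snapshots some time apart so they can be diffed offline.
curl -s -o heap-before.pb.gz http://localhost:8008/debug/pprof/heap
# ...wait while memory grows, e.g. an hour or two...
curl -s -o heap-after.pb.gz http://localhost:8008/debug/pprof/heap
# Show only what changed between the two snapshots.
go tool pprof -base heap-before.pb.gz heap-after.pb.gz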

skonto · Nov 26, 2024

In the past there was https://github.com/knative/serving/issues/8761 that was coming from the istio side.

We are only using Istio's gateways; we don't use the sidecar or any sidecar injection at all (we just have VirtualServices pointing at the knative gateway with a rewritten authority, etc., for each service), so I don't think that issue is connected.
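For anyone unfamiliar with that pattern, a rough sketch of what such a VirtualService can look like. All names, hosts, and namespaces below are hypothetical placeholders, and the gateway/destination values assume a default net-istio install exposing knative-local-gateway; this is illustrative, not the exact manifests in use here:

cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-service-route      # hypothetical
  namespace: example-namespace     # hypothetical
spec:
  hosts:
    - example.mycompany.com        # hypothetical external host
  gateways:
    - istio-system/example-gateway # hypothetical Istio Gateway
  http:
    - rewrite:
        # Rewrite the authority to the Knative service's cluster-local host.
        authority: example-service.example-namespace.svc.cluster.local
      route:
        - destination:
            host: knative-local-gateway.istio-system.svc.cluster.local
            port:
              number: 80
EOF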

Getting debug logs is a bit more work, so I will follow up with those, but I have managed to enable profiling and get the pprof dumps.

I've attached two dumps taken only a few minutes apart; by the latter, memory use had grown by 1-2%. These were taken while the system was under load, with the autoscaler already at ~93% of its memory limit, so it was very close to OOMing.

pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz

A PNG version of the first dump is attached for convenience: [attached image: profile001]

I notice there is a lot of exporter and metric material here, and we do have knative configured to send metrics to OTel via OpenCensus. Is it enough of a presence in these dumps to suggest that OTel integration is the cause?
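For what it's worth, the metrics backend in use can be checked in the same config-observability ConfigMap used above for profiling. The keys shown in the comments are the standard knative observability keys; the values are only an illustration of what an OpenCensus/OTel setup typically looks like:

# Inspect the observability config; with an OpenCensus exporter the data section
# typically contains something like:
#   metrics.backend-destination: opencensus
#   metrics.opencensus-address: <otel-collector-address>:<port>
kubectl -n knative-serving get configmap config-observability -o yaml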

DavidR91 · Nov 26, 2024

is it enough of a presence in these dumps to suggest that OTel integration is the cause?

It does not seem to be, even though it accounts for a lot of the allocated memory. I did a diff (go tool pprof -base prof1 prof2) on the profiles you posted. Here is the output:

[Attached screenshot: pprof diff output]

The same applies if you pass inuse_objects:

[Attached screenshot: pprof diff output with inuse_objects]

The biggest increase is ~40MB in the streamwatcher. Could you take multiple snapshots and check the diff during no-load periods as well? Maybe it is related to https://github.com/kubernetes/kubernetes/issues/103789#issuecomment-1867588547? Do you have a lot of pods coming up during load times (the autoscaler has a filtered informer for service pods)?
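One way to dig further into that growth is to restrict the diff to the watch machinery. The StreamWatcher regex is an assumption about the frame names visible in the diff; the file names are the two profiles attached above:

# Diff the two attached heap profiles, keeping only frames matching StreamWatcher,
# to see how much of the growth sits in the client-go/apimachinery watch path.
go tool pprof -inuse_space -focus=StreamWatcher -top \
  -base pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz \
  pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz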

skonto · Nov 27, 2024

By the way, the default resync period is ~10h; see https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/controller/controller.go#L54. Is your cluster a large one? Is it slow?
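A simple way to check whether the memory steps line up with that resync interval is to sample the autoscaler's usage over time (this assumes metrics-server is available, as on GKE, and the default app=autoscaler label):

# Sample the autoscaler's memory every 10 minutes and compare the timestamps of
# the jumps against the ~10h resync period.
watch -n 600 'date; kubectl top pod -n knative-serving -l app=autoscaler'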

skonto · Nov 27, 2024

@DavidR91 hi, any updates on this one?

skonto · Feb 10, 2025

@DavidR91 just following up again

dprotaso · Apr 18, 2025

Closing this out due to lack of input.

Also, we've migrated the OpenCensus libs to OTel in main, so if you think OpenCensus was the culprit it might be worth trying again after the next release.

dprotaso · Jul 11, 2025