
Memory leak in proxy?

Open · snickell opened this issue 2 years ago · 41 comments

Aloha, we've been seeing a pattern of growing daily memory usage (followed by increasing sluggishness, then non-responsiveness above around 1-2 GB of RAM) in the 'proxy' pod: [screenshot: proxy pod memory usage climbing day over day]

The different colors are fresh proxy reboots, which have been required to keep the cluster running.

[screenshot: "Screen Shot 2022-04-07 at 5 01 00 AM", memory growth between proxy restarts]

-Seth

snickell avatar Apr 07 '22 15:04 snickell

Sorry, clipped the units: [screenshot: same graph with the y-axis units visible]

The pattern is nearly identical on the other cluster.

snickell avatar Apr 07 '22 15:04 snickell

We're running z2jh chart version 1.1.3-n354.h751bc313 (I believe it was the latest as of ~3 weeks ago), but as you can see, this pattern predates this chart version by quite a bit.

snickell avatar Apr 07 '22 15:04 snickell

We start seeing serious performance problems at about 1.5 GB, which is suspiciously close to the default heap limit for Node 🤔 So maybe it's a memory leak that, once it hits the heap limit, cascade-fails into some sort of garbage-collection nightmare? Or?

snickell avatar Apr 07 '22 15:04 snickell
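
For reference, the ~1.5 GB figure can be checked directly: older 64-bit Node releases default the V8 old-space limit to roughly 1.5 GB. A minimal sketch, assuming a z2jh-style deployment named `proxy` in a `jhub` namespace (both names are illustrative):

```python
# Sketch: read V8 heap statistics inside the proxy container to confirm the
# effective heap limit. "jhub" and "deploy/proxy" are assumed names; adjust
# them to your namespace/deployment.
import json
import subprocess

cmd = [
    "kubectl", "exec", "-n", "jhub", "deploy/proxy", "--",
    "node", "-e",
    "console.log(JSON.stringify(require('v8').getHeapStatistics()))",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
stats = json.loads(out)
print(f"heap_size_limit: {stats['heap_size_limit'] / 1024 ** 3:.2f} GiB")
```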

Do you happen to know if the memory increases are correlated with particular events, e.g. a user starting a new server, or connecting to a particular service?

manics avatar Apr 07 '22 16:04 manics

No, but I'm looking into it. My vague suspicion: websockets? We push them pretty hard, e.g. many users are streaming VNC over websockets. Is there a log mode with useful stats about, e.g., the routing table?

snickell avatar Apr 07 '22 19:04 snickell
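
One way to get at the routing-table question: CHP exposes a small REST API, and dumping GET /api/routes periodically would show whether routes keep accumulating. A minimal sketch in Python, assuming the z2jh proxy-api service on port 8001 and the token from the proxy pod's CONFIGPROXY_AUTH_TOKEN environment variable (there is also a --log-level=debug flag for chattier proxy logs, though it doesn't report per-route memory):

```python
# Sketch: dump configurable-http-proxy's routing table via its REST API.
# The endpoint and token location are assumptions based on a typical z2jh setup.
import os
import requests

resp = requests.get(
    "http://proxy-api:8001/api/routes",
    headers={"Authorization": f"token {os.environ['CONFIGPROXY_AUTH_TOKEN']}"},
    timeout=10,
)
resp.raise_for_status()
routes = resp.json()
print(f"{len(routes)} routes")
for path, info in sorted(routes.items()):
    print(path, "->", info.get("target"))
```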

OK, so a further development: since high RAM usage correlated with performance problems, I added a k8s memory limit to the pod, thinking it would get killed when it passed 1.4 GB of RAM and reboot fresh, a decent-ish workaround for now.

Here's what happened instead: [screenshot: memory usage no longer climbing at the usual rate after the limit was added]

Note that there's one other unusual thing here: I kubectl exec'ed several 200 MB "RAM balloon" processes to try to push it over the edge faster for testing. They clearly didn't work, haha, and I doubt that's why it's not growing at the normal leakage rate, but it's worth mentioning.

Did something else change, or did adding a k8s memory limit suddenly change the behavior?

snickell avatar Apr 10 '22 14:04 snickell
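
As a quick sanity check that the limit landed where expected, something like the following (kubernetes Python client; the `jhub` namespace and `component=proxy` label are assumptions based on z2jh defaults) reports the container limits and the resulting QoS class, which also affects eviction ordering:

```python
# Sketch: inspect the proxy pod's effective memory limit/requests and QoS class.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()
pod = v1.list_namespaced_pod("jhub", label_selector="component=proxy").items[0]
print("QoS class:", pod.status.qos_class)
for c in pod.spec.containers:
    print(c.name, "limits:", c.resources.limits, "requests:", c.resources.requests)
```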

(Note: this otherwise consistent memory growth pattern goes back to January, across a number of z2jh chart upgrades since then... this is weird.)

snickell avatar Apr 10 '22 14:04 snickell

Hmmm, so when the pod restarts, is it because it has been evicted from a node, or because the process restarted within the container?

Eviction from a node happens based on logic external to the container, while memory management within the container follows more internal logic, which setting limits can influence by making clear the process must not exceed them.

I need to learn more about how the OOM killer operates within the container vs. via the kubelet, but perhaps by setting a memory limit you ended up helping it avoid getting evicted. Hmmm..

consideRatio avatar Apr 11 '22 05:04 consideRatio
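
One way to tell the two cases apart after the fact is to read the pod's status: an in-container OOM kill shows up as `lastState.terminated.reason == "OOMKilled"` on the container, while a kubelet eviction shows up as a pod-level reason of "Evicted". A sketch with the kubernetes Python client (namespace and label selector assumed as above):

```python
# Sketch: distinguish an OOM-killed container restart from a node eviction.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pod = v1.list_namespaced_pod("jhub", label_selector="component=proxy").items[0]
for status in pod.status.container_statuses or []:
    last = status.last_state.terminated
    if last:
        # "OOMKilled" means the kernel killed the process for exceeding the
        # container's memory limit; the pod object stays put and restarts.
        print(status.name, "last terminated:", last.reason, "exit code:", last.exit_code)
# An eviction is recorded on the pod itself (and in the events stream).
print("pod-level reason:", pod.status.reason)
```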

@snickell was what you observed related to load at all? For example, do you see this behavior on weekend days? We're currently experiencing relatively high load on our deployment, and I observe something similar: memory consumption in the proxy will suddenly shoot up and it becomes non-responsive. Are you still using CHP for your proxy? I'm considering swapping it for Traefik in the coming days.

rcthomas avatar Jul 06 '22 20:07 rcthomas
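
For context, swapping the proxy implementation is done on the JupyterHub side rather than in CHP itself, via the separately installed jupyterhub-traefik-proxy package. A minimal, illustrative sketch of what that looks like in jupyterhub_config.py (not something the z2jh chart flips on by itself; check the package docs for the version you install):

```python
# Sketch: pointing JupyterHub at a Traefik-based proxy instead of CHP.
c = get_config()  # noqa: F821 -- injected when JupyterHub loads this config file
c.JupyterHub.proxy_class = "traefik_file"  # file-provider flavor of jupyterhub-traefik-proxy
# Traefik API credentials, the dynamic-config file path, etc. are then set on the
# corresponding proxy class; see the jupyterhub-traefik-proxy documentation.
```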

@snickell have you experienced this with older versions of z2jh -> chp as well?

consideRatio avatar Jul 14 '22 01:07 consideRatio

Still happening on the latest version (v4.5.6).

marcelofernandez avatar Aug 28 '23 13:08 marcelofernandez