Gradual Increase in Memory Consumption

Open J0ram opened this issue 1 year ago • 4 comments

Hi,

Since 1.4.8 we have observed our shell-operator pods slowly consuming more memory over time: [image]

I made a local branch with pprof enabled, and it appears to be logrus that is not releasing its memory: [image]
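For anyone who wants to repeat the profiling step, here is a minimal sketch of capturing a heap profile with Go's standard `runtime/pprof` package. The helper name `captureHeapProfile` and the output path `heap.pprof` are assumptions for illustration; this is not how the local branch mentioned above was wired up.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// captureHeapProfile returns a heap profile in the gzip-compressed protobuf
// format that `go tool pprof` reads. Hypothetical helper, for illustration only.
func captureHeapProfile() ([]byte, error) {
	// Run a collection first so the profile reflects up-to-date live-heap data.
	runtime.GC()

	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := captureHeapProfile()
	if err != nil {
		fmt.Fprintln(os.Stderr, "heap profile failed:", err)
		os.Exit(1)
	}
	// heap.pprof is an assumed output path.
	if err := os.WriteFile("heap.pprof", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes to heap.pprof\n", len(data))
}
```

The resulting file can be inspected with `go tool pprof -inuse_space heap.pprof` followed by `top`, which lists the functions holding the most live memory.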

Environment:

  • shell-operator version: 1.4.8 - 1.4.11 (we've tested each version on release)
  • Kubernetes version: AKS 1.29.4
  • Installation type: Helm

Worth noting that 1.4.7 behaves as expected on the same cluster.

Anything else we should know?: I find it odd that nobody else is reporting this issue - I can only assume it's some oddity in our environment but I'm pretty much out of ideas.

From what I can see, the version of the logrus package hasn't changed between versions of this application (particularly 1.4.7 - 1.4.8). If you have any ideas on how we could debug this further, that would be appreciated.

I've attached the heap dump in case it's of any help.

Thanks

heap.zip

J0ram, Sep 11 '24 07:09

Hit by this issue too. Tried setting GOMEMLIMIT with no luck (then checked the Go version: 1.19, which does not support a soft memory limit).

  • Shell Operator: 1.4.12
  • K8s: 1.30.3
  • Linux kernel: 6.6.52 with THP enabled in madvise mode (relevant for Go > 1.20, I think)

Reproducer project: https://github.com/cit-consulting/hetzner-failoverip-controller

vladimirfx, Sep 27 '24 08:09

Also hitting this.

  • Shell Operator: 1.4.10
  • K8s: 1.29.8

sidineyc, Oct 02 '24 09:10

Same here, with multiple operators running on different clusters using 1.4.10. The pod crashes and restarts when it hits its memory limit. [Screenshot 2024-10-18 at 14 51 25]

kyale, Oct 18 '24 12:10

Checked 1.4.14 - classic memory leak:

[Screenshot 2024-10-20 at 16 38 52]

Because of Go 1.22 and GOMEMLIMIT, the operator burns a lot of CPU on GC before being killed by the kubelet.

vladimirfx, Oct 20 '24 13:10

Hello. Thank you for the report. We also ran into the logrus leak some time ago. We're currently working on replacing the logger.

yalosev, Oct 23 '24 08:10

We have a quick fix in v1.4.15. Could you try it, please?

yalosev, Oct 23 '24 12:10

> We have a quick fix in v1.4.15. Could you try it, please?

The memory profile looks better, but I've been hit by log duplication: https://github.com/flant/shell-operator/issues/675

I'll keep monitoring.

vladimirfx, Oct 23 '24 16:10

We've been running 1.4.15 overnight - memory usage is completely flat :)

Thanks all

J0ram, Oct 24 '24 08:10