
Slow memory leak

ollie-nye opened this issue on Mar 19 '24 · 14 comments

Describe the bug

We're seeing a slow memory leak over the course of a few days that results in the operator pod being OOMKilled. Logs look fairly normal except for a few recurring errors, all following these patterns (one way to check for resources stuck in deletion is sketched just after the list):

  • failed to delete finalizer: Operation cannot be fulfilled on exchanges.rabbitmq.com "<exchange_name>": the object has been modified; please apply your changes to the latest version and try again
  • failed to delete finalizer: Operation cannot be fulfilled on bindings.rabbitmq.com "<binding>": StorageError: invalid object, Code: 4, Key: /registry/rabbitmq.com/bindings/<binding>, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: <some uid>, UID in object meta:
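One way to check whether any topology resources are actually stuck in deletion with a finalizer still attached (sketched for illustration; it assumes jq is installed and cluster-wide read access):

```shell
# List topology resources that have a deletionTimestamp set but still exist,
# i.e. objects whose finalizer has not been removed yet.
for kind in exchanges.rabbitmq.com bindings.rabbitmq.com queues.rabbitmq.com policies.rabbitmq.com; do
  kubectl get "$kind" -A -o json \
    | jq -r --arg kind "$kind" '.items[]
        | select(.metadata.deletionTimestamp != null)
        | "\($kind) \(.metadata.namespace)/\(.metadata.name) finalizers=\(.metadata.finalizers)"'
done
```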

It's not much of a leak, just enough to start throwing alerts and then kill the pod every few days. (Usage screenshot below)

I couldn't find an existing issue for anything related, so I'm opening a new one.

To Reproduce

Our manifests are managed via Argo, but that shouldn't be getting involved with the resources the operator looks after. Outside of that, we're running a standard manifest: https://github.com/rabbitmq/messaging-topology-operator/releases/download/v1.13.0/messaging-topology-operator-with-certmanager.yaml

We have a persisted CA patched in with Kustomize, but that doesn't appear to affect finalizers.
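For illustration, a minimal sketch of that kind of setup; the file names, patch contents, and target below are placeholders rather than the exact configuration:

```shell
# Fetch the standard release manifest, then apply it together with a local CA patch via Kustomize.
curl -fsSLO https://github.com/rabbitmq/messaging-topology-operator/releases/download/v1.13.0/messaging-topology-operator-with-certmanager.yaml

cat <<'EOF' > kustomization.yaml
resources:
  - messaging-topology-operator-with-certmanager.yaml
patches:
  - path: ca-patch.yaml   # placeholder patch that wires in the persisted CA
    target:
      kind: Deployment
      name: messaging-topology-operator
      namespace: rabbitmq-system
EOF

kubectl apply -k .
```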

Expected behavior

Operator memory usage remains stable over longer periods of time.

Screenshots


Screenshot 2024-03-13 at 08 59 57

Version and environment information

  • Messaging Topology Operator: 1.13.0
  • RabbitMQ: 3.13.0
  • RabbitMQ Cluster Operator: 2.0.0
  • Kubernetes: v1.27.8
  • Cloud provider or hardware configuration: GKE

Additional context

This has been going on for at least the last 30 days; I can't say exactly when it started because we don't keep metrics for longer than that. We jumped from 1.10.3 to 1.13.0, and the alerts started shortly afterwards.

ollie-nye · Mar 19 '24

@mkuratczyk FYI, this may be similar to https://github.com/rabbitmq/cluster-operator/pull/1549

@ollie-nye how many Secret type objects do you have in your cluster? Does this number vary over time?
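A rough way to check that, assuming permission to list Secrets cluster-wide:

```shell
# Total number of Secret objects across all namespaces; re-run over time to see whether it grows.
kubectl get secrets --all-namespaces --no-headers | wc -l
```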

Zerpet · Apr 08 '24

That sounds like it might be similar! The total number of resources doesn't change very much, but new deployments of services get added and old ones dropped fairly regularly. Could it be that old references to those resources are still hanging around? Each of our deployments is roughly 150 k8s objects, and that happens maybe 10-20 times a day; over 10 days or so, that could start causing problems with the cache if the old entries don't get cleared up.

ollie-nye · Apr 08 '24

I've tested this operator for an issue similar to https://github.com/rabbitmq/cluster-operator/pull/1549 and could not trigger anything like it. I believe that's because the root cause of that issue was that the Cluster Operator watches "generic" resources, such as ConfigMaps, Secrets, and StatefulSets, whereas the Topology Operator only watches RabbitMQ-specific resources (correct me if I'm wrong).

mkuratczyk · Apr 08 '24

@mkuratczyk That 150 count is just the topology operator side, split across approximately 60 bindings, 30 exchanges, 30 policies, and 30 queues. We set up per-PR environments with their own queues and bindings for testing our main app, so there's quite a lot of shuffling about happening on the regular!

ollie-nye · Apr 08 '24

@ollie-nye if you could set ENABLE_DEBUG_PPROF=true in the Operator's pod environment, set up port forwarding and then connect with go tool pprof -web http://localhost:8080/debug/pprof/heap (I don't remember if it's 8080 or 9782) to see what's taking a lot of memory, that would be very helpful
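Roughly, those steps look like this; the rabbitmq-system namespace, the deployment name, and the local port are assumptions based on the default release manifest:

```shell
# Enable the debug pprof endpoint on the operator.
kubectl -n rabbitmq-system set env deployment/messaging-topology-operator ENABLE_DEBUG_PPROF=true

# Forward the debug port locally (try 8080 first, then 9782 if that doesn't respond).
kubectl -n rabbitmq-system port-forward deployment/messaging-topology-operator 8080:8080

# In another terminal: render the current heap profile (the -web view needs graphviz installed).
go tool pprof -web http://localhost:8080/debug/pprof/heap
```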

mkuratczyk · Apr 08 '24

> I've tested this operator for an issue similar to rabbitmq/cluster-operator#1549 and could not trigger anything like it. I believe that's because the root cause of that issue was that the Cluster Operator watches "generic" resources, such as ConfigMaps, Secrets, and StatefulSets, whereas the Topology Operator only watches RabbitMQ-specific resources (correct me if I'm wrong).

We also watch Secret objects. Those are required for User credentials and for the connectionSecret feature.

Zerpet · Apr 08 '24

@mkuratczyk Here's that output after a few hours of running since the restart. If nothing stands out yet, I can grab a new capture in a couple of days once it's grown a bit more?

Screenshot 2024-04-08 at 13 56 56

ollie-nye · Apr 08 '24

Certainly the CertPool size is suspiciously large (do you have a lot of certificates?), but the key question is where the usage is growing, so yes, a second capture after some time would be great. Thanks!
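To make the two captures easy to compare, the profiles can be saved and diffed; this sketch assumes the same port-forward as above:

```shell
# Save a baseline heap profile now...
curl -s -o heap-baseline.pb.gz http://localhost:8080/debug/pprof/heap

# ...then, after a couple of days, grab a second one and diff it against the baseline
# to see which allocation paths account for the growth.
curl -s -o heap-later.pb.gz http://localhost:8080/debug/pprof/heap
go tool pprof -diff_base heap-baseline.pb.gz -web heap-later.pb.gz
```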

mkuratczyk · Apr 08 '24

Not a huge amount, less than 50 across the whole cluster, but sure thing, I'll get another capture after it's sat for a bit. Thanks so much for your help so far!

ollie-nye · Apr 08 '24

New capture from this morning; the same pod has gone from 64MiB of memory to 180MiB over the last couple of days. That CertPool usage only looks to account for a small chunk of it, though?

Screenshot 2024-04-10 at 09 28 56

ollie-nye · Apr 10 '24

Strange. In the upper-left corner it says "81.86MB total"; previously it was 56.21MB. So growth, yes, with certs responsible for about a third of that growth. But where is the other 100MB? :)
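One way to put the two numbers side by side (same port-forward assumption; some gap is expected, since pod RSS includes memory the Go runtime holds but the heap profile doesn't attribute):

```shell
# Pod-level memory, i.e. what the kubelet and the OOM killer see.
kubectl -n rabbitmq-system top pod | grep messaging-topology-operator

# Runtime memory stats from the same pprof endpoint; HeapIdle and Sys cover memory
# the Go runtime has reserved beyond what the heap profile reports as in use.
curl -s 'http://localhost:8080/debug/pprof/heap?debug=1' | grep -E '# (Sys|HeapSys|HeapInuse|HeapIdle|HeapReleased)'
```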

mkuratczyk · Apr 10 '24

Ah, nice spot! I've got no clue where the extra usage is coming from, though; I'll grab another capture in a few days and go from there?

ollie-nye · Apr 10 '24

One more capture from today. There was a sharp jump in total use yesterday morning, but the profile seems to be diverging further from the pod's memory use, and certs are still taking up a decent chunk of it. Our total certificate count hasn't changed during this time either, but a lot of certificates have been created and destroyed with environment changes; could it be that the process is hanging on to certificates that are no longer in the cluster?
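To help narrow that down, pprof can restrict the report to the certificate-related allocation paths; same port-forward assumption, and the regex simply matches function names containing CertPool:

```shell
# Report only samples whose call stacks match "CertPool",
# e.g. crypto/x509.(*CertPool).AddCert, to see how much memory they retain.
go tool pprof -top -focus=CertPool http://localhost:8080/debug/pprof/heap
```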

Screenshot 2024-04-11 at 16 17 27 Screenshot 2024-04-11 at 16 18 18

ollie-nye · Apr 11 '24

Indeed I can see that creating a lot of Secrets, even completely unrelated to RabbitMQ, increases the memory usage. I will try to fix that, although we do need to watch some of the secrets so the exact solution is not yet clear to me.

mkuratczyk · Apr 23 '24

We've seen the pod's memory usage flatten out over the last week or so. Thanks so much for getting this resolved!

ollie-nye · May 20 '24