messaging-topology-operator
Slow memory leak
Describe the bug
We're seeing a slow memory leak over the course of a few days that's resulting in the operator pod being OOMKilled. Logs look fairly normal except for a few recurring errors, all following these patterns:
```
failed to delete finalizer: Operation cannot be fulfilled on exchanges.rabbitmq.com "<exchange_name>": the object has been modified; please apply your changes to the latest version and try again

failed to delete finalizer: Operation cannot be fulfilled on bindings.rabbitmq.com "<binding>": StorageError: invalid object, Code: 4, Key: /registry/rabbitmq.com/bindings/<binding>, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: <some uid>, UID in object meta:
```
It's not much of a leak, just enough to start throwing alerts and then kill the pod every few days. (Usage screenshot below)
I couldn't find any existing issues related to this, so I'm opening a new one.
To Reproduce
Our manifests are managed via Argo, but that shouldn't be getting involved in resources the operator looks after. Outside of that, we're running a standard manifest: https://github.com/rabbitmq/messaging-topology-operator/releases/download/v1.13.0/messaging-topology-operator-with-certmanager.yaml
We have a persisted CA being patched in with Kustomize but that doesn't look to be affecting finalizers.
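For completeness, applying that release manifest by hand is roughly equivalent to what Argo syncs for us (URL as above, nothing else customised beyond the CA patch):

```sh
# Apply the v1.13.0 release manifest (cert-manager variant) referenced above.
kubectl apply -f https://github.com/rabbitmq/messaging-topology-operator/releases/download/v1.13.0/messaging-topology-operator-with-certmanager.yaml
```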
Expected behavior
Operator memory usage to remain stable over longer periods of time.
Screenshots
Version and environment information
- Messaging Topology Operator: 1.13.0
- RabbitMQ: 3.13.0
- RabbitMQ Cluster Operator: 2.0.0
- Kubernetes: v1.27.8
- Cloud provider or hardware configuration: GKE
Additional context
This has been happening for at least the last 30 days; I can't say exactly when it started because we don't keep metrics for longer than that. We jumped from 1.10.3 to 1.13.0 and the alerts started shortly afterwards.
@mkuratczyk FYI, this may be similar to https://github.com/rabbitmq/cluster-operator/pull/1549
@ollie-nye how many Secret type objects do you have in your cluster? Does this number vary over time?
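If it helps, a quick way to get a rough count (and to watch it over time) would be something like:

```sh
# Rough count of Secret objects across all namespaces.
kubectl get secrets --all-namespaces --no-headers | wc -l
```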
That sounds like it might be similar! The total number of resources doesn't change very much, but new deployments of services get added and old ones dropped fairly regularly. Could it be that old references to those resources are still hanging around? Each of our deployments is about ~150 k8s objects, and that happens maybe 10-20 times a day, so over 10 days or so it sounds like that could start causing problems with the cache if entries don't get cleared up.
I've tested this operator for an issue similar to https://github.com/rabbitmq/cluster-operator/pull/1549 and could not trigger anything like it. I believe because the root cause of the issue was that Cluster Operator watches "generic" resources, such as ConfigMaps, Secrets, StatefulSets. Meanwhile, the Topology Operator only watches rabbitmq-specific resources (correct me if I'm wrong).
@mkuratczyk That 150 count is just the topology operator side, split over (approximately) 60 bindings, 30 exchanges, 30 policies and 30 queues. We set up per-PR environments with their own queues and bindings for testing our main app and such, so there's quite a lot of shuffling about happening on the regular!
@ollie-nye if you could set ENABLE_DEBUG_PPROF=true in the Operator's pod environment, set up port forwarding and then connect with go tool pprof -web http://localhost:8080/debug/pprof/heap (I don't remember if it's 8080 or 9782) to see what's taking a lot of memory, that would be very helpful
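As a rough sketch of those steps (assuming the standard install, i.e. a messaging-topology-operator Deployment in the rabbitmq-system namespace; adjust names and the port to your setup):

```sh
# Enable the debug pprof endpoint; this triggers a rollout of the operator pod.
kubectl -n rabbitmq-system set env deployment/messaging-topology-operator ENABLE_DEBUG_PPROF=true

# Forward the pprof port locally (try 9782 if 8080 doesn't respond).
kubectl -n rabbitmq-system port-forward deploy/messaging-topology-operator 8080:8080

# In another terminal: open the heap profile in a browser...
go tool pprof -web http://localhost:8080/debug/pprof/heap

# ...and/or save a copy so later captures can be compared against it.
curl -s http://localhost:8080/debug/pprof/heap > heap-$(date +%F).pb.gz
```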
> I've tested this operator for an issue similar to rabbitmq/cluster-operator#1549 and could not trigger anything like it. I believe because the root cause of the issue was that Cluster Operator watches "generic" resources, such as ConfigMaps, Secrets, StatefulSets. Meanwhile, the Topology Operator only watches rabbitmq-specific resources (correct me if I'm wrong).
We also watch Secret objects. Those are required for User credentials and for connectionSecret feature.
@mkuratczyk Here's that output after a few hours of running since it restarted. If nothing stands out yet, I can grab a new capture in a couple of days after it's grown a bit more?
Certainly the CertPool size is suspiciously large (do you have a lot of certificates?) but the key question is where the usage is growing, so yes, a second capture after some time would be great. Thanks
Not a huge amount, less than 50 across the whole cluster, but sure thing, I'll get another capture after it's sat for a bit. Thanks so much for your help so far!
New capture from this morning: the same pod has gone from 64MiB to 180MiB of memory over the last couple of days. That certpool usage looks to only account for a small chunk of that though?
Strange. In the upper left corner it says "81.86MB total"; previously it was 56.21MB. So growth, yes, with certs responsible for a third of that growth. But where is the other 100MB :)
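As an aside, if you save the captures to files, pprof can diff them directly, which makes this kind of comparison easier (file names here are just placeholders):

```sh
# Show what grew between two captures; -base subtracts the older profile from the newer one.
go tool pprof -top -base heap-old.pb.gz heap-new.pb.gz

# Or browse the diff interactively in the web UI.
go tool pprof -http=:8081 -base heap-old.pb.gz heap-new.pb.gz
```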
Ah nice spot! I've got no clue where the extra usage is coming from though, might grab another capture in a few days and go from there?
One more capture from today. Total use had a sharp jump yesterday morning, but the profiler seems to be diverging more from the pod's actual memory use, with certs still taking up a decent chunk of it. Our total certificate count hasn't changed during this either, but a lot of certificates have been created and destroyed with environment changes. Could it just be that the process is hanging on to certificates that are no longer in the cluster?
Indeed I can see that creating a lot of Secrets, even completely unrelated to RabbitMQ, increases the memory usage. I will try to fix that, although we do need to watch some of the secrets so the exact solution is not yet clear to me.
We've seen the pod usage flatten out over the last week or so, thanks so much for getting this resolved!