FutureProducer's `send` has been observed to cause a memory leak
I have a service running on ECS that's based on the `rust:1.80.0-bookworm` Docker image. Its main job is to take data from an HTTP endpoint and proxy it to a specific Kafka topic.
The issue appeared after migrating to Amazon MSK and handling roughly 600 requests a minute. In order to authenticate with MSK, I have to use the `aws-msk-iam-sasl-signer` crate to create a custom `FutureProducer` context.
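For context, this is roughly what that custom context looks like. It's a minimal sketch rather than my production code: the broker address is a placeholder, and the exact `aws-msk-iam-sasl-signer` call plus blocking on a tokio runtime handle from librdkafka's refresh callback are assumptions on my part.

```rust
use std::error::Error;

use aws_msk_iam_sasl_signer::generate_auth_token;
use aws_types::region::Region;
use rdkafka::client::{ClientContext, OAuthToken};
use rdkafka::producer::FutureProducer;
use rdkafka::ClientConfig;

// Producer context that generates MSK IAM auth tokens for the
// OAUTHBEARER SASL mechanism whenever librdkafka asks for a refresh.
struct IamProducerContext {
    region: Region,
    // Handle to the tokio runtime, used to drive the async signer call
    // from librdkafka's (non-async) refresh callback.
    rt: tokio::runtime::Handle,
}

impl ClientContext for IamProducerContext {
    // Opt in to rdkafka's OAuth token refresh callback.
    const ENABLE_REFRESH_OAUTH_TOKEN: bool = true;

    fn generate_oauth_token(
        &self,
        _oauthbearer_config: Option<&str>,
    ) -> Result<OAuthToken, Box<dyn Error>> {
        // generate_auth_token is async; this blocks librdkafka's thread
        // until the signed token is ready (assumed safe here because the
        // callback runs on librdkafka's own thread, not a tokio worker).
        let (token, expiration_time_ms) = self
            .rt
            .block_on(generate_auth_token(self.region.clone()))?;
        Ok(OAuthToken {
            token,
            principal_name: "".to_string(),
            lifetime_ms: expiration_time_ms,
        })
    }
}

fn make_producer(
    ctx: IamProducerContext,
) -> rdkafka::error::KafkaResult<FutureProducer<IamProducerContext>> {
    ClientConfig::new()
        .set("bootstrap.servers", "<msk-bootstrap-brokers>")
        .set("security.protocol", "SASL_SSL")
        .set("sasl.mechanism", "OAUTHBEARER")
        .create_with_context(ctx)
}
```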
Additionally, here are the feature flags for the `rdkafka` version I'm using:
rdkafka = { version = "0.36.2", features = ["cmake-build", "sasl", "ssl"] }
After sifting through my service's code, I was able to narrow the memory leak down specifically to the `FutureProducer::send` function. I switched it out for `send_result` and the memory leak went away.
In the attached image, the orange line shows the ever-growing memory consumption when using `send`, versus the stable blue line when using `send_result`.
I'm running 3 ECS tasks to rule out any bias and show that it doesn't happen at random.
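For reference, this is roughly the difference between the two call patterns I compared. The topic, key, and error handling are simplified placeholders, not my actual service code.

```rust
use std::time::Duration;

use rdkafka::producer::{FutureProducer, FutureRecord};

// The call that showed the ever-growing memory usage: `send` enqueues the
// record and awaits the delivery report in one step.
async fn produce_with_send(producer: &FutureProducer, payload: &[u8]) {
    let record = FutureRecord::to("my-topic").key("my-key").payload(payload);
    match producer.send(record, Duration::from_secs(5)).await {
        Ok((partition, offset)) => {
            println!("delivered to partition {partition} at offset {offset}")
        }
        Err((err, _owned_message)) => eprintln!("delivery failed: {err}"),
    }
}

// The variant that stayed flat: `send_result` only enqueues the record and
// hands back a DeliveryFuture that the caller awaits (or drops) itself.
async fn produce_with_send_result(producer: &FutureProducer, payload: &[u8]) {
    let record = FutureRecord::to("my-topic").key("my-key").payload(payload);
    match producer.send_result(record) {
        Ok(delivery_future) => match delivery_future.await {
            Ok(Ok((partition, offset))) => {
                println!("delivered to partition {partition} at offset {offset}")
            }
            Ok(Err((err, _owned_message))) => eprintln!("delivery failed: {err}"),
            Err(_canceled) => eprintln!("delivery future canceled"),
        },
        Err((err, _record)) => eprintln!("enqueue failed: {err}"),
    }
}
```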
We are experiencing the same issue.
The first image shows memory usage with `send_result`, while the second shows `send`. This 'slow' leak is on my machine; in our staging environment we see a leak of several hundred MB per second.
We are using `rdkafka = { version = "0.37.0", features = ["ssl", "cmake-build"] }`.
We are experiencing the same issue.
I wrote a service that consumes messages from one Kafka cluster and forwards Flink CDC event messages to another Kafka cluster based on the database and table. After the service processed 1 million messages, memory usage increased from an initial 15MB to 1.75GB in about 30 seconds.
I used the `pmap` and `dd` commands to dump the relevant memory regions and found that much of the content was Kafka message payloads.
We've also seen increased memory usage from this that does not appear to drop after the load is taken away. Additionally, when the producer load is removed entirely and the process should be idle, we notice persistent CPU usage. A flamegraph suggests that it may be continuing to poll the rdkafka queue:
This actually appears to come from the `ThreadedProducer` calling `poll` in a tight loop with a 100ms timeout. If I increase that timeout to 1 second, I don't see the persistent CPU burn.
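If the polling cadence matters for your workload and you can't patch the crate, one possible workaround (an assumption on my part, not something confirmed by the maintainers) is to skip the `FutureProducer`/`ThreadedProducer` polling thread entirely and drive a `BaseProducer` yourself, so you choose when `poll` runs. A rough sketch; the broker address and topic are placeholders:

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{BaseProducer, BaseRecord, Producer};

fn run() -> Result<(), Box<dyn std::error::Error>> {
    let producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .create()?;

    // Enqueue a record; BaseProducer::send returns immediately.
    producer
        .send(BaseRecord::to("my-topic").key("my-key").payload("hello"))
        .map_err(|(err, _record)| err)?;

    // Serve delivery callbacks at a cadence we choose, instead of
    // ThreadedProducer's fixed 100ms loop.
    producer.poll(Duration::from_secs(1));

    // Block until any outstanding messages are delivered or the timeout expires.
    producer.flush(Duration::from_secs(5))?;
    Ok(())
}
```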