
Memory leak because of stats collection

Open mpolyakov-plutoflume opened this issue 1 year ago • 1 comments

Description

After enabling stats collection (via stats_cb), I noticed that some of our services started consuming much more memory. I was not able to trace it back exactly to the confluent_kafka library, but disabling stats collection gets rid of the problem.

How to reproduce

I tried to make a minimal example and came across unexpected behavior:

import datetime
import json
import time

from confluent_kafka import Producer


def _stats_cb(stats: str) -> None:
    json.loads(stats)
    print(f"Got stats at {datetime.datetime.now().isoformat()}")


producer = Producer(
    {
        "bootstrap.servers": "PLAINTEXT://localhost:29092",
        "stats_cb": _stats_cb,
        "statistics.interval.ms": 100,
    },
)

producer.produce(topic="email_check.outbound.check.triggered", value=json.dumps({"foo": "bar"}))
print("produced message")
time.sleep(1)
producer.flush()
print("flushed")

In this case, _stats_cb is called ten times, all at once when flush() is called. My assumption is that if a service publishes Kafka records relatively infrequently, callbacks build up and can lead to memory issues.

Is this an intended behaviour?
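The buildup described above can be illustrated with a toy model (pure Python, no Kafka; ToyClient is a hypothetical stand-in, not part of confluent_kafka). The sketch assumes stats events accumulate on an internal queue and are only delivered to the application when it calls poll()/flush(), which matches the observed burst of ten callbacks:

```python
import time
from collections import deque


class ToyClient:
    """Toy stand-in for a Kafka client: stats events queue up between polls."""

    def __init__(self, interval_s: float) -> None:
        self.interval_s = interval_s
        self._last_emit = time.monotonic()
        self._pending: deque[str] = deque()

    def _tick(self) -> None:
        # Model of a background thread enqueueing one stats event per interval.
        now = time.monotonic()
        while now - self._last_emit >= self.interval_s:
            self._last_emit += self.interval_s
            self._pending.append("stats-json")

    def poll(self) -> int:
        # Queued callbacks are only served here, analogous to poll()/flush().
        self._tick()
        served = len(self._pending)
        self._pending.clear()
        return served


client = ToyClient(interval_s=0.1)   # like statistics.interval.ms=100
time.sleep(1.05)                     # application goes quiet for ~1 s
served = client.poll()
print(f"{served} stats callbacks fired at once")
```

If this model matches the real dispatch behavior, a mitigation would be to serve callbacks regularly (e.g. calling the producer's poll() on a timer) even when no messages are being produced.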

Checklist

Please provide the following information:

  • [ ] confluent-kafka-python ('2.4.0', 33816576), librdkafka ('2.4.0', 33816831)
  • [ ] Apache Kafka broker version: 2.8.2
  • [ ] Client configuration: see example
  • [ ] Operating system: macOS
  • [ ] Provide client logs (with 'debug': '..' as necessary)
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

mpolyakov-plutoflume avatar Sep 24 '24 11:09 mpolyakov-plutoflume

I think I've observed the same, but with a consumer. It appears the combination of statistics.interval.ms and the poll timeout matters. With statistics.interval.ms=15000 and a poll timeout of 1 second, I'm seeing jumps of exactly 264 kB (270336 bytes) about every 4 minutes:

(image: memory usage graph showing periodic 264 kB jumps)

This log is generated with:

while true; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(ps -o rss -C python --no-headers | awk '{sum+=$1} END {print sum * 1024 " bytes"}')" >> memory_log.txt;
    sleep 30;
done

That rate fits with what we see in production: (image: production memory usage graph)
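A back-of-envelope check of those numbers (the per-emission figure below is an inference from the reported values, not a measurement):

```python
# Numbers reported above: 270336-byte RSS jumps about every 4 minutes,
# with statistics.interval.ms = 15000.
jump_bytes = 270336
jump_period_s = 4 * 60
stats_interval_s = 15

# Stats emissions between jumps, and implied bytes retained per emission.
emissions_per_jump = jump_period_s / stats_interval_s
bytes_per_emission = jump_bytes / emissions_per_jump
print(emissions_per_jump)   # 16.0
print(bytes_per_emission)   # 16896.0
```

That works out to roughly 16.5 kB retained per stats emission, which is in the plausible size range for a single librdkafka statistics JSON payload, consistent with payloads (or per-callback state) never being freed.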

danhje avatar Jan 31 '25 10:01 danhje

Thanks for the details on the leak; other related tickets hadn't captured the rate or isolated the calls this well. Apologies that no one engaged on the thread right away. I've marked this as high priority to put more urgency on investigating.

MSeal avatar Jul 23 '25 23:07 MSeal

Related, possibly same issue: https://github.com/confluentinc/confluent-kafka-python/issues/1361

MSeal avatar Jul 23 '25 23:07 MSeal