Memory leak because of stats collection
Description
After enabling stats collection (using `stats_cb`), I noticed that some of our services started consuming much more memory. I was not able to trace it back exactly to the confluent_kafka library, but disabling stats collection gets rid of the problem.
How to reproduce
I've tried to make a minimal example and came across an unexpected behavior:
```python
import datetime
import json
import time

from confluent_kafka import Producer


def _stats_cb(stats: str) -> None:
    json.loads(stats)
    print(f"Got stats at {datetime.datetime.now().isoformat()}")


producer = Producer(
    {
        "bootstrap.servers": "PLAINTEXT://localhost:29092",
        "stats_cb": _stats_cb,
        "statistics.interval.ms": 100,
    },
)

producer.produce(topic="email_check.outbound.check.triggered", value=json.dumps({"foo": "bar"}))
print("produced message")
time.sleep(1)
producer.flush()
print("flushed")
```
In this case, `_stats_cb` is called ten times, but only once `flush()` is invoked, not every 100 ms while the script sleeps.
My assumption is that if a service publishes Kafka records relatively infrequently, the callbacks build up in between and can lead to memory issues.
Is this intended behaviour?
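In the meantime, my understanding is that callbacks in confluent-kafka-python are served from `poll()`/`flush()`, so a workaround might be to poll the producer periodically even when idle, so the queued stats callbacks get served. A minimal sketch (broker address is a placeholder):

```python
# Minimal sketch (not from the original report): serve queued callbacks by
# polling the producer periodically, even when nothing is being produced.
import time

from confluent_kafka import Producer

producer = Producer(
    {
        "bootstrap.servers": "localhost:9092",  # placeholder
        "stats_cb": lambda stats: print(f"stats payload: {len(stats)} bytes"),
        "statistics.interval.ms": 1000,
    }
)

for _ in range(10):
    producer.poll(0)  # serves stats (and delivery) callbacks queued so far
    time.sleep(1)
```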
Checklist
Please provide the following information:
- [ ] confluent-kafka-python version: ('2.4.0', 33816576), librdkafka version: ('2.4.0', 33816831)
- [ ] Apache Kafka broker version: 2.8.2
- [ ] Client configuration: see example
- [ ] Operating system: macOS
- [ ] Provide client logs (with `'debug': '..'` as necessary)
- [ ] Provide broker log excerpts
- [ ] Critical issue
I think I've observed the same, but with a consumer. It appears the combination of `statistics.interval.ms` and the poll timeout matters. With `statistics.interval.ms=15000` and a poll timeout of 1 second, I'm seeing jumps of exactly 264 KiB (270336 bytes) about every 4 minutes.
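For context, the consumer loop looks roughly like this (a minimal sketch; broker address, group id, and topic are placeholders, not from the actual service):

```python
# Minimal sketch of the consumer setup described above; broker address,
# group id, and topic are placeholders.
from confluent_kafka import Consumer


def _stats_cb(stats: str) -> None:
    pass  # stats are parsed/forwarded here in the real service


consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",
        "group.id": "memory-test",
        "stats_cb": _stats_cb,
        "statistics.interval.ms": 15000,
    }
)
consumer.subscribe(["some.topic"])

while True:
    msg = consumer.poll(timeout=1.0)  # 1 s poll timeout, as described above
    if msg is None:
        continue
    # ... process msg ...
```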
The memory log is generated with:

```bash
while true; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(ps -o rss -C python --no-headers | awk '{sum+=$1} END {print sum * 1024 " bytes"}')" >> memory_log.txt
    sleep 30
done
```
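To quantify the jumps, a small script of this sort can diff consecutive samples in `memory_log.txt` (a hypothetical helper, not part of the original setup):

```python
# Hypothetical helper: print the RSS delta between consecutive samples in
# memory_log.txt (lines look like "2024-05-01 12:00:00 270336000 bytes").
from datetime import datetime

prev = None
with open("memory_log.txt") as f:
    for line in f:
        date, clock, rss, _unit = line.split()
        ts = datetime.fromisoformat(f"{date} {clock}")
        rss = int(rss)
        if prev is not None:
            delta_kib = (rss - prev[1]) / 1024
            seconds = (ts - prev[0]).total_seconds()
            if delta_kib:
                print(f"{ts}: {delta_kib:+.0f} KiB after {seconds:.0f} s")
        prev = (ts, rss)
```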
That rate fits with what we see in production.
Thanks for the details on the leak; related tickets hadn't captured the rate or isolated the calls as well as this. Apologies that no one engaged on the thread right away. I've marked this as high priority to put more urgency on the investigation.
Related, possibly same issue: https://github.com/confluentinc/confluent-kafka-python/issues/1361
Thanks for raising this issue!
Summary of RCA
- We executed the reproduction script without the `stats_cb` and noted no observable increase in memory consumption.
- We then enabled the `stats_cb` and ran the script for approximately one hour, and observed memory growth (approx. 0.5 MB per 15 min).
- First, we debugged the [stats ops queue](https://github.com/confluentinc/librdkafka/blob/477eb2ef81606e853b85cad4037f184dd747840c/src/rdkafka.c#L2033-L2038) to check whether the queue was leaking:
  - We added debug logs to a local librdkafka build, which confirmed that the internal statistics queue is successfully drained once it holds 1-10 entries, negating the hypothesis of an internal queue overflow. We also attempted to manually impose a size limit (100) on the librdkafka statistics queue, but it yielded no change in the memory growth pattern.
  - This verifies that the issue is not caused by the `stats_cb` ops queue, negating what was suggested in the issue description.
- Second, we ran heap profiling with `tracemalloc` (a sketch of this kind of profiling follows the list):
  - It identified an accumulation of Python string objects (code link), suggesting the stats string payloads are building up. However, this is a scenario the GC is typically expected to resolve.
  - After a few iterations, the GC runs and memory consumption drops significantly (check the graphs).
- This verifies that the rise in memory is due to strings on the heap, and explains why the community member observed the memory increase in their local setup.
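For reference, heap profiling of this kind can be done with a loop like the following (a sketch, not the exact script used in the RCA):

```python
# Sketch of tracemalloc-based heap profiling (not the exact RCA script):
# snapshot the heap periodically and print the top allocation sites.
import time
import tracemalloc

tracemalloc.start()

for i in range(20):
    time.sleep(15)  # let the producer and stats callbacks run for a while
    snapshot = tracemalloc.take_snapshot()
    print(f"--- snapshot {i} ---")
    for stat in snapshot.statistics("lineno")[:5]:
        print(stat)
```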
Memory Growth images:
@danhje Could you please share the application code behind the graph, to validate whether the callback is storing any objects?
With the above RCA, it seems the memory leak is not present in the latest versions of librdkafka and confluent-kafka-python. We will be closing this issue; please feel free to reopen if you face it again.