[Python] Potential memory leak

Open viktorvorobev opened this issue 1 month ago • 3 comments

What version of protobuf and what language are you using?
Version: 6.33.0
Language: Python

What operating system (Linux, Windows, ...) and version?
OS: Ubuntu Linux 22.0.1
Kernel: 6.8.0-87-generic

What runtime / compiler are you using (e.g., python version or gcc version)?
Python 3.12

What did you do?
I've hacked together a little example repo (kudos to @Atheuz for the original, which I forked), but essentially it boils down to this:

obj = schema_pb2.SomeObj()

for _ in range(10_000_000):
    # Re-parse into the same long-lived message; memory grows on every call
    obj.ParseFromString(b"...")

del obj  # only now is the accumulated memory released
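To quantify the growth, here is a quick measurement sketch using only the standard-library `resource` module (note the `ru_maxrss` unit differs by OS: KiB on Linux, bytes on macOS). The allocation loop is a stand-in workload, since the real repro needs the compiled `schema_pb2` module:

```python
import resource

def peak_rss() -> int:
    # Peak resident set size so far (KiB on Linux, bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
# Stand-in workload: the real repro calls obj.ParseFromString(...) here
# on one long-lived message, which shows the same monotonic growth.
retained = [bytes(1024) for _ in range(50_000)]  # roughly 50 MiB kept alive
after = peak_rss()
print(after > before)
```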

What did you expect to see?

That there would be no growth in memory usage.

What did you see instead?

Until obj was deleted, memory usage grew without bound.

Anything else we should know about your project / environment

This is a continuation of https://github.com/protocolbuffers/protobuf/issues/10088 (I think).

I've started a Google Groups discussion, where it was explained to me that the behaviour is intended and expected.

But it is still counterintuitive to see that Unpack and ParseFromString keep allocating more and more memory on a long-lived message. I think this deserves at least a mention in the docs on how to use them properly, so I was advised to open an issue here.

Thank you!

viktorvorobev avatar Nov 03 '25 14:11 viktorvorobev

Just an initial response: it looks like https://github.com/protocolbuffers/protobuf/issues/10088 was a bit different; at least, the root-cause work that Josh drove there was about C free() behavior not actually releasing memory back to the OS. It may be that the later comments reporting they still had a problem were actually confusing it with the implications of the Py-upb arena model, though.

esrauchg avatar Nov 03 '25 15:11 esrauchg

For anyone who stumbles across this as I did, there appear to be two workarounds:

  1. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python in the environment before running your script. This forces the pure-Python backend instead of the upb backend; it seems to be MUCH slower, but does not "leak".

  2. Simply recreate the object every time. E.g. this example does not leak at all:

    for _ in range(10_000_000):
        obj = schema_pb2.SomeObj()  # fresh message each iteration; the old one is freed
        obj.ParseFromString(b"...")
    
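As a toy illustration of why workaround 2 works (this is a hypothetical stand-in model, not real protobuf/upb code): a message whose backing arena only grows while the object is alive reproduces both patterns.

```python
class ArenaMessage:
    """Toy stand-in for a upb-backed message: its 'arena' only grows
    while the object is alive (hypothetical model, not real protobuf code)."""

    def __init__(self):
        self._arena = []  # stand-in for the C arena's block list

    def parse_from_string(self, data: bytes):
        # Each parse allocates from the arena; earlier blocks are not freed
        self._arena.append(bytearray(data))

# Pattern A (grows): one long-lived object re-parsed many times
long_lived = ArenaMessage()
for _ in range(1000):
    long_lived.parse_from_string(b"x" * 100)
print(len(long_lived._arena))  # 1000 retained blocks

# Pattern B (bounded): recreate the object on every iteration
for _ in range(1000):
    fresh = ArenaMessage()
    fresh.parse_from_string(b"x" * 100)
print(len(fresh._arena))  # 1 block; prior arenas were garbage-collected
```

The difference is that in pattern B the old object (and everything its arena retained) becomes unreachable each iteration, so memory stays bounded.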

viktorvorobev avatar Nov 03 '25 15:11 viktorvorobev

Second "just initial response": I think we actually should be able to change the behavior so that PyUpb_Message_Clear effectively does PyUpb_Message_New and swaps in the new message + arena. That would let a number of paths (but not all) avoid growing forever on these many-allocations-discarded objects, including the case in the original repro above.

This will need follow-up investigation, though. If an interested contributor wants to explore making that change in the C code around PyUpb_Message_Clear() while keeping our other tests passing (most of the concern is about what behavior is observed if someone holds a reference to a submessage of the top-level message across those operations) and report back, that could accelerate it here; otherwise this may become a back-burner task for us to get to when we have some slack available.
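To make the submessage-aliasing concern concrete, here is a toy Python model (hypothetical names, not the real C/upb code) of Clear-as-swap: a child wrapper keeps the old arena alive through its own reference, so its data stays valid while the parent moves on to a fresh arena.

```python
class ToyMessage:
    """Hypothetical model of the proposed Clear-as-swap behavior:
    clear() swaps in a fresh arena instead of reusing the old one."""

    def __init__(self):
        self._arena = {"payload": None}  # stand-in for message + arena

    def parse(self, data: bytes):
        self._arena["payload"] = data

    def submessage(self):
        # A child wrapper holds a reference into the *current* arena
        return self._arena

    def clear(self):
        # Swap rather than reuse: the old arena stays alive as long as
        # any child wrapper still references it
        self._arena = {"payload": None}

m = ToyMessage()
m.parse(b"hello")
child = m.submessage()
m.clear()
print(child["payload"])     # old arena survives via the child's reference
print(m._arena["payload"])  # parent now allocates from the fresh arena
```

Under this model the open question is which observable behavior is acceptable: the child keeps seeing the pre-clear data (as above), which differs from clearing in place.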

esrauchg avatar Nov 03 '25 19:11 esrauchg