protobuf
[Python] Potential memory leak
What version of protobuf and what language are you using?
Version: 6.33.0
Language: Python
What operating system (Linux, Windows, ...) and version?
OS: Ubuntu Linux 22.04.1
Kernel: 6.8.0-87-generic
What runtime / compiler are you using (e.g., python version or gcc version)?
Python 3.12
What did you do?
I've hacked together a small example repo (kudos to @Atheuz for the original I forked from), but essentially it boils down to the following:
```python
obj = schema_pb2.SomeObj()
for _ in range(10_000_000):
    obj.ParseFromString(b"...")
del obj
```
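For reference, one way to observe the growth from inside the script is to check peak RSS before and after the loop. This is a sketch using only the stdlib `resource` module (not part of the example repo); note that on Linux `ru_maxrss` is reported in KiB:

```python
import resource

def peak_rss_kib():
    # Peak resident set size of the current process;
    # on Linux, ru_maxrss is reported in kibibytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kib()
# ... run the ParseFromString loop here ...
print(f"peak RSS grew by {peak_rss_kib() - before} KiB")
```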
What did you expect to see?
That there would be no growth in memory usage.
What did you see instead?
Until `obj` was deleted, memory usage grew uncontrollably.
Anything else we should know about your project / environment?
This is a continuation of https://github.com/protocolbuffers/protobuf/issues/10088 (I think).
I've started a Google Group discussion where it was explained to me that this behaviour is intended and expected.
But it is still rather counterintuitive that Unpack and ParseFromString keep allocating more and more memory on repeated calls.
I think this deserves at least a mention in the docs on how to use them properly, and so I was advised to open an issue here.
Thank you!
Just an initial response: it looks like https://github.com/protocolbuffers/protobuf/issues/10088 was a bit different, in that the root-cause work Josh drove there was about C free() behavior not actually releasing memory back to the OS. It may be that some of the later comments reporting they still had a problem were confusing it with the implications of the py-upb arena model, though.
For anyone who stumbles across this as I did, there seem to be two workarounds:

- Set `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` before running your script. This forces the `python` backend instead of the `upb` backend, which is MUCH slower but does not "leak".
- Simply delete and recreate the object on every iteration. For example, this version does not leak at all:

```python
for _ in range(10_000_000):
    obj = schema_pb2.SomeObj()
    obj.ParseFromString(b"...")
```
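The first workaround can also be applied from Python itself rather than the shell. A sketch, assuming the variable is set before `google.protobuf` is imported for the first time (the backend choice is read once at import time; the verification step is shown commented out so the snippet stands alone):

```python
import os

# Must be set before google.protobuf is imported anywhere in the
# process; the implementation is chosen once at import time.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# After importing protobuf, the active backend can be verified with:
# from google.protobuf.internal import api_implementation
# assert api_implementation.Type() == "python"
```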
Second "just initial response": I think we actually should be able to change the behavior so that PyUpb_Message_Clear effectively just does PyUpb_Message_New and swaps in the new message + arena. This would let a number of paths (but not all) avoid growing forever on these many-allocations-then-discarded objects, including the case in the original repro above.
This will need follow-up investigation, though. If an interested contributor wants to explore making that change in the C code around PyUpb_Message_Clear() while keeping our other tests passing (the main concern is what behavior is observed if someone holds a variable referencing a submessage of the top-level message across those operations) and report back, that could accelerate things here; otherwise this may become a back-burner task for us to get to when we have some slack available.
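To illustrate the proposed semantics at a high level, here is a toy Python model. All names here are hypothetical stand-ins (the real change would live in the C code around PyUpb_Message_Clear()): clearing by swapping in a fresh arena lets the old arena's allocations be collected, while clearing in place keeps every past allocation alive.

```python
class Arena:
    # Toy stand-in for a upb arena: allocations accumulate and are
    # only released when the whole arena is dropped.
    def __init__(self):
        self.blocks = []

    def alloc(self, size):
        self.blocks.append(bytearray(size))

class ToyMessage:
    def __init__(self):
        self._arena = Arena()

    def parse(self, payload):
        # Each parse allocates from the message's arena.
        self._arena.alloc(len(payload))

    def clear_keep_arena(self):
        # Rough model of today's behavior: fields are reset, but the
        # arena (and everything ever allocated from it) stays alive.
        pass

    def clear_swap_arena(self):
        # Rough model of the proposed change: behave like constructing
        # a brand-new message, so the old arena becomes collectable.
        self._arena = Arena()

msg = ToyMessage()
for _ in range(3):
    msg.parse(b"payload")
msg.clear_keep_arena()
assert len(msg._arena.blocks) == 3  # memory still held
msg.clear_swap_arena()
assert len(msg._arena.blocks) == 0  # old arena freed
```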