Severe memory leak/accumulation due to ZeroMQ subscription data build-up
Description
At NIWA we've noticed long-running UIS instances (via cylc gui) consuming roughly 5 GB of memory a day:
As a workaround we've been restarting the UIS every 4-5 days...
We're not stopping and starting workflows under different runs all the time; we have a fairly static ~57 workflows in the operational system, and a similar number in the test system.
Reproducible Example
Just running a UIS instance with one workflow running will, given time, show a reduction in free memory on the machine.
After analyzing objects in the UIS, the problem didn't appear to be ordinary objects like the data store and other UIS attributes:
However, using memray (https://github.com/bloomberg/memray) on a UIS with 5 small workflows, I was able to find the exact location of the problem:
The profile shows an increase over time:
And it appeared to be the data store's executor threads used for ZeroMQ subscriptions (getting data from the scheduler(s)), threads which this tool can isolate:
More specifically, the delta processing after receipt is what accumulates:
I suspect this has something to do with the data going from the subscription thread to the main thread.
Expected Behaviour
For a relatively static load of workflows, the UIS should consume a static amount of memory (not grow).
Well, it appears not to be the threading. I took the processing out of the thread and put the incoming deltas into a queue:
(iterating over the queue with a periodic callback)
And the data is piling up in the main thread:
To be sure, I'll remove threading altogether and just call the ZeroMQ receive from the periodic callback. However, it does seem to be something in the processing...
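For reference, here is a rough sketch of the queue-plus-periodic-callback arrangement described above (this is not the actual cylc-uiserver code; `sock`, `apply_delta` and the function names are illustrative stand-ins):

```python
# Illustrative sketch only: a background thread does nothing but receive and
# queue deltas; the main thread drains the queue from a periodic callback.
import queue
import threading

delta_queue = queue.Queue()


def subscriber_loop(sock):
    """Background thread: only receive and queue, never process."""
    while True:
        delta_queue.put(sock.recv())  # sock stands in for a ZeroMQ SUB socket


def drain_deltas(apply_delta):
    """Periodic callback on the main thread: all processing happens here."""
    while True:
        try:
            delta = delta_queue.get_nowait()
        except queue.Empty:
            return
        apply_delta(delta)  # apply_delta stands in for the UIS delta processing


def start_subscriber(sock):
    """Start the receiving thread."""
    threading.Thread(target=subscriber_loop, args=(sock,), daemon=True).start()
```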
Maybe a memory leak in protobuf? There was one: https://github.com/protocolbuffers/protobuf/issues/10088
But that should be fixed in our version (4.24.4).
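(For anyone checking their own environment: the protobuf version and the backend actually in use can be inspected as below; api_implementation is an internal module, so treat this as a debugging aid rather than stable API.)

```python
import google.protobuf
from google.protobuf.internal import api_implementation

print(google.protobuf.__version__)  # e.g. 4.24.4
print(api_implementation.Type())    # 'upb', 'cpp' or 'python'
```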
Removed threading altogether: https://github.com/cylc/cylc-uiserver/pull/683
But the memory accumulation is still present:
So I've found the issue: it comes from using long-lived upb protobuf objects (see https://github.com/protocolbuffers/protobuf/issues/19674 and https://github.com/protocolbuffers/protobuf/issues/10294).
If you use the python implementation by setting:
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
The apply_delta accumulation goes away:
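(One caveat if setting this from Python rather than the shell: the variable has to be set before google.protobuf is first imported, since the backend is chosen at import time. A minimal sketch:)

```python
import os

# Must happen before the first import of google.protobuf (or anything that
# imports it, e.g. cylc.flow.data_messages_pb2), otherwise upb is already loaded.
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

from google.protobuf.internal import api_implementation
print(api_implementation.Type())  # should now report 'python'
```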
We can do one of two things:
- Change apply_delta to produce new objects from the old (to avoid long-lived objects)
- Set the above env var to use the python implementation
I'm going to start with the first option, because upb is faster and the fix will benefit both the scheduler and the UIS.
We should also only need to recreate the workflow object and maybe a couple of other things, because TaskProxy objects eventually reach a final state and are pruned thereafter anyway.
For example:
#!/usr/bin/python
from time import sleep

from cylc.flow.data_messages_pb2 import PbWorkflow
from cylc.flow.data_store_mgr import reset_protobuf_object

wbuf = PbWorkflow(id='this')
wbuf2 = PbWorkflow(id='that', status='sprinting')

while True:
    wbuf.status_msg = 'workflow is running'  # This will accumulate memory
    wbuf.MergeFrom(wbuf2)  # This will accumulate memory too
    sleep(0.00001)
Then running
$ memray run --live ./test_mem.py
will show the memory snowball:
However setting:
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
Or using a function like:
def reset_protobuf_object(msg_class, msg_orig):
    """Reset object to clear memory build-up."""
    # See: https://github.com/protocolbuffers/protobuf/issues/19674
    # The new message instantiation needs to happen on a separate line.
    new_msg = msg_class()
    new_msg.ParseFromString(
        msg_orig.SerializeToString(deterministic=True)
    )
    return new_msg
while True:
    wbuf.status_msg = 'workflow is running'  # This will accumulate memory
    wbuf.MergeFrom(wbuf2)  # This will accumulate memory too
    wbuf = reset_protobuf_object(PbWorkflow, wbuf)  # This will clear the accumulation
    sleep(0.00001)
will result in static memory use:
I have reproduced your results with this script (derived from the above):
#!/usr/bin/python
from time import sleep

from cylc.flow.data_messages_pb2 import PbWorkflow


def reset_protobuf_object(msg_class, msg_orig):
    """Reset object to clear memory build-up."""
    # See: https://github.com/protocolbuffers/protobuf/issues/19674
    # The new message instantiation needs to happen on a separate line.
    new_msg = msg_class()
    new_msg.ParseFromString(
        msg_orig.SerializeToString(deterministic=True)
    )
    return new_msg


wbuf = PbWorkflow(id='this')
wbuf2 = PbWorkflow(id='that', status='sprinting')

for itt in range(1000000):
    wbuf.status_msg = 'workflow is running'  # This will accumulate memory
    wbuf.MergeFrom(wbuf2)  # This will accumulate memory too
    sleep(0.00001)
    if itt % 100000 == 0:
        print(itt)
        wbuf = reset_protobuf_object(PbWorkflow, wbuf)
Running with the reset_protobuf_object line commented out, there is a clear stepped pattern as blocks of memory are acquired:
Running with the reset_protobuf_object line included, there is a clear sawtooth pattern lining up with each call of reset_protobuf_object:
OK wow, so is this a bug in protobuf or are we using it wrongly? (I've not had a chance to look at the details, sorry).
From the issues David linked, it sounds like it's a performance "feature".
The big problem is that this issue would appear to hit every single protobuf object, which makes it rather difficult to defend against. The only thing we seem to be able to do is to serialise and deserialise the message periodically, which is pretty nasty.
> The big problem is that this issue would appear to hit every single protobuf object, which makes it rather difficult to defend against.
Most protobuf objects are short-lived and only undergo a finite number of the actions that accumulate memory (it appears to be field changes that do it).
The only real problem is the long-lived PbWorkflow and PbTask objects.
> The only thing we seem to be able to do is to serialise and deserialise the message periodically, which is pretty nasty.
Oliver and I remedied this. I had taken it to the extreme while testing, but a less impactful strategy is now in place.
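For illustration only, a sketch of what a lighter-touch strategy might look like (this is not the cylc-flow implementation; the class and RESET_INTERVAL are hypothetical, and reset_protobuf_object is the function shown earlier in this thread): only the long-lived message is rebuilt, and only periodically rather than on every delta.

```python
from cylc.flow.data_messages_pb2 import PbWorkflow
from cylc.flow.data_store_mgr import reset_protobuf_object  # as in the script above

RESET_INTERVAL = 1000  # hypothetical: number of delta applications between resets


class WorkflowStoreSketch:
    """Illustrative only: keep a long-lived PbWorkflow but rebuild it periodically."""

    def __init__(self):
        self.workflow = PbWorkflow(id='example')
        self._applied = 0

    def apply_delta(self, delta):
        self.workflow.MergeFrom(delta)
        self._applied += 1
        if self._applied % RESET_INTERVAL == 0:
            # Serialise/deserialise to shed upb's accumulated internal memory.
            self.workflow = reset_protobuf_object(PbWorkflow, self.workflow)
```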
The protobuf memory leak has now been resolved.
Is this memory leak outstanding? If so, do we have any evidence that it is ZMQ?
Note: data should only accumulate in ZMQ if the UIS is receiving deltas faster than it can apply them, which is not impossible.
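(For context, a minimal pyzmq sketch of where such a backlog would sit; this is not the UIS code, and the endpoint and the process() stub are made up. Unread messages queue on the SUB socket up to its high-water mark, beyond which ZeroMQ starts dropping them for pub/sub connections.)

```python
import zmq


def process(msg):
    """Stand-in for the UIS delta processing."""


context = zmq.Context()
sock = context.socket(zmq.SUB)
sock.setsockopt(zmq.RCVHWM, 1000)     # bound the receive-side buffer
sock.setsockopt(zmq.SUBSCRIBE, b'')   # subscribe to all topics
sock.connect('tcp://localhost:5556')  # hypothetical scheduler endpoint

while True:
    # If this loop is slower than the publisher, messages accumulate here,
    # up to the high-water mark (plus OS socket buffers).
    process(sock.recv())
```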
Thank you for this write-up! I'm hitting this same issue, and your example was very helpful in convincing myself that this was in fact the problem.
@oliver-sanders , I'm having trouble finding the commit where you dealt with the issue. Would you mind pointing me at how you implemented the fix?
@sritchie, the protobuf memory leak was fixed in this PR https://github.com/cylc/cylc-flow/pull/6727 (the PR is in the cylc-flow library that is imported by cylc-uiserver).
The memory leak was caused by a design oddity in the protobuf library (i.e. it's not a bug, it's a feature?!) that we have now worked around.
Would recommend updating to the latest Cylc 8.5 stack and seeing how things fare.
The memory usage of the Cylc UI Server is determined by what the user(s) are doing with it. More & larger workflows => higher memory footprint. At present the data for stopped workflows is not cleared until they are deleted, so hoarding stopped workflows is a known way to inflate memory usage and may result in a gradual increase (we'll restrict this in a future release).
Beyond that, anything that looks like a leak on the latest releases, let us know.
@dwsutherland, I think we should close this one now.
If we detect further issues, those should go in a new issue.
Thanks @oliver-sanders ! I was actually running into the problem in a separate project, and was curious to see what the "minimal diff" was for a fix. Thanks for the reference, this was excellent.
Evidence at my site suggests that upgrading to Cylc 8.5 fixed the problem.
Yes, I haven't seen this problem at NIWA since 8.5.0. It appears to have been to do with the old GraphQL/graphene library (< v3).
The protobuf accumulation was also addressed.