cylc-uiserver

Severe memory leak/accumulation due to ZeroMQ subscription data build-up

Open dwsutherland opened this issue 8 months ago • 9 comments

Description

At NIWA we've noticed a long-running UIS (via cylc gui) consuming roughly 5 GB of additional memory per day:

Image

As a workaround we've been restarting the UIS every 4-5 days.

We're not constantly stopping and starting workflows under different runs; we have a fairly static ~57 workflows in operations, and a similar number in the test system.

Reproducible Example

Just running a UIS instance with a single workflow will, given time, show a reduction in free memory on the machine.

After analyzing objects in the UIS, the problem didn't appear to be with ordinary objects like the data store and other UIS attributes:

Image

Image

However, using memray (https://github.com/bloomberg/memray) on a UIS with 5 small workflows, I was able to find the exact location of the problem:

Image

Shows an increase over time:

Image

Image

And it appeared to be the executor threads of the data store used for ZeroMQ subscriptions (getting data from the Scheduler(s)), threads which this tool can isolate:

Image

More specifically, the delta processing after receipt is accumulating memory:

Image

I suspect this has something to do with the data passing from the subscription thread to the main thread.

Expected Behaviour

For a relatively static load of workflows, the UIS should consume a static amount of memory (not grow).

dwsutherland avatar Apr 10 '25 06:04 dwsutherland

Well, it appears not to be the threading: I took the processing out of the thread and put the incoming deltas into a queue:

Image

(iterating over the queue with a periodic callback)
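
In outline, the experiment looks something like this (a minimal sketch using Tornado's PeriodicCallback, with illustrative names and interval; not the actual UIS code):

import queue

from tornado.ioloop import IOLoop, PeriodicCallback

delta_queue = queue.Queue()  # filled by the ZeroMQ subscriber thread


def on_delta_received(raw_delta):
    """Runs in the subscriber thread: just hand the raw delta over."""
    delta_queue.put(raw_delta)


def apply_delta(raw_delta):
    """Hypothetical stand-in for the data-store delta processing."""


def process_queued_deltas():
    """Runs on the main thread via the periodic callback."""
    while True:
        try:
            raw_delta = delta_queue.get_nowait()
        except queue.Empty:
            break
        apply_delta(raw_delta)


if __name__ == '__main__':
    # In the UIS the Tornado IOLoop is already running; it is started
    # here only to keep the sketch self-contained.
    PeriodicCallback(process_queued_deltas, 1000).start()
    IOLoop.current().start()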

And the data is piling up in the main thread:

Image

To be sure, I'll remove threading altogether and just call the ZeroMQ receive from the periodic callback. However, it does seem to be something in the processing.

dwsutherland avatar Apr 11 '25 10:04 dwsutherland

Maybe a memory leak in protobuf? There was one: https://github.com/protocolbuffers/protobuf/issues/10088

But that should be fixed in our version (4.24.4).

dwsutherland avatar Apr 12 '25 10:04 dwsutherland

Removed threading altogether: https://github.com/cylc/cylc-uiserver/pull/683

But the memory accumulation is still present:

Image

dwsutherland avatar Apr 12 '25 13:04 dwsutherland

So I've found the issue: it's with using long-lived upb protobuf objects: https://github.com/protocolbuffers/protobuf/issues/19674 https://github.com/protocolbuffers/protobuf/issues/10294

If you use the pure-Python implementation by setting export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python, the apply_delta accumulation goes away:

Image

We can do one of two things:

  • Change the apply-delta code to produce new objects from the old ones (so avoiding long-lived objects)
  • Set the above environment variable, to use the pure-Python implementation (see the sketch below)
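
A quick sketch of the second option (illustrative only; in practice the variable would just be exported in the environment that launches the UIS): it has to be set before any generated *_pb2 module is imported, otherwise the upb implementation is already loaded.

import os

# Must be set before any generated *_pb2 module is imported, otherwise
# protobuf has already picked the upb implementation.
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

from cylc.flow.data_messages_pb2 import PbWorkflow  # noqa: E402

wbuf = PbWorkflow(id='this')  # now backed by the pure-Python implementation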

I'm going to start with the first option, because upb is faster and the fix will affect both the scheduler and the UIS. We should also only need to recreate the workflow object and maybe a couple of other things, because TaskProxy objects eventually reach a final state and are pruned thereafter anyway.

dwsutherland avatar Apr 13 '25 10:04 dwsutherland

For example:

#!/usr/bin/python

from time import sleep

from cylc.flow.data_messages_pb2 import PbWorkflow
from cylc.flow.data_store_mgr import reset_protobuf_object


wbuf = PbWorkflow(id='this')
wbuf2 = PbWorkflow(id='that', status='sprinting')
while True:
    wbuf.status_msg = 'workflow is running'  # This will accumulate memory
    wbuf.MergeFrom(wbuf2)  # This will accumulate memory too
    sleep(0.00001)

Then running $ memray run --live ./test_mem.py will show the memory snowball:

Image

However, setting export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python, or using a function like:

def reset_protobuf_object(msg_class, msg_orig):
    """Reset object to clear memory build-up."""
    # See: https://github.com/protocolbuffers/protobuf/issues/19674
    # The new message instantiation needs to happen on a separate line.
    new_msg = msg_class()
    new_msg.ParseFromString(
        msg_orig.SerializeToString(deterministic=True)
    )
    return new_msg


while True:
    wbuf.status_msg = 'workflow is running'  # This will accumulate memory
    wbuf.MergeFrom(wbuf2)  # This will accumulate memory too
    wbuf = reset_protobuf_object(PbWorkflow, wbuf)  # This will clear the accumulation
    sleep(0.00001)

will result in static memory use:

Image

dwsutherland avatar Apr 14 '25 08:04 dwsutherland

I have reproduced your results with this script (derived from the above):

#!/usr/bin/python

from time import sleep

from cylc.flow.data_messages_pb2 import PbWorkflow


def reset_protobuf_object(msg_class, msg_orig):
    """Reset object to clear memory build-up."""
    # See: https://github.com/protocolbuffers/protobuf/issues/19674
    # The new message instantiation needs to happen on a separate line.
    new_msg = msg_class()
    new_msg.ParseFromString(
        msg_orig.SerializeToString(deterministic=True)
    )
    return new_msg


wbuf = PbWorkflow(id='this')
wbuf2 = PbWorkflow(id='that', status='sprinting')
for itt in range(1000000):
    wbuf.status_msg = 'workflow is running'  # This will accumulate memory
    wbuf.MergeFrom(wbuf2)  # This will accumulate memory too
    sleep(0.00001)
    if itt % 100000 == 0:
        print(itt)
        wbuf = reset_protobuf_object(PbWorkflow, wbuf)

Running with the reset_protobuf_object line commented out, there is a clear stepped pattern as blocks of memory are acquired:

Image

Running with the reset_protobuf_object line included, there is a clear sawtooth pattern lining up with each call to reset_protobuf_object:

Image

oliver-sanders avatar Apr 14 '25 14:04 oliver-sanders

OK wow, so is this a bug in protobuf or are we using it wrongly? (I've not had a chance to look at the details, sorry).

hjoliver avatar Apr 15 '25 00:04 hjoliver

From the issues David linked, it sounds like it's a performance "feature".

The big problem is that this issue would appear to hit every single protobuf object, which makes it rather difficult to defend against. The only thing we seem to be able to do is to serialise and deserialise the message periodically, which is pretty nasty.

oliver-sanders avatar Apr 15 '25 08:04 oliver-sanders

The big problem is that this issue would appear to hit every single protobuf object, which makes it rather difficult to defend against.

Most protobuf objects are short-lived and undergo only a finite number of memory-accumulating actions (it appears to be changing fields that accumulates). The only real problem is the long-lived PbWorkflow and PbTask objects.

The only thing we seem to be able to do is to serialise and deserialise the message periodically, which is pretty nasty.

Oliver and I remedied this; I had taken it to the extreme while testing, but a less impactful strategy is now in place.

dwsutherland avatar Apr 30 '25 04:04 dwsutherland

The protobuf memory leak has now been resolved.

Is this memory leak outstanding? If so, do we have any evidence that it is ZMQ?

Note: data should only accumulate in ZMQ if the UIS is receiving deltas faster than it can apply them, which is not impossible.
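
For reference, a rough sketch of the ZeroMQ side (illustrative endpoint and values, not the UIS's actual socket setup): a SUB socket buffers incoming messages only up to its receive high-water mark, so a slow consumer shows up as a bounded backlog rather than unbounded growth inside ZMQ itself.

import zmq

ctx = zmq.Context.instance()
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.RCVHWM, 10000)         # cap on buffered incoming messages
sub.setsockopt_string(zmq.SUBSCRIBE, '')  # subscribe to all topics
sub.connect('tcp://localhost:5556')       # hypothetical scheduler publish endpoint

# Once the buffers fill, the PUB/SUB pattern drops further messages
# rather than buffering without bound, so a slow consumer is bounded in
# memory (at the cost of missed deltas).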

oliver-sanders avatar Jul 28 '25 12:07 oliver-sanders

Thank you for this write-up! I'm hitting this same issue, and your example was very helpful in convincing myself that this was in fact the problem.

@oliver-sanders , I'm having trouble finding the commit where you dealt with the issue. Would you mind pointing me at how you implemented the fix?

sritchie avatar Jul 30 '25 06:07 sritchie

@sritchie, the protobuf memory leak was fixed in this PR https://github.com/cylc/cylc-flow/pull/6727 (the PR is in the cylc-flow library that is imported by cylc-uiserver).

The memory leak was caused by a design oddity in the protobuf library (i.e. it's not a bug, it's a feature?!) that we have now worked around.

Would recommend updating to the latest Cylc 8.5 stack and seeing how things fare.

The memory usage of the Cylc UI Server is determined by what the user(s) are doing with it. More & larger workflows => higher memory footprint. At present the data for stopped workflows is not cleared until they are deleted, so hoarding stopped workflows is a known way to inflate memory usage and may result in a gradual increase (we'll restrict this in a future release).

Beyond that, anything that looks like a leak on the latest releases, let us know.

oliver-sanders avatar Aug 04 '25 10:08 oliver-sanders

@dwsutherland, I think we should close this one now.

If we detect further issues, those should go in a new issue.

oliver-sanders avatar Aug 04 '25 11:08 oliver-sanders

Thanks @oliver-sanders ! I was actually running into the problem in a separate project, and was curious to see what the "minimal diff" was for a fix. Thanks for the reference, this was excellent.

sritchie avatar Aug 04 '25 16:08 sritchie

Evidence at my site suggests that upgrading to Cylc 8.5 fixed the problem.

hjoliver avatar Aug 05 '25 02:08 hjoliver

Yes, I haven't seen this problem at NIWA since 8.5.0. It appears to have been due to the old GraphQL/graphene library (< v3).

The protobuf accumulation was also addressed.

dwsutherland avatar Oct 21 '25 03:10 dwsutherland