streamz icon indicating copy to clipboard operation
streamz copied to clipboard

Memory leak with _global_sinks

Open fl4p-old opened this issue 5 years ago • 14 comments

in core.py the set _global_sinks captures all constructed sink objects. What was it used for? This should be either removed or replaced with a WeakSet

fl4p-old avatar May 22 '19 13:05 fl4p-old

I'll check deeper, but I suspect you are probably correct, and perhaps it isn't used at all and should be dropped. You would be welcome to submit a PR.

martindurant avatar May 22 '19 13:05 martindurant

There is some description why sinks are cached here: https://streamz.readthedocs.io/en/latest/core.html#modifying-and-cleaning-up-streams

The gist of it is that they anchor the pipeline and need to be explicitly removed as opposed to most other nodes which must have a reference to the node stored somewhere or the pipeline will be garbage collected.

CJ-Wright avatar May 22 '19 14:05 CJ-Wright

Thanks for the clarification, @CJ-Wright , I probably should have known that. I wonder, then, if streamz should provide an explicit API for working with this global reference store, or should there be something cleverer, whereby a sink removes itself upon the destruction or removal of its upstream (perhaps in conjunction with #248 ).

martindurant avatar May 22 '19 14:05 martindurant

Thanks @CJ-Wright for the doc link. I just tried to clear _global_sinks after setting up streams and this causes the pipeline to get partially garbage collected.

I think a better approach than a global reference store is to use strong references across stream connections (either up- or downstream) so branches can get properly garbage collected if not needed anymore. edit: like what @martindurant said

fl4p-old avatar May 22 '19 14:05 fl4p-old

Btw this is the PR which introduced this behavior: https://github.com/python-streamz/streamz/pull/72

CJ-Wright avatar May 22 '19 14:05 CJ-Wright

I think one issue with storing strong references in the pipeline, which is that del might not do the right thing. Since the graph keeps binding references to itself deleting the local variable representation on a node doesn't drop its references to zero.

I think an overriding of __del__ to use disconnect might work (with sink also removing itself from the globals).

CJ-Wright avatar May 22 '19 15:05 CJ-Wright

Doesn't __del__ only ever get called when those references are all gone?

martindurant avatar May 22 '19 15:05 martindurant

Ah, yes so working with __del__ won't work. But an extension of disconnect to work with sinks might work?

CJ-Wright avatar May 22 '19 15:05 CJ-Wright

also xref: https://github.com/python-streamz/streamz/issues/71

CJ-Wright avatar May 22 '19 15:05 CJ-Wright

disconnect on a sink, removing the last/only upstream from it (and there will be no downstreams) should indeed remove it from the global reference holder. That probably doesn't solve the original problem, though.

Stream.from_thing().map().sink();

would maintain a reference forever without a handle to the user - what do they disconnect on?

s = Stream.from_thing().map().sink()

could allow for disconnect/destroy on s, but then not much point in having this special behaviour of sink objects at all!

martindurant avatar May 22 '19 15:05 martindurant

Hmm. This is also an issue with sink_to_list which returns the list and not the node.

CJ-Wright avatar May 22 '19 16:05 CJ-Wright

class ListSink(list):
    def __init__(self, upstream):
        self.upstream = upstream
        super().__init__()

?? Or specialised sink_to_list class (looks a lot like accumulate or other state-holding stream)?

Maybe this discussion is somewhat off the rails at this point.

martindurant avatar May 22 '19 16:05 martindurant

My understanding of the track we're on:

Currently we store weak references for everything except sink nodes which are tracked in the global sink set. The sink set causes problems because it doesn't get cleared when sinks are disconnected.

In parallel there is a discussion about if we should we should be storing weak references or strong references internally to the graph.

CJ-Wright avatar May 22 '19 16:05 CJ-Wright

Specifically, we store downstreams as weakref, but concrete upstream, otherwise all instances would insta-disappear in every case. The assumption for ordinary nodes is that either a downstream node will refer to them, or the user will capture a reference themselves.

martindurant avatar May 22 '19 16:05 martindurant