streamz
streamz copied to clipboard
Memory leak with _global_sinks
in core.py
the set _global_sinks
captures all constructed sink objects. What was it used for?
This should be either removed or replaced with a WeakSet
I'll check deeper, but I suspect you are probably correct, and perhaps it isn't used at all and should be dropped. You would be welcome to submit a PR.
There is some description why sink
s are cached here: https://streamz.readthedocs.io/en/latest/core.html#modifying-and-cleaning-up-streams
The gist of it is that they anchor the pipeline and need to be explicitly removed as opposed to most other nodes which must have a reference to the node stored somewhere or the pipeline will be garbage collected.
Thanks for the clarification, @CJ-Wright , I probably should have known that. I wonder, then, if streamz should provide an explicit API for working with this global reference store, or should there be something cleverer, whereby a sink removes itself upon the destruction or removal of its upstream (perhaps in conjunction with #248 ).
Thanks @CJ-Wright for the doc link.
I just tried to clear _global_sinks
after setting up streams and this causes the pipeline to get partially garbage collected.
I think a better approach than a global reference store is to use strong references across stream connections (either up- or downstream) so branches can get properly garbage collected if not needed anymore. edit: like what @martindurant said
Btw this is the PR which introduced this behavior: https://github.com/python-streamz/streamz/pull/72
I think one issue with storing strong references in the pipeline, which is that del
might not do the right thing. Since the graph keeps binding references to itself deleting the local variable representation on a node doesn't drop its references to zero.
I think an overriding of __del__
to use disconnect
might work (with sink
also removing itself from the globals).
Doesn't __del__
only ever get called when those references are all gone?
Ah, yes so working with __del__
won't work. But an extension of disconnect
to work with sinks might work?
also xref: https://github.com/python-streamz/streamz/issues/71
disconnect
on a sink
, removing the last/only upstream from it (and there will be no downstreams) should indeed remove it from the global reference holder. That probably doesn't solve the original problem, though.
Stream.from_thing().map().sink();
would maintain a reference forever without a handle to the user - what do they disconnect on?
s = Stream.from_thing().map().sink()
could allow for disconnect/destroy on s
, but then not much point in having this special behaviour of sink objects at all!
Hmm. This is also an issue with sink_to_list
which returns the list and not the node.
class ListSink(list):
def __init__(self, upstream):
self.upstream = upstream
super().__init__()
??
Or specialised sink_to_list
class (looks a lot like accumulate or other state-holding stream)?
Maybe this discussion is somewhat off the rails at this point.
My understanding of the track we're on:
Currently we store weak references for everything except sink
nodes which are tracked in the global sink set. The sink set causes problems because it doesn't get cleared when sinks are disconnected.
In parallel there is a discussion about if we should we should be storing weak references or strong references internally to the graph.
Specifically, we store downstreams as weakref, but concrete upstream, otherwise all instances would insta-disappear in every case. The assumption for ordinary nodes is that either a downstream node will refer to them, or the user will capture a reference themselves.