Separate parallel backend from frontend
Recently I've been trying to split my pipelines up into chunks so the chunks can be re-used as much as possible. One issue I foresee running into is that scatter may not work as intended. scatter will send the data to the cluster for processing, but since the downstream nodes are already defined (I'm linking them via connect), they won't properly operate on the data, since they aren't DaskStream nodes.
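A minimal sketch of the failure mode (my names like `inc` and `chunk_in` are made up for illustration, and this assumes a running distributed.Client in the session):

```python
from streamz import Stream
import streamz.dask  # registers scatter() on Stream; requires distributed

def inc(x):
    return x + 1

# A reusable "chunk": plain Stream nodes intended to be shared across pipelines.
chunk_in = Stream()
chunk_out = chunk_in.map(inc)

# Front end of one particular pipeline: scatter moves the data to the cluster.
source = Stream()
scattered = source.scatter()  # a DaskStream node

# Linking via connect: scattered is a DaskStream, but chunk_in and its
# descendants are plain Stream nodes, so chunk_in.map(inc) is applied locally
# to the futures that scatter emits instead of being submitted to the cluster.
scattered.connect(chunk_in)
```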
It seems that it may be beneficial to separate the front-end pipeline definition from the decision about which backend is being used, at least at the class level. This would make pipelines more modular, since we wouldn't need to re-write the same pipeline for each backend. Additionally, this would allow the user executing a pipeline to decide which backend they want, or are able, to use.
One approach to handle this could be to specify a client for the entire pipeline, where a client mimics some of the dask client API (mostly submit and loop, I think). If the client associated with any node is changed or updated, then the entire pipeline shifts to use that client. This would also require nodes to know if they are downstream of a scatter but haven't been gathered yet, since then they would know to use the pipeline's client rather than standard local execution (which most likely should be a dummy client). Maybe this can be inspected when the client is set, since it could change during the pipeline build process?
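To make the idea concrete, here is a rough sketch of what a pipeline-level client could look like. Everything here (LocalClient, Node, set_client) is hypothetical and not part of streamz's actual API; it only illustrates the submit-based indirection, with a dummy client standing in for local execution:

```python
from concurrent.futures import Future


class LocalClient:
    """Dummy client: runs submit() calls eagerly, in-process.

    A real backend client (e.g. dask's) would also expose a ``loop``;
    this sketch only needs ``submit``.
    """
    def submit(self, func, *args, **kwargs):
        fut = Future()
        fut.set_result(func(*args, **kwargs))
        return fut


class Node:
    def __init__(self, func, upstream=None):
        self.func = func
        self.downstreams = []
        self.client = LocalClient()  # default: standard local execution
        if upstream is not None:
            upstream.downstreams.append(self)

    def set_client(self, client):
        # Setting the client on any node shifts the whole subtree to it.
        # A full version would also walk upstream, and would check whether
        # a node sits between a scatter and a gather.
        self.client = client
        for node in self.downstreams:
            node.set_client(client)

    def update(self, x):
        future = self.client.submit(self.func, x)
        result = future.result()  # a dask-backed client could stay lazy here
        for node in self.downstreams:
            node.update(result)


if __name__ == "__main__":
    root = Node(lambda x: x + 1)
    leaf = Node(print, upstream=root)
    # root.set_client(some_dask_backed_client)  # one call re-targets the tree
    root.update(1)  # prints 2 via the dummy client
```

The point of the indirection is that the pipeline graph is built once, and swapping the client is the only thing that changes where the work runs.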
I believe you had a simpler implementation of this idea than what is being described above? Having scatter/gather not use dask explicitly seems like an OK thing to want to me, but personally I also think there's a lot to be gained from the extra knowledge we have about the execution framework.
Can you elaborate on what is gained from the knowledge of the framework? I do have a working implementation of this, which I'll try to PR soon.