polars
Multi-output, multi-sink lazy polars
Description
Let's assume we have a compute graph:
A->B->C->D
and also
C->E
How can I materialize B, E and D as output?
Do I need to do it "the old way", step by step and procedurally? Or can Polars add support for optimizing and executing the whole graph together, so that it is pipelined (CPU-cache friendly) and avoids excessive rechunking/combining (or at least does that work on a separate thread)?
Ah yeah, I have definitely wanted this feature before. It might be very complex for query optimization though in the current state of the engine. You can imagine it dramatically changing the physical plan in cases where your final artifact only depends on a subset of columns and your intermediaries depend on many others.
You can kinda achieve this with a polars.collect_all but I don't think CSE is done between the two plans, so in practice you're doing duplicate work still.
In theory you could maybe do it with results that are lists and structs of various sizes with various window functions that you chop up into multiple frames after materializing.
Polars can cache a LazyFrame, but you probably also want a way to say "release it after it hits the next node".
Our new streaming engine will be able to do this. It will be able to multiplex to multiple intermediate/output nodes.
One step ahead as always. Thanks!
Our new streaming engine will be able to do this. It will be able to multiplex to multiple intermediate/output nodes.
Am I right in understanding that version 2.0 will be released soon?
Is there an expected date for this?
This might be possible now using multiplexed sinks.