Support streaming: true on collect

Open josevalim opened this issue 1 year ago • 2 comments

https://github.com/pola-rs/polars/issues/3397#issuecomment-1341188319

Jul 12 '24 11:07 josevalim

Huge +1 for this. The other day I ran into a case where a sequence of joins on lazy DataFrames loaded far more data into memory than I expected, and I think streaming might help with that.

Jul 16 '24 21:07 spencerkent

Just to make sure I understand correctly: would this mean being able to perform computations (derived columns, typically) on a DataFrame that doesn’t fully fit in RAM (i.e. streamed from disk, with roughly linear memory cost), and then also stream the resulting output row by row — for example to write it back to disk, still keeping the memory footprint linear?

Oct 17 '25 08:10 thbar
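
To make the question above concrete, here is a minimal sketch of the pattern being described: read rows from disk one at a time, compute a derived column, and write each row straight back out. It uses only the Python standard library and is not the Explorer or Polars API. The function name `stream_derived_column` and its parameters are hypothetical, chosen for illustration. Whether `streaming: true` on `collect` would behave this way is exactly what the question asks, so treat this as a picture of the idea and not of the implementation.

```python
import csv
from typing import Callable, Dict

Row = Dict[str, str]


def stream_derived_column(
    src_path: str,
    dst_path: str,
    new_column: str,
    derive: Callable[[Row], object],
) -> int:
    """Copy a CSV from src_path to dst_path, adding one derived column.

    Hypothetical helper: only one row is held in memory at a time, so
    memory use does not grow with the size of the input file.
    Returns the number of data rows written.
    """
    rows_written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)  # lazily yields one dict per row
        fieldnames = list(reader.fieldnames or []) + [new_column]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            row[new_column] = derive(row)  # compute the derived value
            writer.writerow(row)           # emit immediately, keep nothing
            rows_written += 1
    return rows_written
```

A call such as `stream_derived_column("in.csv", "out.csv", "total", lambda r: int(r["price"]) * int(r["qty"]))` would process the file without ever materializing it as a whole. Joins and aggregations are harder to stream than row-wise derived columns, because they may need to see many rows before producing any output.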