polars fix: convert `run_pipeline_no_finalize` from recursive to iterative in order to avoid stack overflows

fix: convert `run_pipeline_no_finalize` from recursive to iterative in order to avoid stack overflows

Open LukeMathWalker opened this issue 2 years ago • 5 comments

I ran into a stack overflow using polars in one of our projects—I was able to pin it down to run_pipeline_no_finalize, which is indeed recursive.

I've converted the algorithm to be iterative, which was not super-straightforward. Happy to iterate (pun intended) on the code to clean it up however you think it's best.

Oct 20 '23 14:10 LukeMathWalker

OK, something is not quite right—trying to figure it out, but I'm struggling.

Oct 20 '23 15:10 LukeMathWalker

Found it—I forgot to reset some iterator-local state 🤦🏻

Oct 20 '23 15:10 LukeMathWalker

Wow, this hasn't gotten any easier to follow. :')

Any idea what query caused the stackoverflow?

Oct 23 '23 15:10 ritchie46

The same query I mentioned in https://github.com/pola-rs/polars/issues/11829 minus the streaming part, with the important caveat that it is running over a large Parquet dataset that's (Hive-)partitioned into thousands of small files.

Oct 23 '23 16:10 LukeMathWalker

The same query I mentioned in #11829 minus the streaming part, with the important caveat that it is running over a large Parquet dataset that's (Hive-)partitioned into thousands of small files.

Yes, I estimated that. This is now fixed by #11922. Where instead of a union per file, we create a single source that can handle all those files. So the stackoverflow should be resolved by better handling multi-file datasets.

This currently is only done for parquet files, but I want to implement this for all file types we have. Can you give it a try with a compiled version of main?

Oct 24 '23 05:10 ritchie46

Since the original issue that this PR was trying address seems to have been resolved, I will close this. Thanks for the initiative though @LukeMathWalker !

Feb 10 '24 15:02 stinodego

polars polars copied to clipboard

fix: convert `run_pipeline_no_finalize` from recursive to iterative in order to avoid stack overflows

polars
polars copied to clipboard