polars
polars copied to clipboard
fix: convert `run_pipeline_no_finalize` from recursive to iterative in order to avoid stack overflows
I ran into a stack overflow using polars in one of our projects—I was able to pin it down to run_pipeline_no_finalize, which is indeed recursive.
I've converted the algorithm to be iterative, which was not super-straightforward. Happy to iterate (pun intended) on the code to clean it up however you think it's best.
OK, something is not quite right—trying to figure it out, but I'm struggling.
Found it—I forgot to reset some iterator-local state 🤦🏻
Wow, this hasn't gotten any easier to follow. :')
Any idea what query caused the stackoverflow?
The same query I mentioned in https://github.com/pola-rs/polars/issues/11829 minus the streaming part, with the important caveat that it is running over a large Parquet dataset that's (Hive-)partitioned into thousands of small files.
The same query I mentioned in #11829 minus the streaming part, with the important caveat that it is running over a large Parquet dataset that's (Hive-)partitioned into thousands of small files.
Yes, I estimated that. This is now fixed by #11922. Where instead of a union per file, we create a single source that can handle all those files. So the stackoverflow should be resolved by better handling multi-file datasets.
This currently is only done for parquet files, but I want to implement this for all file types we have. Can you give it a try with a compiled version of main?
Since the original issue that this PR was trying address seems to have been resolved, I will close this. Thanks for the initiative though @LukeMathWalker !