do not materialize entire files in `to_record_batches`

Open tom-at-rewbi opened this issue 2 months ago • 0 comments

Rationale for this change

I expected `to_record_batches` to yield batches one at a time rather than materialize an entire Parquet file in memory, but the current implementation explicitly loads the whole file before yielding anything.

This change makes `to_record_batches` iterate over batches lazily.
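To illustrate the difference between the two behaviours, here is a minimal standard-library sketch. It is not the project's code: `read_batches`, `to_batches_eager`, and `to_batches_lazy` are hypothetical names, and `read_batches` only stands in for a reader that produces one batch of a Parquet file at a time.

```python
from typing import Iterator, List


def read_batches(num_batches: int, log: List[str]) -> Iterator[List[int]]:
    """Hypothetical stand-in for a file reader; records each read in `log`."""
    for i in range(num_batches):
        log.append(f"read batch {i}")
        yield [i] * 3


def to_batches_eager(num_batches: int, log: List[str]) -> Iterator[List[int]]:
    # Materializes every batch in a list before yielding the first one.
    batches = list(read_batches(num_batches, log))
    yield from batches


def to_batches_lazy(num_batches: int, log: List[str]) -> Iterator[List[int]]:
    # Passes each batch through as soon as it has been read.
    yield from read_batches(num_batches, log)


eager_log: List[str] = []
first_eager = next(to_batches_eager(4, eager_log))

lazy_log: List[str] = []
first_lazy = next(to_batches_lazy(4, lazy_log))

# Requesting one batch triggers every read in the eager version,
# but only a single read in the lazy version.
print(len(eager_log), len(lazy_log))
```

Both versions hand the caller the same batches in the same order, so the observable difference is limited to when the reads happen and how many batches are held in memory at once.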

Are these changes tested?

I ran `make test`, and all tests passed except the two that import `kerberos`. I do not have it installed, and it does not currently build on my machine.

As for verifying that this reduces memory usage: with this change applied, my data pipeline stopped running out of memory (OOM).
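One possible way to check the memory claim in a repeatable form is to compare peak allocations with `tracemalloc`. The sketch below is an assumption about how such a check could look, not a test included in this change; `read_batches`, `eager`, `lazy`, and `peak_bytes` are hypothetical helpers, and the batches are plain `bytes` objects rather than Arrow record batches.

```python
import tracemalloc
from typing import Callable, Iterator


def read_batches(n: int, size: int) -> Iterator[bytes]:
    """Hypothetical reader: yields `n` batches of `size` bytes each."""
    for _ in range(n):
        yield bytes(size)


def eager(n: int, size: int) -> Iterator[bytes]:
    # Holds all batches in memory before yielding any of them.
    batches = list(read_batches(n, size))
    yield from batches


def lazy(n: int, size: int) -> Iterator[bytes]:
    # Yields each batch as it is produced.
    yield from read_batches(n, size)


def peak_bytes(make_batches: Callable[[int, int], Iterator[bytes]]) -> int:
    """Consume 20 batches of roughly 1 MB and return the peak traced allocation."""
    tracemalloc.start()
    for _batch in make_batches(20, 1_000_000):
        pass  # a real consumer would process the batch here
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


eager_peak = peak_bytes(eager)
lazy_peak = peak_bytes(lazy)
print(f"eager peak: {eager_peak} bytes, lazy peak: {lazy_peak} bytes")
```

Under these assumptions the eager variant peaks near the combined size of all batches, while the lazy variant peaks near the size of one or two batches.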

Are there any user-facing changes?

There should not be any user-facing changes.

Oct 31 '25 23:10 tom-at-rewbi