
When creating/inferring schema only, do not buffer stdin

Open corneliusroemer opened this issue 1 year ago • 2 comments

It's safest to infer the schema on the entire dataset.

When the dataset is larger than RAM, this is currently not possible via stdin, because the implementation from #10 and #13 keeps everything used for inference in memory.

In practice, one could stream the dataset via stdin twice: first time to get the schema, second time to convert.

This needs some internal changes so that stdin is not buffered when the options are set to infer the schema only.
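
To illustrate the idea of an inference pass that does not buffer, here is a rough Python sketch. It is not how arrow-tools does it (the names `infer_schema_streaming` and `_type_of`, and the tiny null/int/float/string type lattice, are made up for this example); it only shows that a schema-only pass can keep one piece of state per column and discard each row after looking at it.

```python
import csv
import io
from typing import Dict, Iterable

# Hypothetical widening order: a column's type can only move to a higher rank.
_RANK = {"null": 0, "int": 1, "float": 2, "string": 3}


def _type_of(value: str) -> str:
    """Classify a single non-empty CSV field (illustrative rules only)."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def infer_schema_streaming(lines: Iterable[str]) -> Dict[str, str]:
    """Infer column types from CSV lines without retaining any rows.

    Memory use depends on the number of columns, not the number of rows.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return {}
    types = {name: "null" for name in header}
    for row in reader:
        for name, value in zip(header, row):
            if value == "":
                continue  # assumption: empty fields are nulls and carry no type
            observed = _type_of(value)
            if _RANK[observed] > _RANK[types[name]]:
                types[name] = observed  # widen, never narrow
        # the row goes out of scope here; nothing is accumulated
    return types


# Small demo on an in-memory stream; a real first pass would pass sys.stdin.
demo = io.StringIO("a,b,c\n1,2.5,x\n2,3,\n")
print(infer_schema_streaming(demo))
# prints {'a': 'int', 'b': 'float', 'c': 'string'}
```

With something like this, the first pass over stdin would only produce the schema, and the second pass would do the conversion with that schema supplied, so neither pass needs the whole dataset in memory.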

corneliusroemer, Mar 05 '23 10:03