
✨ Scan (and split and merge) nodes should emit morsels of 10k(?) rows

Open joocer opened this issue 1 year ago • 0 comments

Morsels are currently the size of the read block; for the exemplar user this is usually blocks of up to 64MB. Tables that exceed 64MB are split into multiple morsels, and each morsel is then processed in series.
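
For reference, capping morsels at a fixed row count is cheap once the data is in Arrow. A minimal sketch, assuming the morsel payload is a `pyarrow.Table` and using the 10k row target from the title; `MORSEL_ROWS` and `split_into_morsels` are illustrative names, not existing Opteryx APIs:

```python
import pyarrow as pa

MORSEL_ROWS = 10_000  # illustrative target, per the 10k(?) suggestion in the title


def split_into_morsels(table: pa.Table, rows: int = MORSEL_ROWS):
    """Yield fixed-size, zero-copy slices of an already-loaded table."""
    for offset in range(0, table.num_rows, rows):
        yield table.slice(offset, rows)
```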

Where the source dataset is larger than 64MB, for example with the billion-row challenge, the entire dataset is loaded and pushed through the execution pipeline together. Whilst the defragmenter may split this if/when it is implemented, the entire dataset is still loaded up front.

The scan nodes should take steps to limit the size of the data they load and the data they pump into the engine. For example, rather than loading a 13GB file and then splitting it into morsels, load the source file in chunks (see the sketch below).
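
A hedged sketch of what a chunked scan could look like for a Parquet source, assuming pyarrow is the reader; `scan_parquet` and `MORSEL_ROWS` are illustrative names, not existing Opteryx APIs:

```python
import pyarrow as pa
import pyarrow.parquet as pq

MORSEL_ROWS = 10_000  # illustrative row cap per morsel


def scan_parquet(path: str, rows: int = MORSEL_ROWS):
    """Stream a Parquet file as row-capped morsels rather than loading it whole."""
    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=rows):
        # each RecordBatch holds at most `rows` rows; hand it to the execution
        # pipeline immediately instead of materialising the full table first
        yield pa.Table.from_batches([batch])
```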

joocer · Jan 25 '24 17:01