feat(perf): the default writer should facilitate high throughput of the default reader on most arrays
The concrete problem: the default writer preserves the chunking of its input, but the default reader enforces a batch size that defaults to 64Ki (this is configurable; see vortex-serde/src/layouts/read/mod.rs). The reader achieves this batch size by slicing the array.
In general, slicing has some unavoidable cost, but for Dictionary encoding the cost is particularly severe. Slicing a Dictionary-encoded array preserves the full values array (i.e. the dictionary), and each operation on a slice independently decodes the values it needs, because there is nowhere to cache a shared decoded array. This interacts especially badly with bitpacking, where decoding a single element is, per element, substantially more expensive than decoding a 1024-element chunk.
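To make the failure mode concrete, here is a minimal sketch (plain Rust, not the actual Vortex API; `DictArray`, `slice`, and `decode` are hypothetical names) of a dictionary array where slicing narrows the codes but keeps the whole values buffer, so every slice decodes the dictionary again:

```rust
// Hypothetical model of a Dictionary-encoded array: per-row codes that
// index into a shared `values` buffer (the dictionary).
struct DictArray {
    codes: Vec<u32>,  // one code per row
    values: Vec<i64>, // the dictionary; preserved in full by slicing
}

impl DictArray {
    // Slicing narrows `codes` but carries the entire dictionary along.
    fn slice(&self, start: usize, end: usize) -> DictArray {
        DictArray {
            codes: self.codes[start..end].to_vec(),
            values: self.values.clone(),
        }
    }

    // A compute kernel on a slice decodes the values it needs; there is
    // no place to stash a decoded array shared across slices.
    fn decode(&self) -> Vec<i64> {
        self.codes.iter().map(|&c| self.values[c as usize]).collect()
    }
}

fn main() {
    let n: usize = 8 << 20; // ~8M rows
    let arr = DictArray {
        codes: (0..n as u32).map(|i| i % 1024).collect(),
        values: (0..1024).collect(),
    };
    let batch = 64 * 1024; // the reader's 64Ki batch size
    let mut decodes = 0;
    for start in (0..n).step_by(batch) {
        let s = arr.slice(start, (start + batch).min(n));
        let _ = s.decode(); // each batch independently decodes
        decodes += 1;
    }
    // 8Mi rows / 64Ki rows per batch = 128 independent decodes
    println!("{decodes}");
}
```

Each of the 128 batches pays the full dictionary-decode cost; a shared decoded buffer would pay it once.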
FWIW, right now we rechunk to 64Ki elements when compressing.
I believe we only greedily combine chunks; we don't split large ones. The bad case I had was a ~8M-row array being sliced into ~100 64Ki arrays, each of which had to repeatedly decode the dictionary.
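A rough sketch of what "greedily combine, never split" means here (hypothetical `combine_chunks`; not the Vortex compressor's actual code): small adjacent chunks are merged up toward the target, but a chunk already larger than the target passes through intact, leaving the reader to slice it later.

```rust
// Greedy combining: accumulate adjacent chunk lengths until adding the
// next chunk would exceed `target`, then flush. Oversized chunks are
// never split; they flow through as-is.
fn combine_chunks(chunks: &[usize], target: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut acc = 0usize;
    for &len in chunks {
        if acc > 0 && acc + len > target {
            out.push(acc);
            acc = 0;
        }
        acc += len;
    }
    if acc > 0 {
        out.push(acc);
    }
    out
}

fn main() {
    let target = 64 * 1024;
    // Small chunks get merged toward ~64Ki...
    let merged = combine_chunks(&[20_000, 20_000, 20_000, 20_000], target);
    println!("{merged:?}"); // [60000, 20000]
    // ...but one ~8M-row chunk is untouched; the reader will later slice
    // it into 64Ki batches, each re-decoding the dictionary.
    let big = combine_chunks(&[8 << 20], target);
    println!("{big:?}"); // [8388608]
}
```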
Oh, I had forgotten about this case.
The reader no longer supports an externally requested chunk size, so this particular speed bump is gone.