feat(perf): the default writer should facilitate high throughput of the default reader on most arrays
The concrete problem: the default writer preserves the chunking of its input, but the default reader enforces a batch size that defaults to 64Ki (this is configurable; see vortex-serde/src/layouts/read/mod.rs). The reader achieves this batch size by slicing the array.
In general, slicing has some unavoidable cost, but for Dictionary encoding the cost is particularly severe. Slicing a Dictionary-encoded array preserves the full values array (i.e. the dictionary), and each operation on a slice independently decodes the values it needs, because there is nowhere to cache a shared decoded array. This interacts especially badly with bitpacking, where decoding a single element is, per element, substantially more expensive than decoding a 1024-element chunk.
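To make the failure mode concrete, here is a minimal sketch (plain Rust, not the actual Vortex API; `DictArray`, `slice`, and `decode` are hypothetical names) of a dictionary array where slicing narrows the codes but keeps the whole values buffer, so every slice decodes the dictionary again:

```rust
// Hypothetical model of a Dictionary-encoded array: per-row codes that
// index into a shared `values` buffer (the dictionary).
struct DictArray {
    codes: Vec<u32>,  // one code per row
    values: Vec<i64>, // the dictionary; preserved in full by slicing
}

impl DictArray {
    // Slicing narrows `codes` but carries the entire dictionary along.
    fn slice(&self, start: usize, end: usize) -> DictArray {
        DictArray {
            codes: self.codes[start..end].to_vec(),
            values: self.values.clone(),
        }
    }

    // A compute kernel on a slice decodes the values it needs; there is
    // no place to stash a decoded array shared across slices.
    fn decode(&self) -> Vec<i64> {
        self.codes.iter().map(|&c| self.values[c as usize]).collect()
    }
}

fn main() {
    let n: usize = 8 << 20; // ~8M rows
    let arr = DictArray {
        codes: (0..n as u32).map(|i| i % 1024).collect(),
        values: (0..1024).collect(),
    };
    let batch = 64 * 1024; // the reader's 64Ki batch size
    let mut decodes = 0;
    for start in (0..n).step_by(batch) {
        let s = arr.slice(start, (start + batch).min(n));
        let _ = s.decode(); // each batch independently decodes
        decodes += 1;
    }
    // 8Mi rows / 64Ki rows per batch = 128 independent decodes
    println!("{decodes}");
}
```

Each of the 128 batches pays the full dictionary-decode cost; a shared decoded buffer would pay it once.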
FWIW, right now we rechunk to 64Ki elements when compressing.
I believe we only greedily combine chunks; we don't split large ones. The bad case I had was a ~8M-row array being sliced into ~100 64Ki arrays, each of which had to repeatedly decode the dictionary.
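A rough sketch of what "greedily combine, never split" means here (hypothetical `combine_chunks`; not the Vortex compressor's actual code): small adjacent chunks are merged up toward the target, but a chunk already larger than the target passes through intact, leaving the reader to slice it later.

```rust
// Greedy combining: accumulate adjacent chunk lengths until adding the
// next chunk would exceed `target`, then flush. Oversized chunks are
// never split; they flow through as-is.
fn combine_chunks(chunks: &[usize], target: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut acc = 0usize;
    for &len in chunks {
        if acc > 0 && acc + len > target {
            out.push(acc);
            acc = 0;
        }
        acc += len;
    }
    if acc > 0 {
        out.push(acc);
    }
    out
}

fn main() {
    let target = 64 * 1024;
    // Small chunks get merged toward ~64Ki...
    let merged = combine_chunks(&[20_000, 20_000, 20_000, 20_000], target);
    println!("{merged:?}"); // [60000, 20000]
    // ...but one ~8M-row chunk is untouched; the reader will later slice
    // it into 64Ki batches, each re-decoding the dictionary.
    let big = combine_chunks(&[8 << 20], target);
    println!("{big:?}"); // [8388608]
}
```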
Oh, I had forgotten about this case.
The reader no longer supports an externally requested chunk size, so this particular speed bump is gone.