
Execute all readColumnChunk concurrently for a given RowGroup

Open ZJONSSON opened this issue 7 years ago • 5 comments

Offers a significant speed improvement when the reader has slow I/O (e.g. reading over the network instead of from disk)
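A minimal sketch of the idea in the issue title: issue all column-chunk reads for a row group at once instead of awaiting them one by one. The names `readColumnChunk` and `rowGroup.columns` here are assumptions for illustration, not the actual parquetjs internals.

```javascript
// Sequential: each column chunk read waits for the previous one,
// so total latency is the sum of all round trips.
async function readRowGroupSequential(reader, rowGroup) {
  const chunks = [];
  for (const column of rowGroup.columns) {
    chunks.push(await reader.readColumnChunk(column));
  }
  return chunks;
}

// Concurrent: all reads are issued up front; over a slow network
// transport the total latency approaches the slowest single read.
async function readRowGroupConcurrent(reader, rowGroup) {
  return Promise.all(
    rowGroup.columns.map((column) => reader.readColumnChunk(column))
  );
}
```

Both variants resolve to the chunks in column order, since `Promise.all` preserves the order of its input array.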

ZJONSSON avatar Jan 22 '18 21:01 ZJONSSON

LGTM. If I/O reduction over the network is a concern, we could also optionally disable reading the header entirely as it is just a sanity check and not required to understand the file.

asmuth avatar Feb 11 '18 20:02 asmuth

However, we might also want to benchmark this on files backed by a spinning disk and/or give the user the option to disable parallel/out-of-order reading. I'm not sure off the top of my head whether our writer does it, but other writers might write the column chunks in order in the data file so that readers can benefit from read-ahead optimization.

asmuth avatar Feb 11 '18 21:02 asmuth

@ZJONSSON, @asmuth and I were discussing this, and we want to add two new classes: one for parallel reading and one for sequential reading. @asmuth has serious concerns regarding disk performance.

We also need a benchmark test suite to make sure we are indeed improving stuff and in which scenarios.

We're gonna spend some time tomorrow morning doing this.

kessler avatar Feb 11 '18 21:02 kessler

I agree that concurrency should not be unbounded. However, I think there are better ways to control it than hard-coding sequential execution for tasks that could run in parallel. One way to cap concurrent reads would be to wrap the get method in a simple queue where the maximum concurrency is defined in options (with a sensible default).
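The queue idea above could be sketched as follows. This assumes a low-level `get(offset, length)` returning a promise; the `maxConcurrency` option name and its default are illustrative.

```javascript
// Wrap get(offset, length) so that at most `maxConcurrency` reads
// are in flight at once; further requests wait in a FIFO queue.
function limitConcurrency(get, maxConcurrency = 8) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= maxConcurrency || waiting.length === 0) return;
    active++;
    const { offset, length, resolve, reject } = waiting.shift();
    get(offset, length)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next(); // a slot freed up; start the next queued read
      });
  };

  return (offset, length) =>
    new Promise((resolve, reject) => {
      waiting.push({ offset, length, resolve, reject });
      next();
    });
}
```

Callers use the wrapped function exactly like the original `get`; only the scheduling changes, not the results.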

Additionally, the number of actual requests could be reduced by inspecting any simultaneous requests (in the get wrapper) to see whether any of the requested buffers overlap (or are sufficiently close), in which case a single underlying get is executed and the individual get promises are resolved from the corresponding parts of the incoming buffer. Ideally we would stream the buffer and resolve the parts as we receive them, instead of waiting for the whole buffer to load.

ZJONSSON avatar Feb 12 '18 01:02 ZJONSSON

On the second point, here is a quick branch (very much WIP) with the optimization of simultaneous requests. Reads with close-to-consecutive segments, i.e. where the offset of the next request is close to the offset + length of the previous one, are bundled into a single read.
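The bundling described above could look something like this sketch (not the branch's actual code). It assumes `get(offset, length)` resolves to a `Buffer` covering that range; the `maxGap` threshold name is illustrative.

```javascript
// Merge near-consecutive read requests into one underlying get()
// per group, then hand each caller its slice of the combined buffer.
async function coalescedRead(get, requests, maxGap = 1024) {
  // Sort by offset, then group ranges whose gap is <= maxGap.
  const sorted = [...requests].sort((a, b) => a.offset - b.offset);
  const groups = [];
  for (const req of sorted) {
    const last = groups[groups.length - 1];
    if (last && req.offset - (last.offset + last.length) <= maxGap) {
      // Extend the group to also cover this request.
      last.length = Math.max(last.length, req.offset + req.length - last.offset);
      last.members.push(req);
    } else {
      groups.push({ offset: req.offset, length: req.length, members: [req] });
    }
  }
  // One underlying read per group; slice out each member's bytes.
  const out = new Map();
  await Promise.all(
    groups.map(async (g) => {
      const buf = await get(g.offset, g.length);
      for (const m of g.members) {
        const start = m.offset - g.offset;
        out.set(m, buf.slice(start, start + m.length));
      }
    })
  );
  return requests.map((r) => out.get(r));
}
```

Note the trade-off: a nonzero `maxGap` fetches some unneeded bytes between segments in exchange for fewer round trips, which is usually a win over high-latency transports.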

ZJONSSON avatar Feb 13 '18 22:02 ZJONSSON