Respect physical data placement in parquet iterators
A parquet.List of structs colocates its fields on the same page. This means that we should never create an individual iterator for each of the columns in such cases (example): in practice, we end up fetching the same pages repeatedly.
In addition, the parquet reader issues a read operation for every ReadBufferPage size (plus page/group bounds), which prevents efficient streaming of data ranges from object storage.
I don't think your assumption is correct: each column lives physically in separate pages. What I think happens with repeated/list columns is that the page boundaries are forced to fall at the same row numbers. I'm not sure whether this is a parquet-go limitation or whether it depends on how we write the files.
What we should do instead is take the offset indexes into account and add a "prefetch" at the ReaderAt level, a bit like what we do with the footer when opening the file.
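To make the prefetch idea concrete, here is a minimal sketch, assuming the byte range of a column chunk is already known from its offset index. The names `prefetchReaderAt`, `newPrefetchReaderAt` and `countingReaderAt` are hypothetical and not part of parquet-go or of this code base; the in-memory reader only stands in for object storage so that the number of read operations can be observed.

```go
package main

import (
	"errors"
	"io"
)

// prefetchReaderAt is a hypothetical wrapper: it loads one contiguous byte
// range up front and serves ReadAt calls that fall inside it from memory.
type prefetchReaderAt struct {
	src   io.ReaderAt // underlying reader, e.g. an object storage client
	start int64       // file offset of the first buffered byte
	buf   []byte      // buffered bytes covering [start, start+len(buf))
}

// newPrefetchReaderAt reads [offset, offset+length) with a single ReadAt call.
func newPrefetchReaderAt(src io.ReaderAt, offset, length int64) (*prefetchReaderAt, error) {
	buf := make([]byte, length)
	n, err := src.ReadAt(buf, offset)
	// io.ReaderAt may return io.EOF together with a complete read.
	if err != nil && !(errors.Is(err, io.EOF) && int64(n) == length) {
		return nil, err
	}
	return &prefetchReaderAt{src: src, start: offset, buf: buf}, nil
}

// ReadAt serves the request from the buffer when it is fully covered and
// falls back to the underlying reader otherwise.
func (r *prefetchReaderAt) ReadAt(p []byte, off int64) (int, error) {
	end := off + int64(len(p))
	if off >= r.start && end <= r.start+int64(len(r.buf)) {
		return copy(p, r.buf[off-r.start:]), nil
	}
	return r.src.ReadAt(p, off)
}

// countingReaderAt is an in-memory stand-in for object storage that counts
// how many read operations reach it.
type countingReaderAt struct {
	data  []byte
	calls int
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	c.calls++
	if off < 0 || off >= int64(len(c.data)) {
		return 0, io.EOF
	}
	n := copy(p, c.data[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}
```

With something like this, every page read that lands inside the prefetched range costs no extra round trip, and only reads outside the range reach the underlying storage. A real implementation would also need to bound memory use and decide when to drop the buffer.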
Here is what I've done to verify the page layout with a block from ops:
// Column chunks of the row group under test.
cc := rowGroup.ColumnChunks()
for idx, x := range colunms {
	// Only inspect the two Samples list columns (StacktraceID and Value).
	if idx != 4 && idx != 5 {
		continue
	}
	oidx, err := cc[idx].OffsetIndex()
	require.NoError(t, err)
	pages := oidx.NumPages()
	if pages == 0 {
		t.Logf("No pages for column %d (%v)", idx, x)
		continue
	}
	// Byte range of the column chunk: from the first page's offset to the end of the last page.
	offset := oidx.Offset(0)
	length := oidx.Offset(pages-1) - offset + oidx.CompressedPageSize(pages-1)
	t.Logf("Columns %d (%v) pages %d offsetStart %d offsetEnd %d", idx, x, pages, offset, offset+length)
}
Columns 4 ([Samples list element StacktraceID]) pages 14 offsetStart 181774 offsetEnd 2778660
Columns 5 ([Samples list element Value]) pages 14 offsetStart 2778660 offsetEnd 4926313
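In this output the two columns occupy separate, back-to-back byte ranges: column 4 ends at 2778660, exactly where column 5 starts. When both columns are needed, adjacent ranges like these could be fetched with one request. Below is a minimal sketch of such coalescing; `byteRange`, `coalesce` and the `maxGap` parameter are hypothetical names used for illustration only.

```go
package main

import "sort"

// byteRange is a half-open byte range [start, end) within the file.
type byteRange struct{ start, end int64 }

// coalesce merges ranges that overlap or are separated by at most maxGap
// bytes, so that each merged range can be fetched with a single read.
func coalesce(ranges []byteRange, maxGap int64) []byteRange {
	if len(ranges) == 0 {
		return nil
	}
	// Work on a sorted copy so the caller's slice is left untouched.
	sorted := append([]byteRange(nil), ranges...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })

	merged := []byteRange{sorted[0]}
	for _, r := range sorted[1:] {
		last := &merged[len(merged)-1]
		if r.start-last.end <= maxGap {
			// Close enough to the previous range: extend it.
			if r.end > last.end {
				last.end = r.end
			}
			continue
		}
		merged = append(merged, r)
	}
	return merged
}
```

Applied to the two ranges logged above with `maxGap` set to 0, this yields the single range from 181774 to 4926313. A non-zero `maxGap` trades some wasted bytes for fewer requests, which is usually a good trade on object storage.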