Respect physical data placement in parquet iterators

Open kolesnikovae opened this issue 6 months ago • 1 comments

A parquet.List of structs colocates fields on the same page. This means that we should never create an individual iterator for each of the columns in such cases (example): otherwise, we fetch the same pages repeatedly.

In addition, the parquet reader issues a separate read operation for every ReadBufferPage size (+ page/group bounds), which prevents efficient streaming of data ranges from object storage.

Jun 02 '25 06:06 kolesnikovae

I don't think your assumption is correct: each column lives physically in separate pages. What I do think happens with repeated/list columns is that the page boundaries are forced to fall on the same row numbers. I'm not sure whether this is a parquet-go limitation or whether it depends on how we write the files.

What we should do instead is take the offset indexes into account and have a "prefetch" at the ReaderAt level, a bit like we do when opening the file with the footer.

Here is what I've done to verify this with a block from ops:

	cc := rowGroup.ColumnChunks()
	for idx, x := range columns {
		// Only look at the two list element columns of interest.
		if idx != 4 && idx != 5 {
			continue
		}

		oidx, err := cc[idx].OffsetIndex()
		require.NoError(t, err)

		pages := oidx.NumPages()
		if pages == 0 {
			t.Logf("No pages for column %d (%v)", idx, x)
			continue
		}

		// Byte range of the column chunk: from the start of the first page
		// to the end of the last page (its offset plus its compressed size).
		offset := oidx.Offset(0)
		length := oidx.Offset(pages-1) - offset + oidx.CompressedPageSize(pages-1)
		t.Logf("Columns %d (%v) pages %d offsetStart %d offsetEnd %d", idx, x, pages, offset, offset+length)
	}

 Columns 4 ([Samples list element StacktraceID])      pages 14 offsetStart 181774  offsetEnd 2778660
 Columns 5 ([Samples list element Value])             pages 14 offsetStart 2778660 offsetEnd 4926313
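
To illustrate the "prefetch" idea mentioned above, here is a minimal sketch using only the standard library. It assumes the byte range (start and length) has already been computed from the offset index as in the snippet above. The names `prefetchReaderAt`, `newPrefetchReaderAt`, `countingReaderAt` and `demo` are hypothetical and are not part of parquet-go or of our code base; this is not how the reader is currently wired.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// countingReaderAt counts ReadAt calls that reach the backing store.
// It exists only to make the effect of the prefetch visible.
type countingReaderAt struct {
	r     io.ReaderAt
	calls int
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	c.calls++
	return c.r.ReadAt(p, off)
}

// prefetchReaderAt (hypothetical) holds one contiguous byte range in memory.
// Reads fully inside that range are served from the buffer; anything else
// is forwarded to the underlying reader.
type prefetchReaderAt struct {
	r     io.ReaderAt
	start int64
	buf   []byte
}

// newPrefetchReaderAt fetches [start, start+length) with a single sequential
// read of the section.
func newPrefetchReaderAt(r io.ReaderAt, start, length int64) (*prefetchReaderAt, error) {
	buf := make([]byte, length)
	if _, err := io.ReadFull(io.NewSectionReader(r, start, length), buf); err != nil {
		return nil, err
	}
	return &prefetchReaderAt{r: r, start: start, buf: buf}, nil
}

func (p *prefetchReaderAt) ReadAt(b []byte, off int64) (int, error) {
	end := p.start + int64(len(p.buf))
	if off >= p.start && off+int64(len(b)) <= end {
		return copy(b, p.buf[off-p.start:]), nil
	}
	return p.r.ReadAt(b, off)
}

// demo prefetches bytes [16, 48) of a 64-byte blob, issues four 8-byte
// "page" reads inside that range, and reports how many reads hit the backend.
func demo() (backendCalls int, data string, err error) {
	blob := []byte("0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef")
	backend := &countingReaderAt{r: bytes.NewReader(blob)}

	pr, err := newPrefetchReaderAt(backend, 16, 32)
	if err != nil {
		return 0, "", err
	}

	var out []byte
	page := make([]byte, 8)
	for off := int64(16); off < 48; off += 8 {
		if _, err := pr.ReadAt(page, off); err != nil {
			return 0, "", err
		}
		out = append(out, page...)
	}
	return backend.calls, string(out), nil
}

func main() {
	calls, data, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(calls, data)
}
```

In this toy setup the four small reads are served from memory, so the backing reader only sees the initial range fetch. A real version would also need to bound memory use and decide which column chunks to prefetch, which this sketch leaves out.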

Jun 02 '25 10:06 simonswine