VirtualiZarr

Virtualize Parquet?

Open TomNicholas opened this issue 10 months ago • 6 comments

Can we Virtualize Parquet data?

See also this spec for storing raster data in Parquet: https://github.com/CartoDB/raquet

TomNicholas avatar Feb 20 '25 17:02 TomNicholas

Omg, raquet is awesome

I came here to protest at your ambition: gawd, leave Parquet alone, it's already perfect! Glad to find raquet though; I see where it encodes the chunks

mdsumner avatar Mar 20 '25 23:03 mdsumner

So I think maybe we can virtualize Parquet. Crucially (from the parquet docs):

  • Column chunk: A chunk of the data for a particular column. They live in a particular row group and are guaranteed to be contiguous in the file.

  • Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk.

The column chunk being contiguous means it could be fetched with a single HTTP range request. If we map

  • parquet columns -> zarr arrays
  • parquet column chunks -> zarr chunks
  • (maybe) parquet pages -> zarr shards?

then by getting the byte ranges for the column chunks we could write a ParquetParser.
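
A minimal sketch of the byte-range step, assuming pyarrow is used to read the footer. `ChunkRef`, `chunk_start` and `column_chunk_refs` are hypothetical names for illustration, not part of VirtualiZarr or of any existing parser. Note that the bytes in each range are Parquet pages (page headers plus encoded data), so something Parquet-aware would still have to decode them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical record: where one Parquet column chunk lives in the file."""
    column: str
    row_group: int
    offset: int  # first byte of the column chunk
    length: int  # compressed size in bytes


def chunk_start(data_page_offset, dictionary_page_offset=None):
    """Return the first byte of a column chunk.

    If a dictionary page is present it precedes the data pages, so the chunk
    starts there; otherwise it starts at the first data page.
    """
    if dictionary_page_offset is not None and 0 < dictionary_page_offset < data_page_offset:
        return dictionary_page_offset
    return data_page_offset


def column_chunk_refs(path):
    """Read only the footer metadata and list one byte range per column chunk."""
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    md = pq.ParquetFile(path).metadata
    refs = []
    for rg in range(md.num_row_groups):
        row_group = md.row_group(rg)
        for c in range(row_group.num_columns):
            col = row_group.column(c)
            start = chunk_start(col.data_page_offset, col.dictionary_page_offset)
            refs.append(ChunkRef(col.path_in_schema, rg, start, col.total_compressed_size))
    return refs
```

Each `ChunkRef` corresponds to one range request for `offset` through `offset + length - 1`.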

What's interesting is that because Parquet is cloud-optimized, it should be cheap to get those byte ranges, so you could use the parser as a runtime translation layer (see #603).


The downside of doing this is that because zarr doesn't currently have summary statistics, it has nothing to map parquet's row-group statistics to. So scans over the data will be much more wasteful (slower) than scans using the conventional parquet stack. But there might be applications where this isn't so important (e.g. if you're just getting all of the chunks, or doing very simple selection patterns).
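
To make concrete what would be lost, here is a rough sketch of the kind of pruning that row-group statistics allow. `row_groups_to_read` is a hypothetical helper, and the `(min, max)` pairs are assumed to have been read from the Parquet footer beforehand; without an equivalent on the zarr side, every chunk would have to be fetched and filtered.

```python
def row_groups_to_read(stats, lo, hi):
    """Return indices of row groups that may contain values in [lo, hi].

    stats holds one (min, max) tuple per row group, or None where the
    writer stored no statistics (such groups can never be skipped).
    """
    keep = []
    for i, s in enumerate(stats):
        if s is None or (s[1] >= lo and s[0] <= hi):
            keep.append(i)
    return keep
```

For a filter like `12 <= x <= 21`, only the row groups whose `[min, max]` interval overlaps that range need to be read.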

TomNicholas avatar Aug 29 '25 20:08 TomNicholas

But are all of the column chunks in a file guaranteed to be the same size?
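
One way to check this for a given file, sketched under the assumption that "size" here means rows per chunk (compressed byte sizes vary in zarr too, so those are not the constraint). `is_regular_grid` and `row_group_lengths` are hypothetical helpers; the second assumes pyarrow.

```python
def is_regular_grid(row_counts):
    """True if all row groups but the last have the same number of rows
    and the last one is non-empty and no longer than the others."""
    if len(row_counts) <= 1:
        return True
    first = row_counts[0]
    return all(n == first for n in row_counts[:-1]) and 0 < row_counts[-1] <= first


def row_group_lengths(path):
    """Number of rows in each row group, read from the footer only."""
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    md = pq.ParquetFile(path).metadata
    return [md.row_group(i).num_rows for i in range(md.num_row_groups)]
```

A file passes only if `is_regular_grid(row_group_lengths(path))` is true, which depends entirely on how the writer was configured.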

rabernat avatar Aug 29 '25 20:08 rabernat

ah. good point. damn.

TomNicholas avatar Aug 29 '25 20:08 TomNicholas

  • parquet column chunks -> zarr chunks

  • (maybe) parquet pages -> zarr shards?

I think this is reversed. Column chunks contain multiple pages, in the same way that shards contain multiple chunks. (Is chunk -> page better than shard -> chunk? I don't know.)

d-v-b avatar Aug 29 '25 20:08 d-v-b

and we can do variable-length chunks by writing the spec + implementation for that chunk grid extension (maybe a rome task)
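
As a rough illustration of what indexing into variable-length chunks involves (this is not the extension spec, which has not been written in this thread; `chunk_boundaries` and `locate` are hypothetical names), a lookup along one dimension can use cumulative chunk lengths and a binary search:

```python
from bisect import bisect_right
from itertools import accumulate


def chunk_boundaries(chunk_lengths):
    """Cumulative start offsets, with the total length as the final entry."""
    return [0, *accumulate(chunk_lengths)]


def locate(index, boundaries):
    """Map a global index to (chunk number, offset within that chunk)."""
    if not 0 <= index < boundaries[-1]:
        raise IndexError(index)
    chunk = bisect_right(boundaries, index) - 1
    return chunk, index - boundaries[chunk]
```

With row groups of 100, 50 and 120 rows, `chunk_boundaries` gives `[0, 100, 150, 270]`, and row 100 falls at the start of the second chunk.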

d-v-b avatar Aug 29 '25 20:08 d-v-b