Virtualize Parquet?
Can we virtualize Parquet data?
See also this spec for storing raster data in Parquet: https://github.com/CartoDB/raquet
Omg, raquet is awesome
I came here to protest at your ambition, gawd, leave Parquet alone, it's already perfect! Glad to find raquet though, I see where it encodes the chunks
So I think maybe we can virtualize Parquet. Crucially (from the parquet docs):
Column chunk: A chunk of the data for a particular column. They live in a particular row group and are guaranteed to be contiguous in the file.
Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk.
The column chunk being contiguous means it could be fetched with a single HTTP range request. If we map
- parquet columns -> zarr arrays
- parquet column chunks -> zarr chunks
- (maybe) parquet pages -> zarr shards?
then by getting the byte ranges for the column chunks we could write a ParquetParser.
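A minimal sketch of where those byte ranges could come from, assuming pyarrow's Parquet metadata objects (`FileMetaData.row_group(i).column(j)` with `data_page_offset`, `dictionary_page_offset` and `total_compressed_size`). The names `column_chunk_byte_ranges` and `byte_ranges_for_file` and the returned tuple layout are made up for illustration; this is not an existing VirtualiZarr API.

```python
def column_chunk_byte_ranges(metadata):
    """Yield (column_path, row_group_index, offset, length) per column chunk.

    `metadata` is expected to behave like a pyarrow.parquet.FileMetaData.
    """
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            col = row_group.column(col_index)
            # A column chunk starts at its dictionary page if it has one,
            # otherwise at its first data page.
            start = col.data_page_offset
            dict_offset = col.dictionary_page_offset
            if dict_offset is not None and 0 < dict_offset < start:
                start = dict_offset
            yield col.path_in_schema, rg_index, start, col.total_compressed_size


def byte_ranges_for_file(path):
    # Imported lazily so the generator above has no hard dependency.
    import pyarrow.parquet as pq

    return list(column_chunk_byte_ranges(pq.ParquetFile(path).metadata))
```

Each tuple is roughly what a chunk manifest entry would need (path, offset, length). Note that the bytes in each range are still Parquet pages (page headers, encodings, compression), so something on the Zarr side would have to decode them.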
What's interesting is that because Parquet is cloud-optimized, it should be cheap to get those byte ranges, so you could use the parser as a runtime translation layer (see #603).
The downside of doing this is that because zarr doesn't currently have summary statistics, it has nothing to map parquet's row-group statistics to. So scans over the data will be much more wasteful (slower) than scans using the conventional parquet stack. But there might be applications where this isn't so important (e.g. if you're just getting all of the chunks, or doing very simple selection patterns).
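To make concrete what would be lost: a conventional Parquet reader can skip a whole row group when its min/max statistics rule out a predicate. A rough sketch of that pruning step, using plain `(min, max)` tuples as a stand-in for the per-column-chunk statistics (pyarrow exposes these on `ColumnChunkMetaData.statistics`); `row_groups_to_read` is a hypothetical helper name.

```python
def row_groups_to_read(stats, lo, hi):
    """Return indices of row groups whose [min, max] overlaps [lo, hi].

    `stats` holds one (min, max) tuple per row group, or None when the
    writer recorded no statistics; such row groups must always be read.
    """
    keep = []
    for index, entry in enumerate(stats):
        if entry is None:
            keep.append(index)
            continue
        col_min, col_max = entry
        if col_max >= lo and col_min <= hi:
            keep.append(index)
    return keep


stats = [(0, 9), (10, 19), None, (30, 39)]
print(row_groups_to_read(stats, 12, 15))  # [1, 2]
```

With nothing equivalent in the Zarr metadata, a reader going through the virtual Zarr store would have to fetch all four chunks here.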
But are all of the column chunks in a file guaranteed to be the same size?
ah. good point. damn.
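Whether a particular file happens to be regular can at least be checked from its metadata. A small sketch, taking the per-row-group row counts as a plain list (with pyarrow these would be `metadata.row_group(i).num_rows`); `regular_chunk_length` is a hypothetical name. The `allow_short_last` option assumes the decoder would pad a shorter final row group, since Zarr expects edge chunks to decode to the full chunk shape.

```python
def regular_chunk_length(row_counts, allow_short_last=True):
    """Return the common row-group length, or None if the grid is irregular."""
    if not row_counts:
        return None
    first = row_counts[0]
    # Optionally exempt the final row group, which writers often leave short.
    body = row_counts[:-1] if allow_short_last else row_counts
    if any(n != first for n in body):
        return None
    if row_counts[-1] > first:
        return None
    return first


print(regular_chunk_length([1000, 1000, 1000, 250]))  # 1000
print(regular_chunk_length([1000, 800, 1000]))        # None
```

This only looks at row counts. The compressed byte sizes of column chunks will differ anyway, which is fine as long as each one is recorded as its own offset and length.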
> parquet column chunks -> zarr chunks
> (maybe) parquet pages -> zarr shards?
I think this is reversed. column chunks contain multiple pages, in the same way that shards contain multiple chunks. (is chunk -> page better than shard -> chunk? I don't know.)
and we can do variable-length chunks by writing the spec + implementation for that chunk grid extension (maybe a rome task)
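For the variable-length case, the chunk grid along the row axis would just be the row-group lengths themselves. A sketch of turning those into chunk edges and finding the chunk that holds a given row, assuming plain Python lists; the function names are hypothetical and this does not follow any finalized Zarr extension spec.

```python
from bisect import bisect_right
from itertools import accumulate


def chunk_edges(row_counts):
    """Cumulative row offsets: edges[i] is the first row of chunk i."""
    return [0, *accumulate(row_counts)]


def chunk_for_row(edges, row):
    """Return (chunk_index, offset_within_chunk) for a global row index."""
    if not 0 <= row < edges[-1]:
        raise IndexError(f"row {row} out of range")
    index = bisect_right(edges, row) - 1
    return index, row - edges[index]


edges = chunk_edges([1000, 800, 1200])
print(edges)                       # [0, 1000, 1800, 3000]
print(chunk_for_row(edges, 1799))  # (1, 799)
```

The lookup is a binary search rather than the integer division a regular grid allows, so indexing is slightly more expensive but still cheap.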