
exploring virtual zarr

Open mdsumner opened this issue 8 months ago • 0 comments

Posting this just to share what I've learned. This is NOT a feature request or bug report; I just want to leave my understanding and notes somewhere relevant. (Thanks!)

pizzarr is currently focused on "real Zarr" that has normal chunk data at endpoints like "varname/0.0..."

There's also virtual-zarr, which consists of references to legacy files and the byte ranges of the encoded chunks within them. There are JSON, Parquet, and now Icechunk variants of this.

GDAL is working towards supporting the JSON and Parquet variants in this PR: https://github.com/OSGeo/gdal/pull/12099

When we open this Parquet store

library(pizzarr)
u <- "https://projects.pawsey.org.au/vzarr/NSIDC_SEAICE_PS_S25km.parquet"
z <- zarr_open(u)
z$get_item("x")$as.array()

we get to pizzarr code at https://github.com/keller-mark/pizzarr/blob/5f4705789ba29344dcb97023f290a26a7f7abd0c/R/stores.R#L378

path <- paste(private$base_path, key, sep="/")

which now looks for vzarr/NSIDC_SEAICE_PS_S25km.parquet/x/0 (at https://projects.pawsey.org.au/), but this doesn't exist. In kerchunk terms we need to look into

vzarr/NSIDC_SEAICE_PS_S25km.parquet/x/refs.0.parq

which contains one row holding the actual binary-encoded chunk for x, inlined in the raw column:

arrow::read_parquet("https://projects.pawsey.org.au/vzarr/NSIDC_SEAICE_PS_S25km.parquet/x/refs.0.parq")[1, ]
  path offset size raw
1 <NA>      0    0 00, 00, 00, 00, 6e, 0a, 4e, c1, 00, 00, 00, 00, 9a, d9, 4d, c1, 00, 00, 00, 00, c6, a8, 4d, c1, 00, 00, 00, 00, f2, 77 ...
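As a quick sanity check on what the inlined bytes hold, the first 16 bytes printed above can be decoded directly. This is a minimal sketch that assumes the x array is stored as little-endian float64 with no compressor or filters; that would normally be confirmed from the array metadata, which is not shown here. The variable names are only illustrative.

```r
# First 16 bytes of the inlined `raw` value for x, copied from the output above.
raw_head <- as.raw(c(
  0x00, 0x00, 0x00, 0x00, 0x6e, 0x0a, 0x4e, 0xc1,
  0x00, 0x00, 0x00, 0x00, 0x9a, 0xd9, 0x4d, 0xc1
))

# ASSUMPTION: dtype is "<f8" (little-endian double) and the chunk is uncompressed.
x_head <- readBin(raw_head, what = "double", n = 2, size = 8, endian = "little")
print(x_head)
# -3937500 -3912500
```

The two values are 25000 apart, which is at least consistent with the 25 km grid named in the store.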

If the variable was not loaded (inlined), raw would be NULL and the rows would instead hold references to the actual NetCDF files, with a byte range (offset and size) for each chunk. These look like:

arrow::read_parquet("https://projects.pawsey.org.au/vzarr/NSIDC_SEAICE_PS_S25km.parquet/ICECON/refs.0.parq")[1:5, ]
                                                                                                                           path offset  size  raw
1 s3://idea-10.5067-mpyg15waa4wx/n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002/1978.10.26/NSIDC0051_SEAICE_PS_S25km_19781026_v2.0.nc  48113 29672 NULL
2 s3://idea-10.5067-mpyg15waa4wx/n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002/1978.10.28/NSIDC0051_SEAICE_PS_S25km_19781028_v2.0.nc  48113 30610 NULL
3 s3://idea-10.5067-mpyg15waa4wx/n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002/1978.10.30/NSIDC0051_SEAICE_PS_S25km_19781030_v2.0.nc  48113 30410 NULL
4 s3://idea-10.5067-mpyg15waa4wx/n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002/1978.11.01/NSIDC0051_SEAICE_PS_S25km_19781101_v2.0.nc  48113 30281 NULL
5 s3://idea-10.5067-mpyg15waa4wx/n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002/1978.11.03/NSIDC0051_SEAICE_PS_S25km_19781103_v2.0.nc  48113 30288 NULL

Please note that the s3:// path references there assume configuration for non-AWS object storage:

"AWS_S3_ENDPOINT", "projects.pawsey.org.au"
"AWS_VIRTUAL_HOSTING", "FALSE"
"AWS_NO_SIGN_REQUEST", "YES"
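For a reader that only speaks HTTP, one of the reference rows above could be turned into a plain range request. The sketch below assumes that, with virtual hosting off and unsigned requests allowed, the objects are reachable with path-style addressing at https://<endpoint>/<bucket>/<key>; that has not been verified against this server. The function name ref_to_request is hypothetical and not part of pizzarr.

```r
# Hypothetical helper: turn an s3:// reference row into a path-style HTTPS URL
# plus the HTTP Range header value covering one encoded chunk.
ref_to_request <- function(path, offset, size,
                           endpoint = "projects.pawsey.org.au") {
  stopifnot(startsWith(path, "s3://"), size > 0)
  # ASSUMPTION: path-style addressing (AWS_VIRTUAL_HOSTING=FALSE), so the
  # bucket is simply the first path segment after the endpoint.
  url <- paste0("https://", endpoint, "/", substring(path, nchar("s3://") + 1))
  # HTTP byte ranges are inclusive at both ends.
  range <- sprintf("bytes=%.0f-%.0f", offset, offset + size - 1)
  list(url = url, range = range)
}

# First ICECON reference row from the table above.
req <- ref_to_request(
  "s3://idea-10.5067-mpyg15waa4wx/n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002/1978.10.26/NSIDC0051_SEAICE_PS_S25km_19781026_v2.0.nc",
  offset = 48113, size = 29672
)
```

The bytes returned for that range would still be an encoded chunk, so any compressor or filter named in the array metadata has to be applied before the values are usable.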

I'm worried that these formats are not part of the Zarr specification. They seem to be a convenience serialization for xarray and a way to sidestep the requirement to use the actual format libraries (and it's very fast compared to that). It seems a bit sloppy that the chunk index is not formally stored in the Parquet, so there must be a full set of rows in the table(s), and the chunk index is assumed to increment in a standard way for any shape. It also seems to be undefined where to find the references for Parquet (in JSON they are elements named by chunk index). I think you are expected to assume that the list of refs.[%i].parq files extends for as long as needed, each holding a block of chunk references (defaults to 100K).
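To make that implicit indexing concrete, here is a minimal sketch of how a chunk key might be mapped to a refs file and a row within it. It assumes row-major (C order) numbering of chunks over the chunk grid and a fixed number of rows per file, which is the convention as understood here and not something taken from a formal specification. The function name chunk_ref_location and the shape and chunk sizes used in the example are hypothetical.

```r
# Hypothetical helper: locate the Parquet reference row for a Zarr chunk key.
# ASSUMPTIONS: chunks are numbered in row-major (C) order over the chunk grid,
# and each refs.<n>.parq file holds `record_size` consecutive rows.
chunk_ref_location <- function(key, shape, chunks, record_size = 100000) {
  parts <- strsplit(key, "/", fixed = TRUE)[[1]]
  varname <- paste(head(parts, -1), collapse = "/")
  idx <- as.integer(strsplit(tail(parts, 1), ".", fixed = TRUE)[[1]])

  grid <- ceiling(shape / chunks)  # number of chunks along each dimension
  stopifnot(length(idx) == length(grid), all(idx >= 0), all(idx < grid))

  # Row-major strides: the last dimension varies fastest.
  strides <- rev(cumprod(c(1, rev(grid)[-length(grid)])))
  linear <- sum(idx * strides)     # zero-based chunk number

  list(
    file = sprintf("%s/refs.%d.parq", varname, as.integer(linear %/% record_size)),
    row  = linear %% record_size + 1  # one-based row, as R indexes data frames
  )
}

# Illustrative shapes only: a 1-D coordinate held in a single chunk.
loc <- chunk_ref_location("x/0", shape = 316, chunks = 316)
# loc$file is "x/refs.0.parq" and loc$row is 1
```

Under these assumptions the key x/0 resolves to the first row of x/refs.0.parq, which matches the file read earlier in this issue.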

So, thanks for listening, just a note and a query as to whether this has been considered. I'd be interested to help extend pizzarr for this format, though that will be fairly challenging for me.

mdsumner, Apr 13 '25 03:04