
Error reading inlined reference data when trying to roundtrip virtual dataset

jacobbieker opened this issue 10 months ago · 4 comments

Hi,

I've been working on using VirtualiZarr to create virtual datasets of GOES/Himawari/GK2A data on AWS. The end goal is to generate virtual references appended along the t dimension for GOES data. One of the main issues I've encountered so far is that after writing a virtual dataset to disk, reading it back results in this error:

NotImplementedError: Reading inlined reference data is currently not supported. [ToDo]

I can read it back with xr.open_dataset, but I eventually want to combine all the virtual references into one large virtual dataset, so I need to be able to open the virtual datasets themselves.
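For context, the combine step described above is essentially an xr.concat along t. A minimal sketch with toy in-memory datasets standing in for the per-file virtual datasets (the variable name "Rad" and the data are illustrative, not taken from the actual files):

```python
import numpy as np
import xarray as xr

# Toy stand-ins for two per-file virtual datasets (illustrative only).
ds1 = xr.Dataset({"Rad": ("t", np.array([1.0]))}, coords={"t": [0]})
ds2 = xr.Dataset({"Rad": ("t", np.array([2.0]))}, coords={"t": [1]})

# Concatenate along the time dimension, as one would with virtual datasets.
combined = xr.concat([ds1, ds2], dim="t")
print(combined.sizes["t"])
```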

import xarray as xr
from virtualizarr import open_virtual_dataset
import s3fs

fs = s3fs.S3FileSystem(anon=True)
filepath = "s3://noaa-goes19/ABI-L1b-RadF/2025/001/00/OR_ABI-L1b-RadF-M6C08_G19_s20250010010205_e20250010019513_c20250010019570.nc"
vd = open_virtual_dataset(
    filepath,
    loadable_variables=["t"],
    reader_options={"storage_options": {"anon": True}},
)
print(vd)
vd.virtualize.to_kerchunk('g19.json', format='json')
d = xr.open_dataset("g19.json", engine="kerchunk", backend_kwargs={"storage_options": {"remote_options": {"anon": True}}})
print(d)
vd1 = open_virtual_dataset("g19.json", filetype="kerchunk")
print(vd1)

I understand that it's not supported yet, but how could I have VirtualiZarr not write inlined data in the first place? Or is there any information or resources pointing to where I could try adding support for reading it?
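For reference, inlined data and virtual chunk references are easy to tell apart in the kerchunk refs mapping itself: virtual chunks are stored as [url, offset, length] lists, while inlined values are plain strings (optionally prefixed with "base64:" for raw bytes). A stdlib-only sketch, with a made-up refs dict standing in for the contents of g19.json:

```python
# Made-up kerchunk-style reference set: virtual chunks are
# [url, offset, length] lists; inlined values are strings,
# optionally "base64:"-prefixed for raw bytes.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "t/0": "base64:AAAAAAAAAAA=",                    # inlined chunk (loaded variable)
        "Rad/0.0": ["s3://bucket/file.nc", 1024, 4096],  # virtual chunk reference
    },
}

# String-valued entries hold inlined data. Metadata keys like ".zgroup"
# are always inlined JSON; it is the inlined *chunk* keys (like "t/0",
# produced by loadable_variables) that trigger the NotImplementedError.
inlined = sorted(k for k, v in refs["refs"].items() if isinstance(v, str))
print(inlined)
```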

xarray version: 2025.1.2
s3fs version: 2025.2.0
virtualizarr version: 1.3.2

jacobbieker · Mar 14 '25 18:03

Also, not sure if this is related, but doing the above and trying to write to a Parquet file also fails:

import xarray as xr
from virtualizarr import open_virtual_dataset
import s3fs

fs = s3fs.S3FileSystem(anon=True)
filepath = "s3://noaa-goes19/ABI-L1b-RadF/2025/001/00/OR_ABI-L1b-RadF-M6C08_G19_s20250010010205_e20250010019513_c20250010019570.nc"
vd = open_virtual_dataset(
    filepath,
    loadable_variables=["t"],
    reader_options={"storage_options": {"anon": True}},
)
print(vd)
vd.virtualize.to_kerchunk('g19.parquet', format='parquet')

with

KeyError: 'algorithm_dynamic_input_data_container/.zarray'

jacobbieker · Mar 14 '25 19:03

This is an important and annoying missing feature when serializing to kerchunk. Thanks for raising it explicitly.

However, if you use Icechunk as a serialization format instead you won't have this problem.
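For anyone landing here, the Icechunk path looks roughly like this. This is a sketch based on the documented Repository/session API, not an official recipe; it needs access to the referenced S3 objects, so it is illustrative only:

```python
import icechunk
from virtualizarr import open_virtual_dataset

# Create a local Icechunk repository to hold the virtual references.
storage = icechunk.local_filesystem_storage("./g19-icechunk")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")

vd = open_virtual_dataset(
    filepath,  # the S3 URL from the snippets above
    loadable_variables=["t"],
    reader_options={"storage_options": {"anon": True}},
)

# Write the virtual dataset into the Icechunk store and commit.
vd.virtualize.to_icechunk(session.store)
session.commit("add GOES-19 virtual references")
```

Because Icechunk stores loaded variables as native chunks rather than inlined kerchunk strings, the roundtrip problem described in this issue does not arise.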

TomNicholas · Mar 14 '25 19:03

Okay, thank you. I did also try using Icechunk for this same example (snippet here: https://github.com/zarr-developers/VirtualiZarr/issues/485#issuecomment-2725543238), but ran into issues decoding the FillValue as well.

jacobbieker · Mar 15 '25 08:03

I found a workaround for the case where you only have inlined coordinates:

import xarray as xr
from virtualizarr import open_virtual_dataset
# Import path assumes the VirtualiZarr v2 parser API.
from virtualizarr.parsers import KerchunkParquetParser

# Assumes a registry (e.g. an ObjectStoreRegistry) has been set up beforehand.
def open_virtual_with_inlined(uri):
    # Load the inlined coordinates eagerly via the kerchunk engine.
    kds = xr.open_dataset(uri, engine="kerchunk")
    # Open everything else virtually, skipping the inlined coordinates.
    mds = open_virtual_dataset(
        uri,
        registry=registry,
        parser=KerchunkParquetParser(skip_variables=list(kds.coords)),
    )
    # Attach the eagerly loaded coordinates to the virtual dataset.
    for k in kds.coords:
        mds.coords[k] = kds[k]
    return mds

The resulting mds can be stored with .vz.to_kerchunk again, and it seems to create inlined bytes for the coords 😆

wachsylon · Sep 04 '25 11:09