
How to handle non-JSON serializable attributes?

Open maxrjones opened this issue 5 months ago • 3 comments

The NISAR test currently fails because it has an attribute value of inf (the float), which leads to `ValueError: Out of range float values are not JSON compliant: inf` when trying to write to either Icechunk or Kerchunk. I wonder how we should handle cases of non-JSON-serializable attributes with Zarr V3? Some options:

  • Add a parameter to to_icechunk and to_kerchunk that gives the user the option to raise an error, drop the attribute, or cast to a string
  • Catch the upstream error and raise a more informative error about which variable / attribute is causing the issue
  • Defer to parsers and provide documentation about the requirement for objects to be JSON serializable

Relevant Zarr spec discussion: https://github.com/zarr-developers/zarr-specs/issues/351
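For context, the underlying failure comes straight from Python's `json` module: `inf` and `nan` are not valid JSON values, and a strict encoder (`allow_nan=False`) raises exactly the error quoted above. A minimal sketch of the root cause:

```python
import json
import math

# By default, Python's json module emits the non-standard tokens
# Infinity/NaN, which strict JSON parsers (and stores that require
# spec-compliant metadata) reject.
lenient = json.dumps({"max_value": math.inf})
print(lenient)  # {"max_value": Infinity}

# With allow_nan=False the encoder enforces strict JSON and raises
# the same ValueError seen when writing to Icechunk/Kerchunk.
try:
    json.dumps({"max_value": math.inf}, allow_nan=False)
except ValueError as err:
    print(err)
```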

It's slow to debug over the network, so a recommended approach for an MVCE is to download https://nisar.asf.earthdatacloud.nasa.gov/NISAR-SAMPLE-DATA/GCOV/ALOS1_Rosamond_20081012/NISAR_L2_PR_GCOV_001_005_A_219_4020_SHNA_A_20081012T060910_20081012T060926_P01101_F_N_J_001.h5 and reproduce locally:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "earthaccess",
#     "obstore",
#     "virtualizarr[hdf, icechunk]",
#     "xarray[io]",
#     "zarr>=3.1.3"
# ]
# ///


import xarray as xr
from obstore.store import LocalStore

from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
from icechunk import Repository, RepositoryConfig, Storage, VirtualChunkContainer, local_filesystem_store


def main():
    data_dir = "/Users/max/Documents/Code/zarr-developers/VirtualiZarr/.data/"
    file = "NISAR_L2_PR_GCOV_001_005_A_219_4020_SHNA_A_20081012T060910_20081012T060926_P01101_F_N_J_001.h5"

    config = RepositoryConfig.default()
    config.set_virtual_chunk_container(
        VirtualChunkContainer(
            url_prefix=f"file://{data_dir}",
            store=local_filesystem_store(data_dir),
        ),
    )

    storage = Storage.new_in_memory()
    # create an in-memory icechunk repository that includes the virtual chunk containers
    repo = Repository.create(storage, config)
    session = repo.writable_session("main")

    hdf_group = "science/LSAR/GCOV/grids/frequencyA"
    store = LocalStore()
    registry = ObjectStoreRegistry()
    registry.register("file://", store)
    drop_variables = ["listOfCovarianceTerms", "listOfPolarizations"]
    parser = HDFParser(group=hdf_group, drop_variables=drop_variables)
    with (
        xr.open_dataset(
            f"{data_dir}{file}",
            engine="h5netcdf",
            group=hdf_group,
            drop_variables=drop_variables,
            phony_dims="access",
        ) as dsXR,
        open_virtual_dataset(
            url=f"file://{data_dir}{file}",
            registry=registry,
            parser=parser,
        ) as vds,
    ):
        vds.vz.to_icechunk(session.store)

        with xr.open_zarr(session.store, zarr_format=3, consolidated=False) as dsV:    
            xr.testing.assert_equal(dsXR, dsV)

if __name__ == "__main__":
    main()

maxrjones avatar Jul 20 '25 18:07 maxrjones

My thoughts about that NISAR file are here: https://github.com/zarr-developers/VirtualiZarr/pull/713#pullrequestreview-3036255517.

But I think the parsers option sounds best - if possible we want anything that has already been parsed to be serializable as valid Zarr. Any problems should be surfaced as soon as possible, which presumably here means in the parser, or perhaps even checking that the ManifestStore metadata is true JSON upon construction.
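A sketch of what that early check could look like, assuming the parser can walk each variable's attribute dict before constructing the ManifestStore (the helper name and input shape here are hypothetical, not an existing VirtualiZarr API):

```python
import math
from typing import Any


def find_non_json_attrs(attrs_by_var: dict[str, dict[str, Any]]) -> list[str]:
    """Return 'variable/attribute' paths whose values are not strict JSON.

    Hypothetical validation helper: non-finite floats are the case from
    this issue, but other non-JSON types could be reported the same way.
    """
    bad = []
    for var, attrs in attrs_by_var.items():
        for name, value in attrs.items():
            if isinstance(value, float) and not math.isfinite(value):
                bad.append(f"{var}/{name}")
    return bad


# Surfacing the offending variable/attribute at parse time beats an
# opaque serialization error at write time:
problems = find_non_json_attrs(
    {"HHHH": {"max_value": math.inf, "min_value": 1.76e-10}}
)
print(problems)  # ['HHHH/max_value']
```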

TomNicholas avatar Jul 21 '25 00:07 TomNicholas

this was caused by a regression in zarr python and should be fixed via https://github.com/zarr-developers/zarr-python/pull/3280, which we included in the latest release

d-v-b avatar Aug 08 '25 12:08 d-v-b

FYI I just updated the script in my original comment to use the latest Zarr version and Icechunk syntax to check if this issue can be closed.

Icechunk refuses to write the metadata:

icechunk.IcechunkError:   × bad metadata
  │ 
  │ context:
  │    0: icechunk::store::set
  │            with key="HHHH/zarr.json"
  │              at icechunk/src/store.rs:287
  │ 
  ├─▶ bad metadata
  ╰─▶ expected value at line 41 column 18

This is the metadata that would be stored:

ArrayV3Metadata(
    shape=(6220, 4545),
    data_type=Float32(endianness='little'),
    chunk_grid=RegularChunkGrid(chunk_shape=(98, 143)),
    chunk_key_encoding=DefaultChunkKeyEncoding(separator='/'),
    fill_value=np.float32(0.0),
    codecs=(
        BytesCodec(endian=<Endian.little: 'little'>),
        Zlib(codec_name='numcodecs.zlib', codec_config={'level': 9})
    ),
    attributes={
        '_FillValue': 'AAAAAAAA+H8=',
        'grid_mapping': 'projection',
        'long_name': 'radar backscatter gamma0',
        'max_value': inf,
        'mean_value': nan,
        'min_value': 1.7640537641749887e-10,
        'sample_standard_deviation': nan,
        'units': ' ',
        'valid_max': nan,
        'valid_min': nan
    },
    dimension_names=('yCoordinates', 'xCoordinates'),
    zarr_format=3,
    node_type='array',
    storage_transformers=()
)

I'm not sure off-hand whether this is now a virtualizarr, zarr, or icechunk problem, and I probably won't have time to look into it in the near future.
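If the parser ends up keeping these attributes, one user-side stopgap is to cast the non-finite values to strings before writing (a sketch of the cast-to-string option from the original list; the example attrs mirror the metadata above):

```python
import math


def sanitize_attrs(attrs: dict) -> dict:
    """Cast non-finite float attributes to strings so the attrs dict
    serializes as strict JSON. Dropping the attribute instead would be
    a one-line change to the dict comprehension."""
    return {
        k: (str(v) if isinstance(v, float) and not math.isfinite(v) else v)
        for k, v in attrs.items()
    }


attrs = {"max_value": math.inf, "mean_value": math.nan, "units": " "}
print(sanitize_attrs(attrs))
# {'max_value': 'inf', 'mean_value': 'nan', 'units': ' '}
```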

maxrjones avatar Nov 03 '25 21:11 maxrjones