How to handle non-JSON serializable attributes?
The NISAR test currently fails because it has an attribute value of `inf` (the float), which leads to `ValueError: Out of range float values are not JSON compliant: inf` when trying to write to either Icechunk or Kerchunk. I wonder how we should handle cases of non-JSON-serializable attributes with Zarr V3? Some options:
- Add a parameter to `to_icechunk` and `to_kerchunk` that gives the user the option to raise an error, drop the attribute, or cast it to a string
- Catch the upstream error and raise a more informative error about which variable / attribute is causing the issue
- Defer to parsers and provide documentation about the requirement for objects to be JSON serializable
Relevant Zarr spec discussion: https://github.com/zarr-developers/zarr-specs/issues/351
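The first option could look something like the sketch below: a hypothetical `sanitize_attrs` helper (not part of VirtualiZarr) that applies a user-chosen policy to non-finite floats before serialization. The name and the `on_invalid` parameter are assumptions for illustration only.

```python
import json
import math


def sanitize_attrs(attrs, on_invalid="stringify"):
    """Handle attribute values that strict JSON cannot represent (inf/nan).

    Hypothetical helper sketching the "parameter on to_icechunk/to_kerchunk"
    option. on_invalid is one of "raise", "drop", or "stringify".
    """
    out = {}
    for key, value in attrs.items():
        if isinstance(value, float) and not math.isfinite(value):
            if on_invalid == "raise":
                raise ValueError(
                    f"Attribute {key!r} has non-JSON-serializable value {value!r}"
                )
            if on_invalid == "drop":
                continue
            value = str(value)  # "inf", "-inf", or "nan"
        out[key] = value
    return out


attrs = {"max_value": float("inf"), "min_value": 1.7640537641749887e-10}
clean = sanitize_attrs(attrs, on_invalid="stringify")
# allow_nan=False enforces strict JSON, so this only succeeds post-sanitization
json.dumps(clean, allow_nan=False)
```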
It's slow to debug over the network, so a recommended approach for an MVCE is to download https://nisar.asf.earthdatacloud.nasa.gov/NISAR-SAMPLE-DATA/GCOV/ALOS1_Rosamond_20081012/NISAR_L2_PR_GCOV_001_005_A_219_4020_SHNA_A_20081012T060910_20081012T060926_P01101_F_N_J_001.h5 and reproduce locally:
```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "earthaccess",
#     "obstore",
#     "virtualizarr[hdf, icechunk]",
#     "xarray[io]",
#     "zarr>=3.1.3",
# ]
# ///
import xarray as xr
from obstore.store import LocalStore
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
from icechunk import (
    Repository,
    Storage,
    local_filesystem_storage,
    RepositoryConfig,
    VirtualChunkContainer,
    local_filesystem_store,
)


def main():
    data_dir = "/Users/max/Documents/Code/zarr-developers/VirtualiZarr/.data/"
    file = "NISAR_L2_PR_GCOV_001_005_A_219_4020_SHNA_A_20081012T060910_20081012T060926_P01101_F_N_J_001.h5"

    config = RepositoryConfig.default()
    config.set_virtual_chunk_container(
        VirtualChunkContainer(
            url_prefix=f"file://{data_dir}",
            store=local_filesystem_store(data_dir),
        ),
    )
    storage = Storage.new_in_memory()
    # create an in-memory icechunk repository that includes the virtual chunk containers
    repo = Repository.create(storage, config)
    session = repo.writable_session("main")

    hdf_group = "science/LSAR/GCOV/grids/frequencyA"
    store = LocalStore()
    registry = ObjectStoreRegistry()
    registry.register("file://", store)
    drop_variables = ["listOfCovarianceTerms", "listOfPolarizations"]
    parser = HDFParser(group=hdf_group, drop_variables=drop_variables)
    with (
        xr.open_dataset(
            f"{data_dir}{file}",
            engine="h5netcdf",
            group=hdf_group,
            drop_variables=drop_variables,
            phony_dims="access",
        ) as dsXR,
        open_virtual_dataset(
            url=f"file://{data_dir}{file}",
            registry=registry,
            parser=parser,
        ) as vds,
    ):
        vds.vz.to_icechunk(session.store)
        with xr.open_zarr(session.store, zarr_format=3, consolidated=False) as dsV:
            xr.testing.assert_equal(dsXR, dsV)


if __name__ == "__main__":
    main()
```
My thoughts about that NISAR file are here: https://github.com/zarr-developers/VirtualiZarr/pull/713#pullrequestreview-3036255517.
But I think the parsers option sounds best - if possible we want anything that has already been parsed to be serializable as valid Zarr. Any problems should be surfaced as early as possible, which presumably means in the parser, or perhaps even by checking that the ManifestStore metadata is valid JSON upon construction.
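That construction-time check could be as simple as a strict round-trip through the stdlib `json` module, which rejects `inf`/`nan` when `allow_nan=False` - matching the Zarr V3 requirement that attributes be valid JSON. The function name and `context` parameter below are assumptions for illustration, not existing VirtualiZarr API:

```python
import json


def assert_json_serializable(metadata_dict, context=""):
    """Fail fast if metadata contains values strict JSON cannot represent.

    Hypothetical validation a parser (or the ManifestStore constructor)
    could run; json.dumps(..., allow_nan=False) rejects inf/nan.
    """
    try:
        json.dumps(metadata_dict, allow_nan=False)
    except ValueError as err:
        raise ValueError(
            f"Metadata for {context or 'node'} is not valid JSON: {err}"
        ) from err


# Finite values pass silently; an inf/nan attribute raises with the node name.
assert_json_serializable({"min_value": 1.7640537641749887e-10}, context="HHHH")
```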
This was caused by a regression in Zarr Python and should be fixed via https://github.com/zarr-developers/zarr-python/pull/3280, which we included in the latest release.
FYI I just updated the script in my original comment to use the latest Zarr version and Icechunk syntax to check if this issue can be closed.
Icechunk refuses to write the metadata:
```
icechunk.IcechunkError: × bad metadata
│
│ context:
│    0: icechunk::store::set
│         with key="HHHH/zarr.json"
│       at icechunk/src/store.rs:287
│
├─▶ bad metadata
╰─▶ expected value at line 41 column 18
```
This is the metadata that would be stored:
```python
ArrayV3Metadata(
    shape=(6220, 4545),
    data_type=Float32(endianness='little'),
    chunk_grid=RegularChunkGrid(chunk_shape=(98, 143)),
    chunk_key_encoding=DefaultChunkKeyEncoding(separator='/'),
    fill_value=np.float32(0.0),
    codecs=(
        BytesCodec(endian=<Endian.little: 'little'>),
        Zlib(codec_name='numcodecs.zlib', codec_config={'level': 9})
    ),
    attributes={
        '_FillValue': 'AAAAAAAA+H8=',
        'grid_mapping': 'projection',
        'long_name': 'radar backscatter gamma0',
        'max_value': inf,
        'mean_value': nan,
        'min_value': 1.7640537641749887e-10,
        'sample_standard_deviation': nan,
        'units': ' ',
        'valid_max': nan,
        'valid_min': nan
    },
    dimension_names=('yCoordinates', 'xCoordinates'),
    zarr_format=3,
    node_type='array',
    storage_transformers=()
)
```
I'm not sure off-hand if this is a virtualizarr, zarr, or icechunk problem now and probably won't have time to look into it in the near future.
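For reference, the non-finite values in the attributes above are likely what trips Icechunk's parser: by default Python's `json` module emits the non-standard tokens `Infinity` and `NaN`, which strict JSON parsers reject. A minimal demonstration:

```python
import json

# By default Python emits a non-standard token for non-finite floats,
# producing a document that strict JSON parsers refuse to read:
json.dumps({"max_value": float("inf")})  # '{"max_value": Infinity}'

# Strict mode surfaces the problem at write time instead:
try:
    json.dumps({"max_value": float("inf")}, allow_nan=False)
except ValueError as err:
    print(err)  # the "not JSON compliant" error from the top of this issue
```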