
Inconsistent behavior between Parquet and JSON when chunks are missing

Open ashiklom opened this issue 6 months ago • 10 comments

Taking the first file from here (https://noaa-goes17.s3.amazonaws.com/index.html#ABI-L1b-RadF/2022/001/00/) as an example:

The following code:

import kerchunk.hdf
import json

fname = "OR_ABI-L1b-RadF-M6C01_G17_s20220010000320_e20220010009386_c20220010009424.nc"

h5chunks = kerchunk.hdf.SingleHdf5ToZarr(fname)
refs = h5chunks.translate()

with open("test.json", "w") as f:
    f.write(json.dumps(refs, indent=2))

Produces the following JSON output (excerpt; slightly clipped):

"..."
"Rad/.zarray": "{\"chunks\":[226,226],\"compressor\":null,\"dtype\":\"<i2\",\"fill_value\":1023, ..."
"Rad/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"y\",\"x\"],\"_Unsigned\":\"true\",\"add_offset\":-25.9"
"Rad/0.16": "base64:eAHt0DENAAAAAqDD/pk1iIwGNMWAAQMGDBgwYMCAAQMGDBgwYMCAAQMGDBgwYMCAAQMGDBgwYMCA..."
"Rad/0.17": [
  "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
  51538,
  1448
],
"Rad/0.18": [
  "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
  52986,
  4155
],
"Rad/0.19": [
  "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
  57141,
  5554
],
"Rad/0.20": [
  "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
  22412,
  7527
],
"..."

Note that the radiance chunks begin at 0.16; there is no Rad/0.0 through Rad/0.15. That's odd, and I'm assuming it is some HDF5 sparse data cleverness. In any case, xarray.open_dataset("test.json", engine="kerchunk") followed by a summary over the entire Rad array (dat.Rad.mean().values) works fine here.
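To check which chunk keys are absent without eyeballing the JSON, something like the following could work. This is only a sketch: missing_chunk_keys is a helper name I made up, the toy references are hypothetical (not taken from the GOES file), and it assumes "." as the chunk-key separator and a "shape" entry in .zarray (clipped from the excerpt above).

```python
import itertools
import json
import math


def missing_chunk_keys(refs, var):
    """Return chunk keys implied by `var`'s .zarray metadata that have no reference."""
    meta = json.loads(refs[f"{var}/.zarray"])
    # Number of chunks along each dimension (ceil division of shape by chunk size).
    grid = [math.ceil(s / c) for s, c in zip(meta["shape"], meta["chunks"])]
    expected = (
        f"{var}/" + ".".join(map(str, idx))
        for idx in itertools.product(*(range(n) for n in grid))
    )
    return [key for key in expected if key not in refs]


# Toy references (hypothetical values): a 4x6 array with 2x2 chunks,
# i.e. a 2x3 chunk grid, with the first two chunks never written.
toy_refs = {
    "Rad/.zarray": json.dumps({"shape": [4, 6], "chunks": [2, 2]}),
    "Rad/0.2": ["file.nc", 100, 10],
    "Rad/1.0": ["file.nc", 110, 10],
    "Rad/1.1": ["file.nc", 120, 10],
    "Rad/1.2": ["file.nc", 130, 10],
}

print(missing_chunk_keys(toy_refs, "Rad"))  # ['Rad/0.0', 'Rad/0.1']
```

On the real output you would pass the flat key-to-reference mapping; depending on the reference format version, that mapping may sit under a top-level "refs" entry of what translate() returns.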

However, if you write the same references out as a Parquet dataset, rows 0-15 of the resulting file contain nan paths and 0 values, and the real data start at row 16. That's fine, except that reading that Parquet file fails with an error like this (full backtrace in details):

  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/mapping.py", line 105, in getitems
    out = self.fs.cat(keys2, on_error=oe)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 836, in cat
    proto_dict = _protocol_groups(path, self.references)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 52, in _protocol_groups
    protocol = _prot_in_references(path, references)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 44, in _prot_in_references
    return split_protocol(ref[0])[0] if ref[0] else ref[0]
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/core.py", line 544, in split_protocol
    if "://" in urlpath:
       ^^^^^^^^^^^^^^^^
TypeError: argument of type 'float' is not iterable

I've traced this back to a references.get("Rad/0.0") call that returns a nan "url" that can't be parsed by subsequent code. Here are some relevant pdb traces:

> /gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/core.py(544)split_protocol()
-> if "://" in urlpath:
(Pdb) u
> /gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py(44)_prot_in_references()
-> return split_protocol(ref[0])[0] if ref[0] else ref[0]
(Pdb) ll
 41     def _prot_in_references(path, references):
 42         ref = references.get(path)
 43         if isinstance(ref, (list, tuple)):
 44  ->         return split_protocol(ref[0])[0] if ref[0] else ref[0]
(Pdb) p ref
[nan]
(Pdb) p path
'Rad/0.0.0'
(Pdb)
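
To spell out why the check on line 44 doesn't catch this: nan is truthy, so `if ref[0]` lets it through to split_protocol, which then tries `"://" in nan`. Here is a standalone sketch of that. The split_protocol below is a simplified stand-in for the fsspec function (string handling only), and prot_in_references_guarded is a hypothetical guard I wrote for illustration, not a proposed or existing fsspec API.

```python
def split_protocol(urlpath):
    """Simplified stand-in for fsspec.core.split_protocol (assumption: strings only)."""
    if "://" in urlpath:
        protocol, path = urlpath.split("://", 1)
        return protocol, path
    return None, urlpath


def prot_in_references(ref):
    """Same conditional expression as line 44 of reference.py in the trace above."""
    return split_protocol(ref[0])[0] if ref[0] else ref[0]


def prot_in_references_guarded(ref):
    """Hypothetical guard: treat any non-string or empty url (such as nan) as 'no protocol'."""
    url = ref[0]
    if not isinstance(url, str) or not url:
        return None
    return split_protocol(url)[0]


nan_ref = [float("nan")]

# nan is truthy, so the `if ref[0]` check does not skip it.
print(bool(nan_ref[0]))

try:
    prot_in_references(nan_ref)
except TypeError as exc:
    print("TypeError:", exc)

# With the guard, the nan entry is treated like a missing reference.
print(prot_in_references_guarded(nan_ref))
print(prot_in_references_guarded(["s3://bucket/key", 0, 10]))
```

Whether the right fix is a guard like this on the read side or not emitting nan rows for missing chunks in the first place is a separate question; the sketch only shows where the two paths diverge.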

Traceback (most recent call last):
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/read.py", line 6, in <module>
    print(combined_ds.Rad.mean().values)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/_aggregations.py", line 1664, in mean
    return self.reduce(
           ^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/dataarray.py", line 3826, in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/variable.py", line 1663, in reduce
    result = super().reduce(
             ^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/namedarray/core.py", line 912, in reduce
    data = func(self.data, **kwargs)
                ^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/variable.py", line 449, in data
    return self._data.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 837, in get_duck_array
    self._ensure_cached()
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 831, in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 788, in get_duck_array
    return self.array.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 837, in get_duck_array
    self._ensure_cached()
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 831, in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 788, in get_duck_array
    return self.array.get_duck_array()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 658, in get_duck_array
    array = array.get_duck_array()
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/coding/variables.py", line 81, in get_duck_array
    return self.func(self.array.get_duck_array())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 651, in get_duck_array
    array = self.array[self.key]
            ~~~~~~~~~~^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/backends/zarr.py", line 104, in __getitem__
    return indexing.explicit_indexing_adapter(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 1015, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/backends/zarr.py", line 94, in _getitem
    return self._array[key]
           ~~~~~~~~~~~^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 798, in __getitem__
    result = self.get_orthogonal_selection(pure_selection, fields=fields)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 1080, in get_orthogonal_selection
    return self._get_selection(indexer=indexer, out=out, fields=fields)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 1343, in _get_selection
    self._chunk_getitems(
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 2179, in _chunk_getitems
    cdatas = self.chunk_store.getitems(ckeys, contexts=contexts)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/storage.py", line 1426, in getitems
    results_transformed = self.map.getitems(list(keys_transformed), on_error="return")
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/mapping.py", line 105, in getitems
    out = self.fs.cat(keys2, on_error=oe)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 836, in cat
    proto_dict = _protocol_groups(path, self.references)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 52, in _protocol_groups
    protocol = _prot_in_references(path, references)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 44, in _prot_in_references
    return split_protocol(ref[0])[0] if ref[0] else ref[0]
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/core.py", line 544, in split_protocol
    if "://" in urlpath:
       ^^^^^^^^^^^^^^^^
TypeError: argument of type 'float' is not iterable

ashiklom, Aug 13 '24 20:08