Inconsistent behavior between Parquet and JSON when chunks are missing
Taking the first file from here (https://noaa-goes17.s3.amazonaws.com/index.html#ABI-L1b-RadF/2022/001/00/) as an example, the following code:
import kerchunk.hdf
import json
fname = "OR_ABI-L1b-RadF-M6C01_G17_s20220010000320_e20220010009386_c20220010009424.nc"
h5chunks = kerchunk.hdf.SingleHdf5ToZarr(fname)
refs = h5chunks.translate()
with open("test.json", "w") as f:
    f.write(json.dumps(refs, indent=2))
Produces the following JSON output (excerpt; slightly clipped):
"..."
"Rad/.zarray": "{\"chunks\":[226,226],\"compressor\":null,\"dtype\":\"<i2\",\"fill_value\":1023, ..."
"Rad/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"y\",\"x\"],\"_Unsigned\":\"true\",\"add_offset\":-25.9"
"Rad/0.16": "base64:eAHt0DENAAAAAqDD/pk1iIwGNMWAAQMGDBgwYMCAAQMGDBgwYMCAAQMGDBgwYMCAAQMGDBgwYMCA..."
"Rad/0.17": [
"/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
51538,
1448
],
"Rad/0.18": [
"/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
52986,
4155
],
"Rad/0.19": [
"/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
57141,
5554
],
"Rad/0.20": [
"/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s202..."
22412,
7527
],
"..."
Note that the radiance chunks begin at 0.16; there is no Rad/0.0 through Rad/0.15. That's weird; I'm assuming this is some HDF5 sparse data cleverness. In any case, xarray.open_dataset("test.json", engine="kerchunk") and subsequently summarizing the entire Rad array (dat.Rad.mean().values) work fine here.
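A rough illustration of why the JSON case can work despite the missing keys: Zarr treats a chunk with no key in the store as absent and fills it with the array's fill_value (1023 according to Rad/.zarray above). The sketch below uses a toy dict; `describe_chunk` and the placeholder path are hypothetical and are not part of kerchunk, fsspec, or zarr.

```python
FILL_VALUE = 1023  # taken from the "fill_value" field of Rad/.zarray above

# Toy reference set: only chunks that exist in the file get a key.
# The file path is a placeholder (assumption), not the real truncated path.
refs = {
    "Rad/0.17": ["placeholder.nc", 51538, 1448],
    "Rad/0.18": ["placeholder.nc", 52986, 4155],
}


def describe_chunk(refs, key):
    """Hypothetical helper: say how a reader would obtain one chunk."""
    ref = refs.get(key)
    if ref is None:
        # No key at all: the chunk is simply absent, so the reader
        # substitutes the fill value for every element.
        return ("fill", FILL_VALUE)
    _url, offset, length = ref
    # Key present: read `length` bytes starting at `offset`.
    return ("read", offset, length)


missing = describe_chunk(refs, "Rad/0.0")
present = describe_chunk(refs, "Rad/0.17")
print(missing, present)
```

The point is that "absent" is expressed by the key not existing, so no placeholder URL is ever handed to the URL-parsing code.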
However, if you write this out as a Parquet dataset instead, the resulting file has rows 0-15 containing nan paths and 0 values, and the real data start at row 16. That's fine, except that reading that Parquet file fails with an error like this (full backtrace in details):
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/mapping.py", line 105, in getitems
out = self.fs.cat(keys2, on_error=oe)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 836, in cat
proto_dict = _protocol_groups(path, self.references)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 52, in _protocol_groups
protocol = _prot_in_references(path, references)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 44, in _prot_in_references
return split_protocol(ref[0])[0] if ref[0] else ref[0]
^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/core.py", line 544, in split_protocol
if "://" in urlpath:
^^^^^^^^^^^^^^^^
TypeError: argument of type 'float' is not iterable
I've traced this back to a references.get("Rad/0.0") call that returns a nan "url" that can't be parsed by subsequent code. Here are some relevant pdb traces:
> /gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/core.py(544)split_protocol()
-> if "://" in urlpath:
(Pdb) u
> /gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py(44)_prot_in_references()
-> return split_protocol(ref[0])[0] if ref[0] else ref[0]
(Pdb) ll
41     def _prot_in_references(path, references):
42         ref = references.get(path)
43         if isinstance(ref, (list, tuple)):
44  ->         return split_protocol(ref[0])[0] if ref[0] else ref[0]
(Pdb) p ref
[nan]
(Pdb) p path
'Rad/0.0.0'
(Pdb)
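The failure in the pdb session above can be reproduced in isolation. A float nan is truthy in Python, so the `if ref[0] else ref[0]` guard does not catch it, and the `"://" in urlpath` test then raises a TypeError. The sketch below uses only the standard library; `split_protocol_like`, `is_missing_url`, and `protocol_or_none` are hypothetical stand-ins written for illustration, not fsspec functions, and the guard is only one possible approach rather than a proposed fix.

```python
import math


def split_protocol_like(urlpath):
    """Stand-in for the check seen in the traceback: `if "://" in urlpath`."""
    if "://" in urlpath:
        protocol, path = urlpath.split("://", 1)
        return protocol, path
    return None, urlpath


def is_missing_url(url):
    """Hypothetical helper: True for None or a float NaN placeholder."""
    return url is None or (isinstance(url, float) and math.isnan(url))


def protocol_or_none(ref):
    """Hypothetical guarded variant of the lookup shown in the pdb listing."""
    if isinstance(ref, (list, tuple)):
        url = ref[0]
        if is_missing_url(url):
            return None  # treat the row as a missing chunk
        return split_protocol_like(url)[0] if url else url
    return None


nan_ref = [float("nan")]  # same shape as `p ref` in the pdb session

# NaN is truthy, so a plain truthiness test lets it through.
nan_is_truthy = bool(nan_ref[0])

# Passing NaN to the membership test raises TypeError, as in the traceback.
try:
    split_protocol_like(nan_ref[0])
    raised = False
except TypeError:
    raised = True

print(nan_is_truthy, raised, protocol_or_none(nan_ref))
```

Whether a nan row should be treated as a missing chunk at read time, or never be written in the first place, is a design question this sketch does not answer.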
Full backtrace:
Traceback (most recent call last):
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/read.py", line 6, in <module>
print(combined_ds.Rad.mean().values)
^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/_aggregations.py", line 1664, in mean
return self.reduce(
^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/dataarray.py", line 3826, in reduce
var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/variable.py", line 1663, in reduce
result = super().reduce(
^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/namedarray/core.py", line 912, in reduce
data = func(self.data, **kwargs)
^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/variable.py", line 449, in data
return self._data.get_duck_array()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 837, in get_duck_array
self._ensure_cached()
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 831, in _ensure_cached
self.array = as_indexable(self.array.get_duck_array())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 788, in get_duck_array
return self.array.get_duck_array()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 837, in get_duck_array
self._ensure_cached()
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 831, in _ensure_cached
self.array = as_indexable(self.array.get_duck_array())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 788, in get_duck_array
return self.array.get_duck_array()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 658, in get_duck_array
array = array.get_duck_array()
^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/coding/variables.py", line 81, in get_duck_array
return self.func(self.array.get_duck_array())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 651, in get_duck_array
array = self.array[self.key]
~~~~~~~~~~^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/backends/zarr.py", line 104, in __getitem__
return indexing.explicit_indexing_adapter(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/core/indexing.py", line 1015, in explicit_indexing_adapter
result = raw_indexing_method(raw_key.tuple)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/xarray/backends/zarr.py", line 94, in _getitem
return self._array[key]
~~~~~~~~~~~^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 798, in __getitem__
result = self.get_orthogonal_selection(pure_selection, fields=fields)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 1080, in get_orthogonal_selection
return self._get_selection(indexer=indexer, out=out, fields=fields)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 1343, in _get_selection
self._chunk_getitems(
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/core.py", line 2179, in _chunk_getitems
cdatas = self.chunk_store.getitems(ckeys, contexts=contexts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/zarr/storage.py", line 1426, in getitems
results_transformed = self.map.getitems(list(keys_transformed), on_error="return")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/mapping.py", line 105, in getitems
out = self.fs.cat(keys2, on_error=oe)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 836, in cat
proto_dict = _protocol_groups(path, self.references)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 52, in _protocol_groups
protocol = _prot_in_references(path, references)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/implementations/reference.py", line 44, in _prot_in_references
return split_protocol(ref[0])[0] if ref[0] else ref[0]
^^^^^^^^^^^^^^^^^^^^^^
File "/gpfsm/dnb33/ashiklom/aist-eso/goes-virtualizarr/.pixi/envs/default/lib/python3.12/site-packages/fsspec/core.py", line 544, in split_protocol
if "://" in urlpath:
^^^^^^^^^^^^^^^^
TypeError: argument of type 'float' is not iterable