kerchunk
kerchunk copied to clipboard
scan_grib is not outputting all the pieces dimensions correctly.
I am very new at this, so I might not comprehend the exact goal of scan_grib
but here we go (this is a follow up from discourse):
What's the issue: scan_grib
doesn't outputs all the "parts" that I expect to see in the json files, hence when i read the file through xarray I am missing parts of it. Something is wrong in the writing of the parsing of the file that comes out of _split_file
. I explored a bit, but couldn't find what the issue is.
grib_file to download:
!wget https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2000/2000010100/c00/Days:1-10/acpcp_sfc_2000010100_c00.grib2
versions I am using:
import xarray as xr
import cfgrib
dsx = xr.open_dataset('acpcp_sfc_2000010100_c00.grib2', engine='cfgrib')
dscf = cfgrib.open_file('acpcp_sfc_2000010100_c00.grib2')
in both cases (which are essentially the same thing) the dimensions are recognized correctly, in particular:
dsx
dscf
Dataset(
dimensions={'step': 80, 'latitude': 721, 'longitude': 1440},
variables={'number': Variable(dimensions=(), data=0),
'time': Variable(dimensions=(), data=946684800),
'step': Variable(dimensions=('step',), data=array([ 3., 6., 9., 12., 15., 18., 21., 24., 27., 30., 33.,
...condensed..for...clarity
201., 204., 207., 210., 213., 216., 219., 222., 225., 228., 231.,
234., 237., 240.])),
'surface': Variable(dimensions=(), data=0.0),
'latitude': Variable(dimensions=('latitude',), data=array([ 90. , 89.75, 89.5 , 89.25, 89. , 88.75, 88.5 , 88.25,
88. , 87.75, 87.5 , 87.25, 87. , 86.75, 86.5 , 86.25,
...condensed..for...clarity
-88. , -88.25, -88.5 , -88.75, -89. , -89.25, -89.5 , -89.75,
-90. ])),
'longitude': Variable(dimensions=('longitude',), data=array([0.0000e+00, 2.5000e-01, 5.0000e-01, ..., 3.5925e+02, 3.5950e+02,
3.5975e+02])), 'valid_time': Variable(dimensions=('step',), data=array([9.466956e+08, 9.467064e+08, 9.467172e+08, 9.467280e+08,
...condensed..for...clarity
9.475164e+08, 9.475272e+08, 9.475380e+08, 9.475488e+08])),
'acpcp': Variable(dimensions=('step', 'latitude', 'longitude'),
data=OnDiskArray(index=FileIndex(fieldset=FileStream(path='acpcp_sfc_2000010100_c00.grib2', errors='warn'), index_keys=['centre', 'centreDescription', 'dataDate', 'dataTime', 'dataType', 'directionNumber', 'edition', 'endStep', 'frequencyNumber', 'gridType', 'level:float', 'number', 'numberOfPoints', 'paramId', 'step', 'stepType', 'stepUnits', 'subCentre', 'time', 'typeOfLevel'], filter_by_keys={'paramId': 3063}, computed_keys={}, index_protocol_version='2'), shape=(80, 721, 1440), missing_value=9999))}, attributes={'GRIB_edition': 2, 'GRIB_centre': 'kwbc', 'GRIB_centreDescription': 'US National Weather Service - NCEP', 'GRIB_subCentre': 2, 'Conventions': 'CF-1.7', 'institution': 'US National Weather Service - NCEP', 'history': '2022-04-14T18:50 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {"source": "acpcp_sfc_2000010100_c00.grib2", "filter_by_keys": {}, "encode_cf": ["parameter", "time", "geography", "vertical"]}'}, encoding={'source': 'acpcp_sfc_2000010100_c00.grib2', 'filter_by_keys': {}, 'encode_cf': ('parameter', 'time', 'geography', 'vertical')})
Note: step is a dimension, just like latitude and longitude and time. time is particular, has no dimension
and one data
, step, just like latitude and longitude, has values in all the fields, again, in particular, i see no difference between:
'step': Variable(dimensions=('step',), data=array([ 3., 6., 9., 12., 15., 18., 21., 24., 27., 30., 33.,
...condensed..for...clarity
201., 204., 207., 210., 213., 216., 219., 222., 225., 228., 231.,
234., 237., 240.])),
and
'latitude': Variable(dimensions=('latitude',), data=array([ 90. , 89.75, 89.5 , 89.25, 89. , 88.75, 88.5 , 88.25,
88. , 87.75, 87.5 , 87.25, 87. , 86.75, 86.5 , 86.25,
...condensed..for...clarity
-88. , -88.25, -88.5 , -88.75, -89. , -89.25, -89.5 , -89.75,
-90. ])),
however, when I do:
!pip install kerchunk
import warnings
warnings.simplefilter("ignore") # filter some warning messages
import xarray as xr
import hvplot.xarray
import datetime as dt
import pandas as pd
import dask
import panel as pn
import json
import fsspec
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr
and
so = {"anon": True, "default_cache_type": "readahead"}
outscan = scan_grib('acpcp_sfc_2000010100_c00.grib2',
common_vars = ['time', 'step', 'latitude', 'longitude', 'valid_time'], storage_options=so )
it takes a second, and outscan
looks like:
{'version': 1,
'refs': {'.zgroup': '{\n "zarr_format": 2\n}',
'.zattrs': '{\n "Conventions": "CF-1.7",\n "GRIB_centre": "kwbc",\n "GRIB_centreDescription": "US National Weather Service - NCEP",\n "GRIB_edition": 2,\n "GRIB_subCentre": 2,\n "history": "2022-04-14T19:03 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {\\"source\\": \\"../../tmp/tmpq40ghn6agrib2\\", \\"filter_by_keys\\": {}, \\"encode_cf\\": [\\"parameter\\", \\"time\\", \\"geography\\", \\"vertical\\"]}",\n "institution": "US National Weather Service - NCEP"\n}',
'time/.zarray': '{\n "chunks": [],\n "compressor": null,\n "dtype": "<i8",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [],\n "zarr_format": 2\n}',
'time/0': 'base64:gENtOAAAAAA=',
'time/.zattrs': '{\n "_ARRAY_DIMENSIONS": [],\n "calendar": "proleptic_gregorian",\n "long_name": "initial time of forecast",\n "standard_name": "forecast_reference_time",\n "units": "seconds since 1970-01-01T00:00:00"\n}',
'step/.zarray': '{\n "chunks": [],\n "compressor": null,\n "dtype": "<f8",\n "fill_value": 0.0,\n "filters": null,\n "order": "C",\n "shape": [],\n "zarr_format": 2\n}',
'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
'step/.zattrs': '{\n "_ARRAY_DIMENSIONS": [],\n "long_name": "time since forecast_reference_time",\n "standard_name": "forecast_period",\n "units": "hours"\n}',
'latitude/.zarray': '{\n "chunks": [\n 721\n ],\n "compressor": null,\n "dtype": "<f8",\n "fill_value": 0.0,\n "filters": [\n {\n "id": "grib",\n "var": "latitude"\n }\n ],\n "order": "C",\n "shape": [\n 721\n ],\n "zarr_format": 2\n}',
'latitude/0': ['{{u}}', 0, 325317],
'latitude/.zattrs': '{\n "_ARRAY_DIMENSIONS": [\n "latitude"\n ],\n "long_name": "latitude",\n "standard_name": "latitude",\n "stored_direction": "decreasing",\n "units": "degrees_north"\n}',
'longitude/.zarray': '{\n "chunks": [\n 1440\n ],\n "compressor": null,\n "dtype": "<f8",\n "fill_value": 0.0,\n "filters": [\n {\n "id": "grib",\n "var": "longitude"\n }\n ],\n "order": "C",\n "shape": [\n 1440\n ],\n "zarr_format": 2\n}',
'longitude/0': ['{{u}}', 0, 325317],
'longitude/.zattrs': '{\n "_ARRAY_DIMENSIONS": [\n "longitude"\n ],\n "long_name": "longitude",\n "standard_name": "longitude",\n "units": "degrees_east"\n}',
'valid_time/.zarray': '{\n "chunks": [],\n "compressor": null,\n "dtype": "<f8",\n "fill_value": 0.0,\n "filters": null,\n "order": "C",\n "shape": [],\n "zarr_format": 2\n}',
'valid_time/0': 'base64:AAAA2LY2zEE=',
'valid_time/.zattrs': '{\n "_ARRAY_DIMENSIONS": [],\n "calendar": "proleptic_gregorian",\n "long_name": "time",\n "standard_name": "time",\n "units": "seconds since 1970-01-01T00:00:00"\n}',
'acpcp/.zarray': '{\n "chunks": [\n 721,\n 1440\n ],\n "compressor": null,\n "dtype": "<f4",\n "fill_value": 9999.0,\n "filters": [\n {\n "id": "grib",\n "var": "acpcp"\n }\n ],\n "order": "C",\n "shape": [\n 721,\n 1440\n ],\n "zarr_format": 2\n}',
'acpcp/0.0': ['{{u}}', 29740164, 413329],
'acpcp/.zattrs': '{\n "GRIB_NV": 0,\n "GRIB_Nx": 1440,\n "GRIB_Ny": 721,\n "GRIB_cfName": "lwe_thickness_of_convective_precipitation_amount",\n "GRIB_cfVarName": "acpcp",\n "GRIB_dataType": "cf",\n "GRIB_gridDefinitionDescription": "Latitude/longitude. Also called equidistant cylindrical, or Plate Carree",\n "GRIB_gridType": "regular_ll",\n "GRIB_iDirectionIncrementInDegrees": 0.25,\n "GRIB_iScansNegatively": 0,\n "GRIB_jDirectionIncrementInDegrees": 0.25,\n "GRIB_jPointsAreConsecutive": 0,\n "GRIB_jScansPositively": 0,\n "GRIB_latitudeOfFirstGridPointInDegrees": 90.0,\n "GRIB_latitudeOfLastGridPointInDegrees": -90.0,\n "GRIB_longitudeOfFirstGridPointInDegrees": 0.0,\n "GRIB_longitudeOfLastGridPointInDegrees": 359.75,\n "GRIB_missingValue": 9999,\n "GRIB_name": "Convective precipitation (water)",\n "GRIB_numberOfPoints": 1038240,\n "GRIB_paramId": 3063,\n "GRIB_shortName": "acpcp",\n "GRIB_stepType": "accum",\n "GRIB_stepUnits": 1,\n "GRIB_totalNumber": 10,\n "GRIB_typeOfLevel": "surface",\n "GRIB_units": "kg m**-2",\n "_ARRAY_DIMENSIONS": [\n "latitude",\n "longitude"\n ],\n "coordinates": "number time step surface latitude longitude valid_time",\n "long_name": "Convective precipitation (water)",\n "standard_name": "lwe_thickness_of_convective_precipitation_amount",\n "units": "kg m**-2"\n}'},
'templates': {'u': 'acpcp_sfc_2000010100_c00.grib2'}}
Note the part regarding step
:
'step/.zarray': '{\n "chunks": [],\n "compressor": null,\n "dtype": "<f8",\n "fill_value": 0.0,\n "filters": null,\n "order": "C",\n "shape": [],\n "zarr_format": 2\n}',
'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
'step/.zattrs': '{\n "_ARRAY_DIMENSIONS": [],\n "long_name": "time since forecast_reference_time",\n "standard_name": "forecast_period",\n "units": "hours"\n}',
what you notice is that these are all empty:
- "chunks": [],
- "shape": [],
- "filters": null,\n
- 'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
- "_ARRAY_DIMENSIONS": []
differently, for latitude or longitude:
'latitude/.zarray': '{\n "chunks": [\n 721\n ],\n "compressor": null,\n "dtype": "<f8",\n "fill_value": 0.0,\n "filters": [\n {\n "id": "grib",\n "var": "latitude"\n }\n ],\n "order": "C",\n "shape": [\n 721\n ],\n "zarr_format": 2\n}',
'latitude/0': ['{{u}}', 0, 325317],
'latitude/.zattrs': '{\n "_ARRAY_DIMENSIONS": [\n "latitude"\n ],\n "long_name": "latitude",\n "standard_name":
you have:
- "chunks": [\n 721\n ],\n
- "filters": [\n {\n "id": "grib",\n "var": "latitude"\n }\n ],\n
- "shape": [\n 721\n ],
- 'latitude/0': ['{{u}}', 0, 325317],
- "_ARRAY_DIMENSIONS": [\n "latitude"\n ],\n
Ok - so I went and looked into grib2.py
and I isolated the part where the reading of the variables defined as common_vars
happens (link):
if common is False:
# done for first valid message
logger.debug("Common variables")
z.attrs.update(ds.attributes)
for var in common_vars:
# assume grid, etc is the same across all messages
attr = ds.variables[var].attributes or {}
attr['_ARRAY_DIMENSIONS'] = ds.variables[var].dimensions
_store_array(store, z, ds.variables[var].data, var, inline_threashold, offset, size,
attr)
If I simply do as below (for simplicity I limit common_vars
to two, step and latitude):
dscf = cfgrib.open_file('acpcp_sfc_2000010100_c00.grib2')
for var in ['step', 'latitude', ]:
# assume grid, etc is the same across all messages
print(var)
print(dscf.variables[var].attributes)
print(dscf.variables[var].dimensions)
print(dscf.variables[var].data)
I get:
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
('step',)
[ 3. 6. 9. 12. 15. 18. 21. 24. 27. 30. 33. 36. 39. 42.
...condensed..for...clarity
213. 216. 219. 222. 225. 228. 231. 234. 237. 240.]
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90. 89.75 89.5 89.25 89. 88.75 88.5 88.25 88. 87.75
...condensed..for...clarity
-87.5 -87.75 -88. -88.25 -88.5 -88.75 -89. -89.25 -89.5 -89.75
-90. ]
So the problem is NOT the grib2 file (which often times can be).
If I copy and paste from grib2.py
the relevant code to run _split_file
import tempfile
def _split_file(f, skip=0):
if hasattr(f, "size"):
size = f.size
else:
f.seek(0, 2)
size = f.tell()
f.seek(0)
part = 0
while f.tell() < size:
# logger.debug(f"extract part {part + 1}")
start = f.tell()
f.seek(12, 1)
part_size = int.from_bytes(f.read(4), "big")
f.seek(start)
data = f.read(part_size)
assert data[:4] == b"GRIB"
assert data[-4:] == b"7777"
fn = tempfile.mktemp(suffix="grib2")
with open(fn, "wb") as fo:
fo.write(data)
yield fn, start, part_size
part += 1
if skip and part > skip:
break
and run:
import cfgrib
url = 'acpcp_sfc_2000010100_c00.grib2'
with fsspec.open(url, "rb") as f:
for fn, offset, size in _split_file(f, skip=0):
dscf = cfgrib.open_file(fn)
for var in ['step', 'latitude', ]:
# assume grid, etc is the same across all messages
print(var)
print(dscf.variables[var].attributes)
print(dscf.variables[var].dimensions)
print(dscf.variables[var].data)
what I get is the same as in outscan
above BUT I get one print out for each instance of step
:
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
3.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90. 89.75 89.5 89.25 89. 88.75 88.5 88.25 88. 87.75
...condensed..for...clarity
-87.5 -87.75 -88. -88.25 -88.5 -88.75 -89. -89.25 -89.5 -89.75
-90. ]
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
6.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90. 89.75 89.5 89.25 89. 88.75 88.5 88.25 88. 87.75
...condensed..for...clarity
-90. ]
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
9.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90. 89.75 89.5 89.25 89. 88.75 88.5 88.25 88. 87.75
...condensed..for...clarity
-87.5 -87.75 -88. -88.25 -88.5 -88.75 -89. -89.25 -89.5 -89.75
-90. ]
however, once again, when I ran scan_grib
(let's use only step and latitude as common vars):
so = {"anon": True, "default_cache_type": "readahead"}
outscansl = scan_grib('acpcp_sfc_2000010100_c00.grib2',
common_vars = [ 'step', 'latitude', ], storage_options=so )
it will take a while (almost as it is indeed going through all the 80 steps of step
) but the output has only the first one:
outscansl
{'version': 1,
'refs': {'.zgroup': '{\n "zarr_format": 2\n}',
'.zattrs': '{\n "Conventions": "CF-1.7",\n "GRIB_centre": "kwbc",\n "GRIB_centreDescription": "US National Weather Service - NCEP",\n "GRIB_edition": 2,\n "GRIB_subCentre": 2,\n "history": "2022-04-14T19:23 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {\\"source\\": \\"../../tmp/tmpyycaxsp5grib2\\", \\"filter_by_keys\\": {}, \\"encode_cf\\": [\\"parameter\\", \\"time\\", \\"geography\\", \\"vertical\\"]}",\n "institution": "US National Weather Service - NCEP"\n}',
'step/.zarray': '{\n "chunks": [],\n "compressor": null,\n "dtype": "<f8",\n "fill_value": 0.0,\n "filters": null,\n "order": "C",\n "shape": [],\n "zarr_format": 2\n}',
'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
'step/.zattrs': '{\n "_ARRAY_DIMENSIONS": [],\n "long_name": "time since forecast_reference_time",\n "standard_name": "forecast_period",\n "units": "hours"\n}',
'latitude/.zarray': '{\n "chunks": [\n 721\n ],\n "compressor": null,\n "dtype": "<f8",\n "fill_value": 0.0,\n "filters": [\n {\n "id": "grib",\n "var": "latitude"\n }\n ],\n "order": "C",\n "shape": [\n 721\n ],\n "zarr_format": 2\n}',
'latitude/0': ['{{u}}', 0, 325317],
'latitude/.zattrs': '{\n "_ARRAY_DIMENSIONS": [\n "latitude"\n ],\n "long_name": "latitude",\n "standard_name": "latitude",\n "stored_direction": "decreasing",\n "units": "degrees_north"\n}',
'longitude/.zarray': '{\n "chunks": [\n 1440\n ],\n "compressor": null,\n "dtype": "<f8",\n "fill_value": 0.0,\n "filters": [\n {\n "id": "grib",\n "var": "longitude"\n }\n ],\n "order": "C",\n "shape": [\n 1440\n ],\n "zarr_format": 2\n}',
'longitude/0': ['{{u}}', 29740164, 413329],
'longitude/.zattrs': '{\n "_ARRAY_DIMENSIONS": [\n "longitude"\n ],\n "long_name": "longitude",\n "standard_name": "longitude",\n "units": "degrees_east"\n}',
'acpcp/.zarray': '{\n "chunks": [\n 721,\n 1440\n ],\n "compressor": null,\n "dtype": "<f4",\n "fill_value": 9999.0,\n "filters": [\n {\n "id": "grib",\n "var": "acpcp"\n }\n ],\n "order": "C",\n "shape": [\n 721,\n 1440\n ],\n "zarr_format": 2\n}',
'acpcp/0.0': ['{{u}}', 29740164, 413329],
'acpcp/.zattrs': '{\n "GRIB_NV": 0,\n "GRIB_Nx": 1440,\n "GRIB_Ny": 721,\n "GRIB_cfName": "lwe_thickness_of_convective_precipitation_amount",\n "GRIB_cfVarName": "acpcp",\n "GRIB_dataType": "cf",\n "GRIB_gridDefinitionDescription": "Latitude/longitude. Also called equidistant cylindrical, or Plate Carree",\n "GRIB_gridType": "regular_ll",\n "GRIB_iDirectionIncrementInDegrees": 0.25,\n "GRIB_iScansNegatively": 0,\n "GRIB_jDirectionIncrementInDegrees": 0.25,\n "GRIB_jPointsAreConsecutive": 0,\n "GRIB_jScansPositively": 0,\n "GRIB_latitudeOfFirstGridPointInDegrees": 90.0,\n "GRIB_latitudeOfLastGridPointInDegrees": -90.0,\n "GRIB_longitudeOfFirstGridPointInDegrees": 0.0,\n "GRIB_longitudeOfLastGridPointInDegrees": 359.75,\n "GRIB_missingValue": 9999,\n "GRIB_name": "Convective precipitation (water)",\n "GRIB_numberOfPoints": 1038240,\n "GRIB_paramId": 3063,\n "GRIB_shortName": "acpcp",\n "GRIB_stepType": "accum",\n "GRIB_stepUnits": 1,\n "GRIB_totalNumber": 10,\n "GRIB_typeOfLevel": "surface",\n "GRIB_units": "kg m**-2",\n "_ARRAY_DIMENSIONS": [\n "latitude",\n "longitude"\n ],\n "coordinates": "number time step surface latitude longitude valid_time",\n "long_name": "Convective precipitation (water)",\n "standard_name": "lwe_thickness_of_convective_precipitation_amount",\n "units": "kg m**-2"\n}'},
'templates': {'u': 'acpcp_sfc_2000010100_c00.grib2'}}
so I thought that the issue was _store_array
that was overwriting things, but if I copy and paste the whole grib2.py
file and add some print
statemetns, it seems like _store_array
prints only the first step
value (which makes sense since it's what i have in the outscan variable) although it is within the loop of for var in common_vars:
🤷🏼♀️
Bottom line, I have no idea where the issue is.
Clearly, the _split_file
decides to generate one "message" per step
(why? why not latitude? the order of common_vars
doesn't matter I checked), but then it looses those parts somehow.
Sorry for the long issue, but I wanted to be as detailed as possible. Thanks for your time, Chiara