kerchunk icon indicating copy to clipboard operation
kerchunk copied to clipboard

scan_grib is not outputting all the pieces dimensions correctly.

Open chiaral opened this issue 2 years ago • 13 comments

I am very new at this, so I might not comprehend the exact goal of scan_grib but here we go (this is a follow up from discourse):

What's the issue: scan_grib doesn't outputs all the "parts" that I expect to see in the json files, hence when i read the file through xarray I am missing parts of it. Something is wrong in the writing of the parsing of the file that comes out of _split_file. I explored a bit, but couldn't find what the issue is.

grib_file to download: !wget https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2000/2000010100/c00/Days:1-10/acpcp_sfc_2000010100_c00.grib2

versions I am using: Screen Shot 2022-04-14 at 3 13 41 PM

import xarray as xr
import cfgrib

dsx = xr.open_dataset('acpcp_sfc_2000010100_c00.grib2', engine='cfgrib')
dscf = cfgrib.open_file('acpcp_sfc_2000010100_c00.grib2')

in both cases (which are essentially the same thing) the dimensions are recognized correctly, in particular:

dsx Screen Shot 2022-04-14 at 2 53 45 PM

dscf

Dataset(
dimensions={'step': 80, 'latitude': 721, 'longitude': 1440}, 
variables={'number': Variable(dimensions=(), data=0), 
'time': Variable(dimensions=(), data=946684800), 
'step': Variable(dimensions=('step',), data=array([  3.,   6.,   9.,  12.,  15.,  18.,  21.,  24.,  27.,  30.,  33.,
                    ...condensed..for...clarity    
       201., 204., 207., 210., 213., 216., 219., 222., 225., 228., 231.,
       234., 237., 240.])), 
'surface': Variable(dimensions=(), data=0.0),
'latitude': Variable(dimensions=('latitude',), data=array([ 90.  ,  89.75,  89.5 ,  89.25,  89.  ,  88.75,  88.5 ,  88.25,
        88.  ,  87.75,  87.5 ,  87.25,  87.  ,  86.75,  86.5 ,  86.25,
                   ...condensed..for...clarity    
       -88.  , -88.25, -88.5 , -88.75, -89.  , -89.25, -89.5 , -89.75,
       -90.  ])), 
'longitude': Variable(dimensions=('longitude',), data=array([0.0000e+00, 2.5000e-01, 5.0000e-01, ..., 3.5925e+02, 3.5950e+02,
       3.5975e+02])), 'valid_time': Variable(dimensions=('step',), data=array([9.466956e+08, 9.467064e+08, 9.467172e+08, 9.467280e+08,
                  ...condensed..for...clarity    
       9.475164e+08, 9.475272e+08, 9.475380e+08, 9.475488e+08])), 
'acpcp': Variable(dimensions=('step', 'latitude', 'longitude'), 
data=OnDiskArray(index=FileIndex(fieldset=FileStream(path='acpcp_sfc_2000010100_c00.grib2', errors='warn'), index_keys=['centre', 'centreDescription', 'dataDate', 'dataTime', 'dataType', 'directionNumber', 'edition', 'endStep', 'frequencyNumber', 'gridType', 'level:float', 'number', 'numberOfPoints', 'paramId', 'step', 'stepType', 'stepUnits', 'subCentre', 'time', 'typeOfLevel'], filter_by_keys={'paramId': 3063}, computed_keys={}, index_protocol_version='2'), shape=(80, 721, 1440), missing_value=9999))}, attributes={'GRIB_edition': 2, 'GRIB_centre': 'kwbc', 'GRIB_centreDescription': 'US National Weather Service - NCEP', 'GRIB_subCentre': 2, 'Conventions': 'CF-1.7', 'institution': 'US National Weather Service - NCEP', 'history': '2022-04-14T18:50 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {"source": "acpcp_sfc_2000010100_c00.grib2", "filter_by_keys": {}, "encode_cf": ["parameter", "time", "geography", "vertical"]}'}, encoding={'source': 'acpcp_sfc_2000010100_c00.grib2', 'filter_by_keys': {}, 'encode_cf': ('parameter', 'time', 'geography', 'vertical')})

Note: step is a dimension, just like latitude and longitude and time. time is particular, has no dimension and one data, step, just like latitude and longitude, has values in all the fields, again, in particular, i see no difference between:

'step': Variable(dimensions=('step',), data=array([  3.,   6.,   9.,  12.,  15.,  18.,  21.,  24.,  27.,  30.,  33.,
        ...condensed..for...clarity
       201., 204., 207., 210., 213., 216., 219., 222., 225., 228., 231.,
       234., 237., 240.])), 

and

'latitude': Variable(dimensions=('latitude',), data=array([ 90.  ,  89.75,  89.5 ,  89.25,  89.  ,  88.75,  88.5 ,  88.25,
        88.  ,  87.75,  87.5 ,  87.25,  87.  ,  86.75,  86.5 ,  86.25,
            ...condensed..for...clarity    
       -88.  , -88.25, -88.5 , -88.75, -89.  , -89.25, -89.5 , -89.75,
       -90.  ])), 

however, when I do:

!pip install kerchunk
import warnings
warnings.simplefilter("ignore")  # filter some warning messages
import xarray as xr
import hvplot.xarray
import datetime as dt
import pandas as pd
import dask
import panel as pn
import json
import fsspec
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr

and

so = {"anon": True, "default_cache_type": "readahead"}
outscan = scan_grib('acpcp_sfc_2000010100_c00.grib2',
                common_vars = ['time', 'step', 'latitude', 'longitude', 'valid_time'], storage_options=so )

it takes a second, and outscan looks like:

{'version': 1,
 'refs': {'.zgroup': '{\n    "zarr_format": 2\n}',
  '.zattrs': '{\n    "Conventions": "CF-1.7",\n    "GRIB_centre": "kwbc",\n    "GRIB_centreDescription": "US National Weather Service - NCEP",\n    "GRIB_edition": 2,\n    "GRIB_subCentre": 2,\n    "history": "2022-04-14T19:03 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {\\"source\\": \\"../../tmp/tmpq40ghn6agrib2\\", \\"filter_by_keys\\": {}, \\"encode_cf\\": [\\"parameter\\", \\"time\\", \\"geography\\", \\"vertical\\"]}",\n    "institution": "US National Weather Service - NCEP"\n}',
  'time/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<i8",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'time/0': 'base64:gENtOAAAAAA=',
  'time/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "calendar": "proleptic_gregorian",\n    "long_name": "initial time of forecast",\n    "standard_name": "forecast_reference_time",\n    "units": "seconds since 1970-01-01T00:00:00"\n}',
  'step/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  'step/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "long_name": "time since forecast_reference_time",\n    "standard_name": "forecast_period",\n    "units": "hours"\n}',
  'latitude/.zarray': '{\n    "chunks": [\n        721\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "latitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721\n    ],\n    "zarr_format": 2\n}',
  'latitude/0': ['{{u}}', 0, 325317],
  'latitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "latitude"\n    ],\n    "long_name": "latitude",\n    "standard_name": "latitude",\n    "stored_direction": "decreasing",\n    "units": "degrees_north"\n}',
  'longitude/.zarray': '{\n    "chunks": [\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "longitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        1440\n    ],\n    "zarr_format": 2\n}',
  'longitude/0': ['{{u}}', 0, 325317],
  'longitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "longitude"\n    ],\n    "long_name": "longitude",\n    "standard_name": "longitude",\n    "units": "degrees_east"\n}',
  'valid_time/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'valid_time/0': 'base64:AAAA2LY2zEE=',
  'valid_time/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "calendar": "proleptic_gregorian",\n    "long_name": "time",\n    "standard_name": "time",\n    "units": "seconds since 1970-01-01T00:00:00"\n}',
  'acpcp/.zarray': '{\n    "chunks": [\n        721,\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f4",\n    "fill_value": 9999.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "acpcp"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721,\n        1440\n    ],\n    "zarr_format": 2\n}',
  'acpcp/0.0': ['{{u}}', 29740164, 413329],
  'acpcp/.zattrs': '{\n    "GRIB_NV": 0,\n    "GRIB_Nx": 1440,\n    "GRIB_Ny": 721,\n    "GRIB_cfName": "lwe_thickness_of_convective_precipitation_amount",\n    "GRIB_cfVarName": "acpcp",\n    "GRIB_dataType": "cf",\n    "GRIB_gridDefinitionDescription": "Latitude/longitude. Also called equidistant cylindrical, or Plate Carree",\n    "GRIB_gridType": "regular_ll",\n    "GRIB_iDirectionIncrementInDegrees": 0.25,\n    "GRIB_iScansNegatively": 0,\n    "GRIB_jDirectionIncrementInDegrees": 0.25,\n    "GRIB_jPointsAreConsecutive": 0,\n    "GRIB_jScansPositively": 0,\n    "GRIB_latitudeOfFirstGridPointInDegrees": 90.0,\n    "GRIB_latitudeOfLastGridPointInDegrees": -90.0,\n    "GRIB_longitudeOfFirstGridPointInDegrees": 0.0,\n    "GRIB_longitudeOfLastGridPointInDegrees": 359.75,\n    "GRIB_missingValue": 9999,\n    "GRIB_name": "Convective precipitation (water)",\n    "GRIB_numberOfPoints": 1038240,\n    "GRIB_paramId": 3063,\n    "GRIB_shortName": "acpcp",\n    "GRIB_stepType": "accum",\n    "GRIB_stepUnits": 1,\n    "GRIB_totalNumber": 10,\n    "GRIB_typeOfLevel": "surface",\n    "GRIB_units": "kg m**-2",\n    "_ARRAY_DIMENSIONS": [\n        "latitude",\n        "longitude"\n    ],\n    "coordinates": "number time step surface latitude longitude valid_time",\n    "long_name": "Convective precipitation (water)",\n    "standard_name": "lwe_thickness_of_convective_precipitation_amount",\n    "units": "kg m**-2"\n}'},
 'templates': {'u': 'acpcp_sfc_2000010100_c00.grib2'}}

Note the part regarding step:

 'step/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  'step/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "long_name": "time since forecast_reference_time",\n    "standard_name": "forecast_period",\n    "units": "hours"\n}',

what you notice is that these are all empty:

  • "chunks": [],
  • "shape": [],
  • "filters": null,\n
  • 'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  • "_ARRAY_DIMENSIONS": []

differently, for latitude or longitude:

 'latitude/.zarray': '{\n    "chunks": [\n        721\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "latitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721\n    ],\n    "zarr_format": 2\n}',
  'latitude/0': ['{{u}}', 0, 325317],
  'latitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "latitude"\n    ],\n    "long_name": "latitude",\n    "standard_name": 

you have:

  • "chunks": [\n 721\n ],\n
  • "filters": [\n {\n "id": "grib",\n "var": "latitude"\n }\n ],\n
  • "shape": [\n 721\n ],
  • 'latitude/0': ['{{u}}', 0, 325317],
  • "_ARRAY_DIMENSIONS": [\n "latitude"\n ],\n

Ok - so I went and looked into grib2.py and I isolated the part where the reading of the variables defined as common_vars happens (link):

if common is False:
    # done for first valid message
    logger.debug("Common variables")
    z.attrs.update(ds.attributes)
    for var in common_vars:
        # assume grid, etc is the same across all messages
        attr = ds.variables[var].attributes or {}
        attr['_ARRAY_DIMENSIONS'] = ds.variables[var].dimensions
        _store_array(store, z, ds.variables[var].data, var, inline_threashold, offset, size,
                     attr)

If I simply do as below (for simplicity I limit common_vars to two, step and latitude):

dscf = cfgrib.open_file('acpcp_sfc_2000010100_c00.grib2')
for var in ['step', 'latitude', ]:
                        # assume grid, etc is the same across all messages
    print(var)
    print(dscf.variables[var].attributes)
    print(dscf.variables[var].dimensions)
    print(dscf.variables[var].data)

I get:

step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
('step',)
[  3.   6.   9.  12.  15.  18.  21.  24.  27.  30.  33.  36.  39.  42.
        ...condensed..for...clarity
 213. 216. 219. 222. 225. 228. 231. 234. 237. 240.]
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -87.5  -87.75 -88.   -88.25 -88.5  -88.75 -89.   -89.25 -89.5  -89.75
 -90.  ]

So the problem is NOT the grib2 file (which often times can be). If I copy and paste from grib2.py the relevant code to run _split_file

import tempfile
def _split_file(f, skip=0):
    if hasattr(f, "size"):
        size = f.size
    else:
        f.seek(0, 2)
        size = f.tell()
        f.seek(0)
    part = 0

    while f.tell() < size:
        # logger.debug(f"extract part {part + 1}")
        start = f.tell()
        f.seek(12, 1)
        part_size = int.from_bytes(f.read(4), "big")
        f.seek(start)
        data = f.read(part_size)
        assert data[:4] == b"GRIB"
        assert data[-4:] == b"7777"
        fn = tempfile.mktemp(suffix="grib2")
        with open(fn, "wb") as fo:
            fo.write(data)
        yield fn, start, part_size
        part += 1
        if skip and part > skip:
            break

and run:

import cfgrib
url = 'acpcp_sfc_2000010100_c00.grib2'
with fsspec.open(url, "rb") as f:
    for fn, offset, size in _split_file(f, skip=0):
        dscf = cfgrib.open_file(fn)
        for var in ['step', 'latitude', ]:
                        # assume grid, etc is the same across all messages
            print(var)
            print(dscf.variables[var].attributes)
            print(dscf.variables[var].dimensions)
            print(dscf.variables[var].data)

what I get is the same as in outscan above BUT I get one print out for each instance of step:

step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
3.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -87.5  -87.75 -88.   -88.25 -88.5  -88.75 -89.   -89.25 -89.5  -89.75
 -90.  ]
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
6.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -90.  ]
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
9.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -87.5  -87.75 -88.   -88.25 -88.5  -88.75 -89.   -89.25 -89.5  -89.75
 -90.  ]

however, once again, when I ran scan_grib (let's use only step and latitude as common vars):

so = {"anon": True, "default_cache_type": "readahead"}
outscansl = scan_grib('acpcp_sfc_2000010100_c00.grib2',
                common_vars = [ 'step', 'latitude', ], storage_options=so )

it will take a while (almost as it is indeed going through all the 80 steps of step) but the output has only the first one:

outscansl

{'version': 1,
 'refs': {'.zgroup': '{\n    "zarr_format": 2\n}',
  '.zattrs': '{\n    "Conventions": "CF-1.7",\n    "GRIB_centre": "kwbc",\n    "GRIB_centreDescription": "US National Weather Service - NCEP",\n    "GRIB_edition": 2,\n    "GRIB_subCentre": 2,\n    "history": "2022-04-14T19:23 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {\\"source\\": \\"../../tmp/tmpyycaxsp5grib2\\", \\"filter_by_keys\\": {}, \\"encode_cf\\": [\\"parameter\\", \\"time\\", \\"geography\\", \\"vertical\\"]}",\n    "institution": "US National Weather Service - NCEP"\n}',

  'step/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  'step/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "long_name": "time since forecast_reference_time",\n    "standard_name": "forecast_period",\n    "units": "hours"\n}',

  'latitude/.zarray': '{\n    "chunks": [\n        721\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "latitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721\n    ],\n    "zarr_format": 2\n}',
  'latitude/0': ['{{u}}', 0, 325317],
  'latitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "latitude"\n    ],\n    "long_name": "latitude",\n    "standard_name": "latitude",\n    "stored_direction": "decreasing",\n    "units": "degrees_north"\n}',
  'longitude/.zarray': '{\n    "chunks": [\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "longitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        1440\n    ],\n    "zarr_format": 2\n}',

  'longitude/0': ['{{u}}', 29740164, 413329],
  'longitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "longitude"\n    ],\n    "long_name": "longitude",\n    "standard_name": "longitude",\n    "units": "degrees_east"\n}',
  'acpcp/.zarray': '{\n    "chunks": [\n        721,\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f4",\n    "fill_value": 9999.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "acpcp"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721,\n        1440\n    ],\n    "zarr_format": 2\n}',
  'acpcp/0.0': ['{{u}}', 29740164, 413329],
  'acpcp/.zattrs': '{\n    "GRIB_NV": 0,\n    "GRIB_Nx": 1440,\n    "GRIB_Ny": 721,\n    "GRIB_cfName": "lwe_thickness_of_convective_precipitation_amount",\n    "GRIB_cfVarName": "acpcp",\n    "GRIB_dataType": "cf",\n    "GRIB_gridDefinitionDescription": "Latitude/longitude. Also called equidistant cylindrical, or Plate Carree",\n    "GRIB_gridType": "regular_ll",\n    "GRIB_iDirectionIncrementInDegrees": 0.25,\n    "GRIB_iScansNegatively": 0,\n    "GRIB_jDirectionIncrementInDegrees": 0.25,\n    "GRIB_jPointsAreConsecutive": 0,\n    "GRIB_jScansPositively": 0,\n    "GRIB_latitudeOfFirstGridPointInDegrees": 90.0,\n    "GRIB_latitudeOfLastGridPointInDegrees": -90.0,\n    "GRIB_longitudeOfFirstGridPointInDegrees": 0.0,\n    "GRIB_longitudeOfLastGridPointInDegrees": 359.75,\n    "GRIB_missingValue": 9999,\n    "GRIB_name": "Convective precipitation (water)",\n    "GRIB_numberOfPoints": 1038240,\n    "GRIB_paramId": 3063,\n    "GRIB_shortName": "acpcp",\n    "GRIB_stepType": "accum",\n    "GRIB_stepUnits": 1,\n    "GRIB_totalNumber": 10,\n    "GRIB_typeOfLevel": "surface",\n    "GRIB_units": "kg m**-2",\n    "_ARRAY_DIMENSIONS": [\n        "latitude",\n        "longitude"\n    ],\n    "coordinates": "number time step surface latitude longitude valid_time",\n    "long_name": "Convective precipitation (water)",\n    "standard_name": "lwe_thickness_of_convective_precipitation_amount",\n    "units": "kg m**-2"\n}'},
 'templates': {'u': 'acpcp_sfc_2000010100_c00.grib2'}}

so I thought that the issue was _store_array that was overwriting things, but if I copy and paste the whole grib2.py file and add some print statemetns, it seems like _store_array prints only the first step value (which makes sense since it's what i have in the outscan variable) although it is within the loop of for var in common_vars: 🤷🏼‍♀️

Bottom line, I have no idea where the issue is. Clearly, the _split_file decides to generate one "message" per step (why? why not latitude? the order of common_vars doesn't matter I checked), but then it looses those parts somehow.

Sorry for the long issue, but I wanted to be as detailed as possible. Thanks for your time, Chiara

chiaral avatar Apr 14 '22 19:04 chiaral