kerchunk icon indicating copy to clipboard operation
kerchunk copied to clipboard

Fix for #248, #249, and #259, along with variable naming change

Open keltonhalbert opened this issue 2 years ago • 4 comments

About the Change

See #259 for discussion, and #249 for history of issue/changes being reverted or modified.

In addition to the issues with the dimension labels being wrong, this PR includes a change to variable names passed to xarray. Xarray and cfgrib tend to use the cfVarName if its available, and then falls back to the grib shortName if its unavailable or unknown. This change should mean that xarray variable names are consistent across kerchunk/cfgrib/xarray.

keltonhalbert avatar Dec 13 '22 17:12 keltonhalbert

So if this explicitly fixes the projections where lat and lon separate, do we also have a test case for where they are 2D lookup tables?

martindurant avatar Dec 13 '22 17:12 martindurant

So if this explicitly fixes the projections where lat and lon separate, do we also have a test case for where they are 2D lookup tables?

I think this should work for both. If my understanding isn't too terribly flawed, the HRRR grib2 files with the Lambert grid use 2D lookup tables, as latitude and longitude cannot be given as 1D array axes.

Edit: Would it be helpful to include a test pulling from the HRRR data on S3?

keltonhalbert avatar Dec 13 '22 17:12 keltonhalbert

Edit: Would it be helpful to include a test pulling from the HRRR data on S3?

In principle yes, and we do this for some HDF5 file. Of course, better if the target is small and so less likely to fail due to network timeouts.

martindurant avatar Dec 13 '22 21:12 martindurant

Edit: Would it be helpful to include a test pulling from the HRRR data on S3?

In principle yes, and we do this for some HDF5 file. Of course, better if the target is small and so less likely to fail due to network timeouts.

Awesome, I can work on implementing some tests for grib2. Looking at the grib test file, it looks like there isn't a whole lot going on. What are the priority checks you guys would like to see? https://github.com/fsspec/kerchunk/blob/f4159d0c5ed8f504aa7fb2972c4377afbaefbf10/kerchunk/tests/test_grib.py

For the 2D lat/lon grids (i.e. Lambert like HRRR), we can make sure their shapes are equal to the shapes of the other fields... but there are no 1D coordinates to check the lat/lon shapes against for non-LL data unless you look at the cfgrib Nx and Ny fields. Alternatively, with xarray/cfgrib, when parsing a grib file cfgrib will automatically add a coordinate X and Y dimension that is 1D, with values from 0 - Nx-1 (or Ny-1). Should we do the same within scan_grib in order to ensure the sanity of the dimensions? Or should the test be as simple as loading a file with scan_grib, and then loading a file with xarray/cfgrib, and then comparing the coordinates, coordinate lengths, and variable names?

As far as testing data, a standard HRRR grib file is ~140 MB. The tests could use scan_grib to load just a subset (one or two fields) from the dataset to do the checks on. Then perhaps do the same thing with a GFS file, and call it a day?

keltonhalbert avatar Dec 13 '22 23:12 keltonhalbert

scan_grib (a.k.a. GribToZarr) takes a skip= argument which means you can stop scanning after the given number of grib messages; so we can keep the download minimal like that. What I would do, is pick a particular real-valued point in lat/lon space in that first message using xarray via cfgrib directly (i.e., download the file!) and assert that loading the referenced version via zarr gives the same value at that lon/lat point.

martindurant avatar Dec 14 '22 14:12 martindurant

@keltonhalbert , do you think you can add a test as I suggest? I would contribute to this PR, but it is coming from your main branch, so I won't push there.

martindurant avatar Dec 22 '22 16:12 martindurant

I added tests in a new PR rather than step on this one. Please review #276

martindurant avatar Jan 02 '23 15:01 martindurant