Non-kerchunk backend for GRIB files.
I did a bit of investigation into wrapping the existing kerchunk GRIB reader to create a GRIB backend, but discovered that kerchunk forces inlining of derived coordinates that are not stored at the message level. VirtualiZarr does not support reading inlined refs. This might be a good reason to kick off work on a VirtualiZarr-specific backend without a direct kerchunk dependency (which was an eventual goal).
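For reference, a rough sketch of what that difference looks like in a kerchunk v1 reference set (the chunk keys, path, offsets, and inline payload below are made up for illustration):

```python
# Hypothetical kerchunk v1 references for a GRIB file.
refs = {
    "version": 1,
    "refs": {
        # Byte-range reference: [url, offset, length] -> maps cleanly
        # onto a VirtualiZarr chunk manifest entry.
        "t2m/0.0.0": ["s3://bucket/gfs.t00z.pgrb2.0p25.f000", 3942, 181420],
        # Inlined reference: the derived coordinate values are encoded
        # directly into the refs (the "base64:" prefix marks binary data).
        # VirtualiZarr has no way to represent this in a ChunkManifest.
        "latitude/0": "base64:AAAAAAAAWUA...",
    },
}
```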
I personally have limited experience working with GRIB internals, so it would be valuable to get input here from someone with deeper experience, like @mpiannucci. A few questions:
- Should we consider https://github.com/mpiannucci/gribberish for `ChunkManifest` generation (will it see continued development/maintenance)?
- I'm leaning towards a model where we use `gribberish` as an optional dependency in `VirtualiZarr` and place the backend code in this project, rather than generating `ChunkManifest`s via `gribberish` as is currently done for kerchunk refs via `scan_gribberish`, but I don't have strong opinions on this.
- Can we assume coordinate alignment across all messages in a GRIB file and use `open_virtual_datatree`, or should we also include an `open_virtual_groups` method for problematic datasets? @mpiannucci we'd also probably need some recommendations from you about documentation we can include on the types of concat and grouping operations users would perform on the `dict` returned by `open_virtual_groups`.
ref https://github.com/zarr-developers/VirtualiZarr/issues/11
ref https://github.com/zarr-developers/VirtualiZarr/issues/238
- I am happy to support building a gribberish backend for virtualizarr. I personally rely on gribberish for 3 different production apps, so while it does not have the CF compliance of cfgrib, I am motivated to improve it tactically.
- I would make gribberish optional if you want to use it.
- The kerchunk backend for gribberish just flatmaps coordinates, which is probably not what a lot of people want (I wanted to avoid data trees).
An alternative is to build a non-kerchunk cf_grib backend, or update the kerchunk GRIB backend to grab the latitude and longitude from the codec. This is already supported, but it is not how kerchunk works by default.
The reason it works like this is that GRIB2 encodes coordinates, and in many cases the coords are generated from metadata. So if you can, doing it up front and shoving it into bytes is smart. But you can just as easily force generation of the coord data from the codec; it doesn't even matter which GRIB message you ask, they will all be able to do so, because every GRIB message includes all the metadata needed to give back the coordinates.
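As a concrete sketch of that last point (using the eccodes Python bindings and a hypothetical local file), the computed `latitudes`/`longitudes` keys will rebuild the full coordinate arrays from whichever single message you hand them:

```python
import eccodes

# Any single message will do: the grid definition section of each GRIB
# message carries enough metadata to regenerate the full coordinates.
with open("example.grib2", "rb") as f:  # hypothetical file path
    gid = eccodes.codes_grib_new_from_file(f)
    try:
        # 'latitudes'/'longitudes' are computed keys: eccodes derives the
        # arrays from the grid metadata rather than reading stored values.
        lats = eccodes.codes_get_array(gid, "latitudes")
        lons = eccodes.codes_get_array(gid, "longitudes")
    finally:
        eccodes.codes_release(gid)

print(lats.shape, lons.shape)
```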
To summarize:
- I think the best case would be to build a simple wrapper around cfgrib to start, because it is the most compliant
- I support the idea of using gribberish; I can try to help as much as my time allows, but I cannot promise anything
- Either way you choose, you can always use the Grib Codec to get the coordinates instead of always forcing them to be inline.
I hope this was helpful
Following up here, and noticing yesterday's updates - @maxrjones, is there a target date for v2.0? GRIB support here would be a big unblocker for a current project.
Hi @darothen - although we're planning to release v2.0 this week, that isn't slated to include GRIB support. However, @sharkinsspatial has been working on a GRIB parser very recently, so he might be able to give you an estimate.
@darothen This is a WIP experimental parser specifically for the HRRR hourly data: https://github.com/virtual-zarr/hrrr-parser. Given my unfamiliarity with GRIB data, we are taking the initial approach of building dataset-specific parsers for GRIB data to simplify things a bit, with the possibility of creating a more generalized one in the future.
Is there a specific GRIB dataset you are working with? We may be able to tackle a parser and codec for that as well if it aligns with some of our other work.
Hey @sharkinsspatial that's really cool - looking forward to trying that out sometime today!
The two datasets I'm most interested in are the open GFS archive on GCS at gs://global-forecast-system (prefixes gfs.YYYYMMDD/HH/atmos/gfs.tHHz.pgrb2.0p25.fXXX) and the IFS forecasts from the ECMWF open data archive on GCS at gs://ecmwf-open-data (prefixes YYYYMMDD/HHz/ifs/0p25/oper/YYYYMMDDHHmmss-X-oper-fx.grib2, where X is the forecast lead time in hours). There are other interesting products in the ECMWF open data archive (and it has a more "official" version on S3), but these are simpler for a proof case.
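For reference, here's a quick sketch of listing one GFS cycle under those prefixes (assuming anonymous access via gcsfs and a date that actually exists in the archive):

```python
import gcsfs

# Anonymous access to the public GFS archive on GCS.
fs = gcsfs.GCSFileSystem(token="anon")

# Hypothetical date/cycle; substitute a recent one.
date, cycle = "20240101", "00"
prefix = f"global-forecast-system/gfs.{date}/{cycle}/atmos/"

# Keep only the 0.25-degree GRIB2 files (one per forecast lead time),
# skipping the .idx sidecar files that sit alongside them.
paths = [p for p in fs.ls(prefix)
         if f"gfs.t{cycle}z.pgrb2.0p25.f" in p and not p.endswith(".idx")]
print(paths[:3])
```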
I work with these datasets daily; if there are additional details I can provide or sample code for working with them, let me know! One of our goals is to create a lazily-referenced version of these data archives, subset with (a) the fields required to initialize a typical MLWP forecast like GraphCast or PanguWeather, and (b) the fields required to run WeatherBench.
CC @drewbo
@darothen Perfect. GFS was the next most requested GRIB dataset for virtualization, so hopefully I can carve out a bit of time to tackle that in the next few weeks. If you do test the hrrr-parser, any issues or feedback are greatly appreciated; I'm relatively unfamiliar with GRIB (and forecast data in general), so this is still quite experimental and can be tailored for more real-world use cases.
Hi all, I am following this issue with a lot of interest. You are doing very cool stuff! I took inspiration from @sharkinsspatial's dataset-specific parser and started working on a parser for the operational NWP data from MeteoSwiss. In my case, I opted for earthkit-data to scan the GRIB messages and extract the metadata. What I find quite neat is that we can control how this is done in a declarative way using xarray profiles, which determine how the xarray object is constructed (in other words, how GRIB metadata is translated to coords, dims, and attrs). I believe it might be an option to consider for a parser, especially for

> creating a more generalized one in the future
Will share more later, or we can have a chat on the Slack space (I just joined) if you would like.
Very interesting @frazane ! Is EarthKit able to export the byte ranges? Do you have any code to share?
@TomNicholas here it is https://github.com/MeteoSwiss/icon-ch-vzarr. This is of course a very rough first implementation but the basic functionality is there.
> Is EarthKit able to export the byte ranges?
earthkit-data creates a `GribField` object for each message, and you can easily extract the information needed for a `ChunkEntry`. See here.
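A minimal sketch of that extraction (assuming `metadata()` forwards the eccodes `offset` and `totalLength` keys, which hold the byte offset and size of each message; the file path is hypothetical):

```python
import earthkit.data

path = "example.grib2"  # hypothetical local GRIB file
fields = earthkit.data.from_source("file", path)

entries = []
for field in fields:
    # Assumes metadata() forwards eccodes keys: 'offset' is where the
    # message starts in the file and 'totalLength' is its size in bytes,
    # which is everything a ChunkEntry needs besides the path itself.
    entries.append({
        "path": path,
        "offset": int(field.metadata("offset")),
        "length": int(field.metadata("totalLength")),
    })

print(entries[0])
```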