
Non-kerchunk backend for GRIB files.

Open sharkinsspatial opened this issue 1 year ago • 9 comments

I did a bit of investigation into wrapping the existing kerchunk GRIB reader to create a GRIB backend, but discovered that kerchunk forces inlining of derived coordinates that are not stored at the message level, and VirtualiZarr does not support reading inlined refs. This might be a good reason to kick off work on a VirtualiZarr-specific backend without a direct kerchunk dependency (which was an eventual goal).
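To illustrate the distinction, here is a rough sketch (made-up paths, offsets, and payload, not actual kerchunk output): kerchunk references are either byte-range triples, which VirtualiZarr can map into a ChunkManifest, or inlined data, which it currently cannot read:

```python
# Illustrative sketch of the two kinds of kerchunk-style references.
# Paths, offsets, and the base64 payload below are made up.

# A byte-range reference: [path, offset, length] pointing into the GRIB file.
# VirtualiZarr can translate these into ChunkManifest entries.
byte_range_ref = ["s3://bucket/forecast.grib2", 109_325, 421_094]

# An inlined reference: the (often base64-encoded) chunk bytes stored directly
# in the references, as kerchunk does for derived GRIB coordinates.
inlined_ref = "base64:AAAAgD8AAABA"

def is_inlined(ref) -> bool:
    """Inlined refs are strings of data; byte-range refs are [path, offset, length]."""
    return isinstance(ref, str)

assert is_inlined(inlined_ref)
assert not is_inlined(byte_range_ref)
```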

I personally have limited experience working with GRIB internals, so it would be valuable to get input here from someone with deeper experience like @mpiannucci. A few questions:

  1. Should we consider https://github.com/mpiannucci/gribberish for ChunkManifest generation (will it see continued development/maintenance)?
  2. I'm leaning towards a model where we use gribberish as an optional dependency in VirtualiZarr and place the backend code in this project, rather than generating ChunkManifests via gribberish as is currently done for kerchunk refs via scan_gribberish, but I don't have strong opinions on this.
  3. Can we assume coordinate alignment across all messages in a GRIB file and use open_virtual_datatree, or should we also include an open_virtual_groups method for problematic datasets? @mpiannucci, we'd also probably need some recommendations from you about documentation we can include on the types of concat and grouping operations users would perform on the dict returned by open_virtual_groups.
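As a rough sketch of what ChunkManifest generation from a GRIB scan could look like (plain dicts with hypothetical offsets; the real VirtualiZarr and gribberish APIs may differ):

```python
# Hypothetical output of a GRIB scanner: (byte offset, byte length) per message.
messages = [
    (0, 421_094),
    (421_094, 420_877),
    (841_971, 421_310),
]

# Map each message to a "chunk key -> {path, offset, length}" entry, the shape
# a ChunkManifest needs. The chunk keys here assume messages concatenate along
# a single leading dimension (e.g. time or level).
manifest = {
    f"{i}.0.0": {"path": "s3://bucket/forecast.grib2", "offset": off, "length": n}
    for i, (off, n) in enumerate(messages)
}

assert manifest["1.0.0"]["offset"] == 421_094
```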

ref https://github.com/zarr-developers/VirtualiZarr/issues/11 ref https://github.com/zarr-developers/VirtualiZarr/issues/238

sharkinsspatial avatar Nov 21 '24 18:11 sharkinsspatial

  1. I am happy to support building a gribberish backend for VirtualiZarr. I personally rely on gribberish for 3 different production apps, so while it does not have the CF compliance of cfgrib, I am motivated to improve it tactically.
  2. I would make gribberish optional if you want to use it.
  3. The kerchunk backend for gribberish just flatmaps coordinates, which is probably not what a lot of people want (I wanted to avoid data trees).

An alternative is to build a non-kerchunk cf_grib backend, or update the kerchunk GRIB backend to grab the latitude and longitude from the codec. This is already supported, but it is not how kerchunk works by default.

The reason it works like this is that GRIB2 encodes coordinates, and in many cases the coords are generated from metadata. So if you can, doing it up front and shoving it into bytes is smart. But you can just as easily force generation of the coord data from the codec; it doesn't even matter which GRIB message you ask, they will all be able to do so, because every GRIB message includes all the metadata needed to give back the coordinates.
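For example, on a regular lat/lon grid the coordinates fall out of a handful of grid-definition values present in every message. A numpy sketch (parameter names echo, but are not literally, the GRIB grid keys):

```python
import numpy as np

# Sketch: every GRIB2 message on a regular lat/lon grid carries the full grid
# definition, so coordinate arrays can be regenerated from metadata instead of
# being inlined. Parameter names below are illustrative, not literal GRIB keys.
def regular_latlon(la1, lo1, dlat, dlon, nj, ni):
    """Rebuild 1-D coordinate arrays from a regular lat/lon grid definition."""
    lats = la1 + dlat * np.arange(nj)
    lons = lo1 + dlon * np.arange(ni)
    return lats, lons

# A global 0.25-degree grid like GFS pgrb2.0p25: 721 x 1440 points.
lats, lons = regular_latlon(la1=90.0, lo1=0.0, dlat=-0.25, dlon=0.25, nj=721, ni=1440)
print(lats[0], lats[-1], lons.size)  # → 90.0 -90.0 1440
```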

To summarize:

  1. I think the best case would be to build a simple wrapper around cfgrib to start, because it is the most CF-compliant
  2. I support the idea of using gribberish, I can try to help as much as my time allows but I cannot promise anything
  3. Either way you choose, you can always use the GRIB codec to get the coordinates instead of always forcing them to be inlined.

I hope this was helpful

mpiannucci avatar Nov 21 '24 21:11 mpiannucci

Following up here, and noticing yesterday's updates - @maxrjones, is there a target date for v2.0? GRIB support here would be a big unblocker for a current project.

darothen avatar Jul 14 '25 20:07 darothen

Hi @darothen - although we're planning to release v2.0 this week, that isn't slated to include GRIB support. However, @sharkinsspatial has been working on a GRIB parser very recently, so he might be able to give you an estimate.

TomNicholas avatar Jul 14 '25 20:07 TomNicholas

@darothen This is a WIP experimental parser specifically for the HRRR hourly data: https://github.com/virtual-zarr/hrrr-parser. Given my unfamiliarity with GRIB data, we are taking the initial approach of building dataset-specific parsers for GRIB data to simplify things a bit, with the possibility of creating a more generalized one in the future.

Is there a specific GRIB dataset you are working with? We may be able to tackle a parser and codec for that as well if it aligns with some of our other work.

sharkinsspatial avatar Aug 04 '25 17:08 sharkinsspatial

Hey @sharkinsspatial that's really cool - looking forward to trying that out sometime today!

The two datasets I'm most interested in are the open GFS archive on GCS at gs://global-forecast-system (prefixes gfs.YYYYMMDD/HH/atmos/gfs.tHHz.pgrb2.0p25.fXXX) and the IFS forecasts from the ECMWF open data archive on GCS at gs://ecmwf-open-data (prefixes YYYYMMDD/HHz/ifs/0p25/oper/YYYYMMDDHHmmss-X-oper-fx.grib2, where X is the forecast lead time in hours). There are other interesting products in the ECMWF open data archive (and it has a more "official" version on S3), but these are simpler for a proof case.
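For concreteness, those two prefix patterns can be written as small path builders (illustrative only; the placeholders are transcribed literally from the patterns above, so double-check the naming against the actual buckets before relying on it):

```python
from datetime import datetime

def gfs_path(init: datetime, lead_hr: int) -> str:
    """GFS archive object path, following the prefix pattern quoted above."""
    return (
        "gs://global-forecast-system/"
        f"gfs.{init:%Y%m%d}/{init:%H}/atmos/gfs.t{init:%H}z.pgrb2.0p25.f{lead_hr:03d}"
    )

def ifs_path(init: datetime, lead_hr: int) -> str:
    """ECMWF open-data IFS object path; X is the forecast lead time in hours."""
    return (
        "gs://ecmwf-open-data/"
        f"{init:%Y%m%d}/{init:%H}z/ifs/0p25/oper/{init:%Y%m%d%H%M%S}-{lead_hr}-oper-fx.grib2"
    )

print(gfs_path(datetime(2024, 1, 1, 6), 3))
# → gs://global-forecast-system/gfs.20240101/06/atmos/gfs.t06z.pgrb2.0p25.f003
```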

I work with these datasets daily; if there are additional details I can provide or sample code for working with them, let me know! One of our goals is to create a lazily-referenced version of these data archives, subset with (a) the fields required to initialize a typical MLWP forecast like GraphCast or PanguWeather, and (b) the fields required to run WeatherBench.

CC @drewbo

darothen avatar Aug 04 '25 17:08 darothen

@darothen Perfect. GFS was the next most requested GRIB dataset for virtualization, so hopefully I can carve out a bit of time to tackle it in the next few weeks. If you do test the hrrr-parser, any issues or feedback are greatly appreciated. I'm relatively unfamiliar with GRIB (and forecast data in general), so this is still quite experimental and can be tailored for more real-world use cases.

sharkinsspatial avatar Aug 04 '25 17:08 sharkinsspatial

Hi all, I am following this issue with a lot of interest. You are doing very cool stuff! I took inspiration from @sharkinsspatial's dataset-specific parser and started working on a parser for the operational NWP data from MeteoSwiss. In my case, I opted for earthkit-data to scan the GRIB messages and extract the metadata. What I find quite neat is that we can do this in a declarative way using xarray profiles, which control how the xarray object is constructed (in other words, how GRIB metadata is translated to coords, dims, and attrs). I believe it might be an option to consider for a parser, especially for

creating a more generalized one in the future

Will share more later, or we can have a chat on the slack space (I just joined) if you would like.

frazane avatar Aug 06 '25 19:08 frazane

Very interesting @frazane! Is EarthKit able to export the byte ranges? Do you have any code to share?

TomNicholas avatar Aug 06 '25 20:08 TomNicholas

@TomNicholas here it is: https://github.com/MeteoSwiss/icon-ch-vzarr. This is of course a very rough first implementation, but the basic functionality is there.

Is EarthKit able to export the byte ranges?

earthkit-data creates a GribField object for each message, and you can easily extract information for the ChunkEntry. See here.

frazane avatar Aug 07 '25 09:08 frazane