kerchunk
potential performance improvements for GRIB files
I've been playing around a bit with reading GRIB files, but quickly hit the performance impact of the temporary files created by kerchunk/grib2.py, so I tried to find ways around this. As far as I understand so far, cfgrib requires access to entire files and also requires some file API, while eccodes is happy with in-memory GRIB messages as well. So I tried to read GRIB files using mostly eccodes and circumventing cfgrib where possible, which is orders of magnitude faster than the current method implemented in kerchunk, but sadly it doesn't do all the magic cfgrib does in assembling proper datasets in all cases. This lack of generality is the reason why I'm not proposing a PR (yet?), but rather seeking further ideas on the topic:
- Do others work on this as well?
- Do you have ideas on how to do the dataset assembly more generically?
Here's how I'd implement the "decompression", which I believe is relatively generic (but may still be incompatible with what the current kerchunk-grib does):
```python
import eccodes
import numcodecs
from numcodecs.compat import ndarray_copy


class RawGribCodec(numcodecs.abc.Codec):
    codec_id = "rawgrib"

    def encode(self, buf):
        # The raw GRIB message already is the encoded form.
        return buf

    def decode(self, buf, out=None):
        # Hand the in-memory message straight to eccodes; no temporary file.
        mid = eccodes.codes_new_from_message(bytes(buf))
        try:
            data = eccodes.codes_get_array(mid, "values")
        finally:
            eccodes.codes_release(mid)
        if hasattr(data, "build_array"):
            data = data.build_array()
        if out is not None:
            return ndarray_copy(data, out)
        return data
```
This gist shows how it may be possible to scan GRIB files without the need for temporary files.
@TomAugspurger , can you please link here your "cogrib" experiments? It indeed scans files without first downloading them, and we plan to upstream it here. I'm not sure if it answers all your points, @d70-t , perhaps you have gone further.
Aside from this, it should also be possible to peek into the binary description of the data and directly find the buffers representing the main array of each message. This is assuming we can understand the encoding, which is a very likely yes. This would allow:
- somewhat smaller downloads on read (the main array normally dominates a message's size)
- no need to call cfgrib (or eccodes) to interpret the array and no need to create the codec. We may need a different codec, depending on how the array is actually encoded.
- no creation of coordinate arrays for every message read. This is pretty fast, but can cause a big memory spike in eccodes and is wholly redundant
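To illustrate the "peek into the binary description" idea: a GRIB2 message is a sequence of length-prefixed sections, and the packed values live in section 7, so its byte range can be found by walking the section headers. Here is a minimal sketch based only on that section layout, run against a synthetic (not real) message; `find_data_section` is a made-up helper, and a real reader would also need section 5 to learn how the payload is encoded:

```python
import struct

def find_data_section(msg: bytes):
    """Locate the GRIB2 data section (section 7) inside one message.

    Returns (offset, length) of the section, including its 5-byte
    header. Assumes a well-formed edition-2 message.
    """
    assert msg[:4] == b"GRIB" and msg[7] == 2
    total_len = struct.unpack(">Q", msg[8:16])[0]
    pos = 16  # section 0 (the indicator) is 16 bytes long
    while pos < total_len - 4:  # stop before the trailing "7777"
        # every numbered section starts with a 4-byte big-endian
        # length and a 1-byte section number
        sec_len, sec_num = struct.unpack(">IB", msg[pos:pos + 5])
        if sec_num == 7:
            return pos, sec_len
        pos += sec_len
    raise ValueError("no data section found")

# Build a synthetic edition-2 message: indicator, a fake section 1,
# a section 7 carrying 8 payload bytes, and the "7777" end marker.
sec1 = struct.pack(">IB", 21, 1) + b"\x00" * 16
payload = b"\x01\x02\x03\x04\x05\x06\x07\x08"
sec7 = struct.pack(">IB", 5 + len(payload), 7) + payload
body = sec1 + sec7 + b"7777"
msg = (b"GRIB" + b"\x00\x00" + b"\x00" + b"\x02"
       + struct.pack(">Q", 16 + len(body)) + body)

off, length = find_data_section(msg)
print(off, length)                 # section 7 offset and length
print(msg[off + 5:off + length])   # the raw packed values
```

A kerchunk-style index could then store just that byte range per message, so a read downloads only the packed array rather than the whole message.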
That's at https://github.com/TomAugspurger/cogrib.
> It indeed scans files without first downloading them, and we plan to upstream it here.
That cogrib experiment does need to download the whole file when it's being "kerchunked". Users accessing it through fsspec's reference filesystem don't need to download it, and it doesn't need a temporary file.
It'd be nice to avoid the temporary file for scanning too, but one of my desires was to match the output of cfgrib.
cogrib looks very nice :+1:
And yes, the issue with cfgrib compatibility is what bothers me most as well in my current attempt (I chose to drop compatibility for speed). I'd really hope we'd be able to figure out a way to do both: no temporary files and cfgrib compatibility.
Actually, can you please enlighten me what "compatibility" means here? I thought cfgrib was a pretty thin wrapper around eccodes.
As far as I understand GRIB (I'm really bad at this), GRIB doesn't know about dimensions and coordinates which are shared between arrays. GRIB files consist of messages (which are chunks + per-chunk metadata) and nothing shared by those messages. cfgrib guesses how to assemble those messages into a Dataset based on what it finds among the per-message metadata.
As always with guessing, there are multiple options for how you might want to do this and which conventions should be followed, so when rolling your own guesswork, you might end up with something different.
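To make that guesswork concrete, here is a toy sketch (not cfgrib's actual algorithm) showing how the choice of grouping keys over per-message metadata determines which datasets come out; all metadata values below are invented:

```python
from collections import defaultdict

# Hypothetical per-message metadata, as a scanner might extract it.
messages = [
    {"shortName": "t", "typeOfLevel": "isobaricInhPa", "level": 500, "step": 0},
    {"shortName": "t", "typeOfLevel": "isobaricInhPa", "level": 850, "step": 0},
    {"shortName": "u", "typeOfLevel": "isobaricInhPa", "level": 500, "step": 0},
    {"shortName": "t2m", "typeOfLevel": "surface", "level": 0, "step": 0},
]

def group_messages(messages, dataset_keys):
    """Split messages into would-be datasets by the chosen metadata keys.

    The choice of dataset_keys is exactly the guesswork: grouping by
    typeOfLevel splits surface and pressure-level fields into separate
    datasets, while other key choices yield different splits.
    """
    groups = defaultdict(list)
    for m in messages:
        groups[tuple(m[k] for k in dataset_keys)].append(m)
    return dict(groups)

by_level_type = group_messages(messages, ["typeOfLevel"])
print(sorted(by_level_type))                    # two dataset groups
print(len(by_level_type[("isobaricInhPa",)]))   # three messages share one group
```

Grouping by `["shortName"]` instead would produce three groups from the same messages, which is the kind of divergence that makes two independent implementations come out "incompatible".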
That matches what I mean by compatibility too. The output of kerchunking a GRIB file should be a list of datasets equal to what you get from cfgrib.open_datasets(file). I'll want to stretch that definition a bit to handle concatenating data from many GRIB files along time, but the basic idea is that I don't want to guess how to assemble messages into datasets.
From working previously on GRIBs, I also want to add that for some files you cannot use open_datasets without supplying appropriate filters, because of coordinate mismatches between messages.
Do you mean open_datasets (plural) or open_dataset (singular)? I don't think I've run into files where open_datasets fails, but I haven't tried on too many different types of files.
Yes, the singular open_dataset.
We've been working a bit more on our gribscan, which is now also available at gribscan/gribscan. It's still very fragile, deliberately doesn't care about being compatible with the output of cfgrib, and potentially requires users to implement their own Magician.
Magician?? :)
Do you intend to integrate any of the work into, or even replace, grib2 in this repo? Do you have any comments on how it compares with @TomAugspurger 's cogrib?
Note that with the latest version of numcodecs, you no longer need to import and register your codec, but can instead declare it via package entry points.
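For reference, such an entry-point declaration might look roughly like this in a package's pyproject.toml; the `mypkg` module path is a placeholder, and the `"numcodecs.codecs"` group name is my understanding of what numcodecs uses:

```toml
[project.entry-points."numcodecs.codecs"]
# maps the codec_id to the class implementing it
rawgrib = "mypkg.codecs:RawGribCodec"
```

Once this works, `numcodecs.get_codec({"id": "rawgrib"})` should be able to locate the class without an explicit `numcodecs.register_codec` call at import time.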
:-) yes, we call the customization points a Magician because that's the part where users have to put their guesswork of how to assemble datasets to "magically" stuff the grib messages together.
That's also the biggest difference to cogrib: we do not try to have a universal tool which makes some dataset out of almost any GRIB. Instead we require customization to make the resulting dataset nicer. That's under the assumption that someone who is involved in creating the initial GRIBs might put some valuable knowledge into it.
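Since gribscan's actual Magician interface isn't shown here, the customization-point idea might be sketched roughly like this, with all names and methods hypothetical:

```python
class Magician:
    """Hypothetical customization point: subclasses encode the user's
    knowledge of how messages should be assembled into a dataset."""

    def variable_name(self, meta):
        # Default guess: just use the GRIB shortName.
        return meta["shortName"]

    def dims(self, meta):
        # Default guess: every message is a 2D field stacked along time.
        return ("time", "y", "x")


class SurfaceMagician(Magician):
    # A user who knows their GRIBs only contain surface fields can
    # hard-code nicer variable names while inheriting the defaults.
    def variable_name(self, meta):
        renames = {"t2m": "temperature_2m"}
        return renames.get(meta["shortName"], meta["shortName"])


meta = {"shortName": "t2m"}
print(SurfaceMagician().variable_name(meta))  # the renamed variable
print(SurfaceMagician().dims(meta))           # the inherited default dims
```

The point of the pattern is that the producer-specific guesswork lives in one small subclass instead of being baked into the scanner itself.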
The latest version of numcodecs isn't released yet... We've got the entrypoints set up, but they don't yet work 😬
Currently it works for some GRIBs, but is not really stable yet and we need to gain more experience... Thus we thought it might need a little time before we really want it in kerchunk.
@TomAugspurger , I'm sure your thoughts on gribscan would be appreciated, if you have the time to look.
The magician looks quite a lot like what happens in MultiZarrToZarr - if each of the messages of a grib were made into independent datasets, and combined with that, then maybe you wouldn't need your own mages. Sorry, sorcerers, ... er magicians.
Probably it would be possible to stuff some of the magicians into something like the coo_map... I'll have to think more about that.
Initially we've had a design which built one dataset per grib-file and then put all of them into MultiZarrToZarr. We moved away from that design, because we needed something which looks at the collection of all individual messages. But we didn't come up with the idea to make datasets out of each individual message.