
kerchunk interop

Open martindurant opened this issue 1 year ago • 23 comments

Ref:

  • https://github.com/zarr-developers/VirtualiZarr/issues/216
  • https://github.com/fsspec/kerchunk/pull/494

Feature Request

I was recently sniped into considering HDF4 as a target for kerchunk, largely because of the amount of NASA archival data in the format. I blindly set about decoding the format and made a decent amount of progress (see linked PR) to the point of seeing the arrays and attributes in a file.

After this, it was mentioned that this repo has a lot of relevant code (along with the pyhdf helper), and indeed! You seem to have solved not only hdf4 but all the peculiarities of many more specific archive conventions.

So, how hard would it be to extract kerchunk references, given that you can already pipe dask chunks into xarray?

cc @maxrjones @TomNicholas

martindurant avatar Aug 27 '24 13:08 martindurant

Sorry for my ignorance, but what is kerchunk? What does it need?

mraspaud avatar Aug 27 '24 14:08 mraspaud

Regarding pyhdf, I recently looked into it again, and one problem is that it does not expose the chunking information of the underlying file (that was a choice made when the implementation was done, see here). Other than that it seems to be working nicely for our needs.

mraspaud avatar Aug 27 '24 14:08 mraspaud

See https://github.com/pytroll/satpy/issues/2884 for recent discussion of when dask changes broke our pyhdf wrappers.

djhoese avatar Aug 27 '24 14:08 djhoese

but what is kerchunk

I should have started with this :) kerchunk allows zarr (and so xarray) to read any dataset where the binary buffers can be found and described. So, if we can infer that a file has an array of shape (x, y) and some dtype, chunked as (dx, dy), and we also know the offset/length and compression of each chunk, then we can construct a zarr array (a sketch of such a reference set follows the list below). The point of doing this is to:

  • avoid needing another third-party library once the scan has been done (and posted somewhere as a file)
  • allow concurrent fetches of the chunks with fsspec and parallel processing of partitions with dask
  • allow access to any of the filesystems fsspec supports, making it "cloud native"
  • combine potentially thousands of original archival files into a single logical dataset, so that xarray slicing can select only the needed bytes out of the files for a given calculation
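
For concreteness, a single-file kerchunk "version 1" reference set is just a JSON/dict mapping of zarr keys to either inline metadata or [url, offset, length] pointers into the original file. The file name, variable name, shape and byte offsets in this sketch are invented for illustration.

    # Minimal sketch of a kerchunk reference set; all values are made up.
    refs = {
        "version": 1,
        "refs": {
            ".zgroup": '{"zarr_format": 2}',
            # zarr array metadata: shape, chunking, dtype, compression
            "fire_mask/.zarray": (
                '{"shape": [2030, 1354], "chunks": [2030, 1354],'
                ' "dtype": "|u1", "compressor": null, "filters": null,'
                ' "fill_value": null, "order": "C", "zarr_format": 2}'
            ),
            "fire_mask/.zattrs": '{"_ARRAY_DIMENSIONS": ["y", "x"]}',
            # where this chunk's bytes live inside the original HDF4 file
            "fire_mask/0.0": ["s3://some-bucket/MOD14.A2024226.2345.061.hdf", 294912, 2748620],
        },
    }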

it does not expose the chunking information

I definitely got that far, associating chunk sets (offset/length/index) and dtype/compression with arrays! The struggle I have is working out the grouping hierarchy - which dimensions belong to which variable.

martindurant avatar Aug 27 '24 14:08 martindurant

I definitely got that far, associating chunk sets (offset/length/index) and dtype/compression with arrays! The struggle I have is working out the grouping hierarchy - which dimensions belong to which variable.

I'm not sure what you mean. Are you allowed to use a library like pyhdf to do the initial parsing? If so then h.select(<variable name>).dimensions() gives you the dimensions for that variable.
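
A short sketch of those pyhdf calls (the file and variable names are placeholders, not checked against a real granule):

    # Open an HDF4 file with pyhdf's SD interface and inspect one variable.
    from pyhdf.SD import SD, SDC

    h = SD("MOD14.A2024226.2345.061.2024227034233.hdf", SDC.READ)
    sds = h.select("fire mask")      # pick an SDS (variable) by name
    print(sds.dimensions())          # {dimension name: length, ...}
    print(sds.attributes())          # per-variable attributes
    sds.endaccess()
    h.end()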

djhoese avatar Aug 27 '24 14:08 djhoese

Are you allowed to use a library like pyhdf

Certainly, I just didn't know about it when I got going. I also have an irrational itch to solve it all in one place in pure python, but ...

martindurant avatar Aug 27 '24 14:08 martindurant

See #2884 for recent discussion of when dask changes broke our pyhdf wrappers.

This was interesting. Would a pyhdf xarray backend be helpful for satpy even after the fix in https://github.com/pytroll/satpy/pull/2886? I think this would be helpful for xarray workflows outside of satpy (e.g., https://discourse.pangeo.io/t/reading-modis-thermal-anomalies-into-xarray/4437) and may be able to find some time to work on it.

it does not expose the chunking information

I definitely got that far, associating chunk sets (offset/length/index) and dtype/compression with arrays! The struggle I have is working out the grouping hierarchy - which dimensions belong to which variable.

could this get implemented in pyhdf instead of kerchunk using SDgetchunkinfo()?

maxrjones avatar Aug 27 '24 14:08 maxrjones

Another thing to consider for parsing: if you're already using the NetCDF libraries to parse NetCDF (.nc) files, you could also use them to parse HDF4 files, provided the NetCDF C library was compiled with HDF4 support. This is the case for the conda-forge build of netcdf4.
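
A sketch of that route, assuming a netcdf-c build with HDF4 enabled (the file name is a placeholder):

    # netCDF4 can open HDF4 files directly when the underlying netcdf-c
    # library was compiled with HDF4 support (e.g. the conda-forge build).
    import netCDF4

    ds = netCDF4.Dataset("MOD14.A2024226.2345.061.2024227034233.hdf")
    print(list(ds.variables))
    print({name: var.dimensions for name, var in ds.variables.items()})
    ds.close()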

djhoese avatar Aug 27 '24 14:08 djhoese

See #2884 for recent discussion of when dask changes broke our pyhdf wrappers.

This was interesting. Would a pyhdf xarray backend be helpful for satpy even after the fix in #2886? I think this would be helpful for xarray workflows outside of satpy (e.g., discourse.pangeo.io/t/reading-modis-thermal-anomalies-into-xarray/4437) and may be able to find some time to work on it.

I for one would love this.

it does not expose the chunking information

I definitely got that far, associating chunk sets (offset/length/index) and dtype/compression with arrays! The struggle I have is working out the grouping hierarchy - which dimensions belong to which variable.

could this get implemented in pyhdf instead of kerchunk using SDgetchunkinfo()?

Yes, it absolutely could, but I got lost in the SWIG interface and didn't know where to start to add it to pyhdf...

mraspaud avatar Aug 27 '24 14:08 mraspaud

Are you allowed to use a library like pyhdf

Certainly, I just didn't know about it when I got going. I also have an irrational itch to solve it all in one place in pure python, but ...

I definitely share that itch, but I'm always worried about performance...

mraspaud avatar Aug 27 '24 14:08 mraspaud

if you're already using NetCDF libraries for parsing NetCDF (.nc) files

I'm sure it can read them, but we want the chunk information rather than the data. For HDF5, it makes data access MUCH faster (depending on access pattern, of course), e.g., https://nbviewer.org/github/cgentemann/cloud_science/blob/master/zarr_meta/cloud_mur_v41_benchmark.ipynb
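
Once a reference set exists, reading it looks roughly like this (the reference file name, remote protocol and options are placeholders, and the exact keyword arguments vary between xarray/zarr versions):

    # Open a kerchunk reference set through fsspec's "reference://" filesystem
    # and xarray's zarr engine; only the requested byte ranges get fetched.
    import xarray as xr

    ds = xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": {
                "fo": "refs.json",          # the reference file from the scan
                "remote_protocol": "s3",    # where the original bytes live
                "remote_options": {"anon": True},
            },
        },
    )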

martindurant avatar Aug 27 '24 14:08 martindurant

I'm always worried about performance...

Seeking around a remote file to get all those 4-byte definitions will perform poorly no matter how you do it, so it seems to me that extracting what you need once and then reading it one-shot at access time should be much faster. However, HDF4 does tend to use tiny chunks (since the conventions predate the cloud era), so it would mean storing lots of chunk references - even though the chunks are probably contiguous in the file.

martindurant avatar Aug 27 '24 14:08 martindurant

I'm always worried about performance...

Seeking around a remote file to get all those 4-byte definitions will perform poorly no matter how you do it, so it seems to me that extracting what you need once and then reading it one-shot at access time should be much faster.

Sorry for going off-topic, but that does not apply to real-time processing, I suppose, where you just need to read the data once, process it, save the result, and in the end delete the data file? Not that this is a show stopper in any way, just curious about my main use of the data :)

mraspaud avatar Aug 27 '24 14:08 mraspaud

that does not apply to real-time processing

No. In that case, you may as well grab the whole file locally (in memory) and you end up scanning the bits of metadata just the once, and probably processing every byte of array data. Kerchunk allows scanning the metadata once and putting it into a convenient format, so that others don't have to. Perhaps there are scenarios where the file structure is byte-identical across some range, and so you can apply what you find in one to the others; that's unlikely to be true in the presence of compression.

martindurant avatar Aug 27 '24 14:08 martindurant

Seeking around a remote file to get all those 4-byte definitions will perform poorly no matter how you do it, so it seems to me that extracting what you need once and then reading it one-shot at access time should be much faster.

Sorry for going off-topic, but that does not apply to real-time processing, I suppose?

FYI I see kerchunk / virtualizarr as a form of caching. You're caching the result of the step where, given a so-far-unseen file, you have to seek through it to read the metadata and find the positions and lengths of the byte ranges for the chunks inside the file. On local filesystems this step is quick and so there isn't much benefit to caching, but on object storage there is no seek, and LIST and GET requests take long enough that it's well worth caching the result of that step.

With virtualizarr's design you can also think of this workflow as caching the result of xarray.open_mfdataset, thereby saving you the time that step can take on every subsequent access to the data.
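
As a hedged sketch of that workflow (file names are placeholders, and the VirtualiZarr API may differ between versions): scan each file once, combine the resulting virtual datasets like ordinary xarray objects, then persist the references so later opens skip the scanning step.

    # Illustrative only: build and cache references for a collection of files.
    import xarray as xr
    from virtualizarr import open_virtual_dataset

    paths = ["granule_1.hdf", "granule_2.hdf"]   # placeholders; HDF4 input
                                                 # would need the scanner
                                                 # discussed in this issue
    vds = xr.combine_nested(
        [open_virtual_dataset(p) for p in paths],
        concat_dim="time",
        coords="minimal",
        compat="override",
    )
    vds.virtualize.to_kerchunk("combined_refs.json", format="json")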

TomNicholas avatar Aug 27 '24 15:08 TomNicholas

a form of caching

That's a decent way to describe it, but some (C) backend libraries won't read from remote storage or won't do concurrent/parallel reads at all. It wouldn't be too surprising if, for instance, pyhdf had global state that prevented threaded use; eccodes (for cfgrib) certainly does.

martindurant avatar Aug 27 '24 15:08 martindurant

After poking around the trial dataset (MOD14.A2024226.2345.061.2024227034233.hdf - I don't actually know which variant this is within satpy) with pyhdf, I found that my code actually did find exactly the right set of tags and hierarchy.

My question is: when opened with xarray/rasterio/gdal, I get these coordinates:

  * x            (x) float64 0.5 1.5 2.5 3.5 ... 1.352e+03 1.352e+03 1.354e+03
  * y            (y) float64 0.5 1.5 2.5 3.5 ... 2.028e+03 2.028e+03 2.03e+03

which are not arrays in the file, but clearly generated somehow. Is it just pixel centres??

martindurant avatar Aug 27 '24 15:08 martindurant

a form of caching

That's a decent way to describe it, but some (C) backend libraries won't read from remote storage or won't do concurrent/parallel reads at all. It wouldn't be too surprising if, for instance, pyhdf had global state that prevented threaded use; eccodes (for cfgrib) certainly does.

The remote and/or parallel reading is a good point. I don't think hdf4 supports remote reading, and I highly doubt any new development on the library would be done to support it. And we are using remote reading more and more, so it definitely makes sense to have it implemented for these files somehow.

mraspaud avatar Aug 28 '24 06:08 mraspaud

After poking around the trial dataset (MOD14.A2024226.2345.061.2024227034233.hdf - I don't actually know which variant this is within satpy) with pyhdf, I found that my code actually did find exactly the right set of tags and hierarchy.

My question is: when opened with xarray/rasterio/gdal, I get these coordinates:

  * x            (x) float64 0.5 1.5 2.5 3.5 ... 1.352e+03 1.352e+03 1.354e+03
  * y            (y) float64 0.5 1.5 2.5 3.5 ... 2.028e+03 2.028e+03 2.03e+03

which are not arrays in the file, but clearly generated somehow. Is it just pixel centres??

I don't really know what gdal does tbh, but the convention is pixel centres, yes. That being said, I don't think this is something an hdf4 library should generate; that should be more on the user-library side, like rioxarray or satpy, really.
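
If it helps, a guess at how those values arise: with no geolocation arrays in the file, the GDAL/rioxarray-style default for an n-pixel axis is simply the pixel-centre index 0.5, 1.5, ..., n - 0.5, which matches the ranges printed above for a 2030 x 1354 granule.

    # Reproduce the printed coordinate values as plain pixel-centre indices;
    # the shape is inferred from the output above, not read from the file.
    import numpy as np

    ny, nx = 2030, 1354
    y = np.arange(ny, dtype="float64") + 0.5    # 0.5, 1.5, ..., 2029.5
    x = np.arange(nx, dtype="float64") + 0.5    # 0.5, 1.5, ..., 1353.5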

mraspaud avatar Aug 28 '24 07:08 mraspaud

a form of caching

That's a decent way to describe it, but some (C) backend libraries won't read from remote storage or won't do concurrent/parallel reads at all. It wouldn't be too surprising if, for instance, pyhdf had global state that prevented threaded use; eccodes (for cfgrib) certainly does.

The remote and/or parallel reading is a good point.

I'm not sure I understand this point. Once you have the byte offsets and lengths, you don't need to use any specialized backend libraries to read the data; you just read those byte ranges (e.g. via HTTP range requests to object storage). You can do that with as much parallelism as you like, because object storage doesn't have file locks.
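
A sketch of that direct-read path (bucket, key and byte ranges are invented):

    # Fetch raw chunk bytes by offset/length with fsspec, no HDF library involved.
    import fsspec

    fs = fsspec.filesystem("s3", anon=True)
    path = "some-bucket/MOD14.A2024226.2345.061.hdf"
    offsets = [294912, 3043532]
    lengths = [2748620, 1048576]
    chunks = fs.cat_ranges(
        [path] * len(offsets),
        starts=offsets,
        ends=[o + n for o, n in zip(offsets, lengths)],
    )  # list of bytes objects; requests are issued concurrently where possible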

TomNicholas avatar Aug 28 '24 20:08 TomNicholas

Once you have the byte offsets and lengths, you don't need to use any specialized backend libraries to read the data

Exactly true, but if you use one third-party library (like netcdf4), it very probably acts on a file-like object with seek/read and serial, blocking access. Otherwise, you need another layer to know what to do with those byte blocks. Only more modern IO libraries (like zarr!) know that not everything is a disk. As one example, fsspec does this offset-finding in order to concurrently fetch many byte ranges, and can speed up reading of parquet by a decent factor in some cases - and that's for the modern arrow library.
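
The parquet case mentioned there looks roughly like this (the path and column name are placeholders): fsspec reads the footer metadata first, then issues concurrent range requests for only the column chunks the engine will need.

    # Hedged illustration of fsspec's parquet pre-fetching.
    import fsspec.parquet
    import pandas as pd

    with fsspec.parquet.open_parquet_file(
        "s3://some-bucket/data.parquet",
        columns=["temperature"],
        storage_options={"anon": True},
    ) as f:
        df = pd.read_parquet(f, columns=["temperature"])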

martindurant avatar Aug 29 '24 13:08 martindurant

@martindurant a bit off topic, but how far did you get with that python-only hdf4 reader of yours? Is there a prototype somewhere I can play with?

mraspaud avatar Feb 28 '25 16:02 mraspaud

https://github.com/fsspec/kerchunk/blob/main/kerchunk/hdf4.py - it exists and works for some specific files, but is not very well tested at all.
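
For anyone wanting to try it, usage presumably follows the pattern of kerchunk's other scanners (a class wrapping a file, with a .translate() method that returns the reference dict); the class name below is an assumption, so check kerchunk/hdf4.py for the actual API.

    # Assumed API, mirroring kerchunk's other backends; verify against the module.
    from kerchunk.hdf4 import HDF4ToZarr   # name assumed

    refs = HDF4ToZarr("MOD14.A2024226.2345.061.2024227034233.hdf").translate()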

martindurant avatar Feb 28 '25 16:02 martindurant