
add to xarray backends docs

dcherian opened this issue 1 year ago

Now that there is a kerchunk engine, it'd be nice to mention it in the Xarray docs:

https://docs.xarray.dev/en/stable/user-guide/io.html

And maybe a super simple example in the tutorial repo: https://tutorial.xarray.dev that then directs people to a more in-depth cookbook?

dcherian avatar Jun 13 '24 17:06 dcherian

@Anu-Ra-g , would you like to do this?

martindurant avatar Jun 17 '24 14:06 martindurant

Yes. Will this tutorial be related to the zarr data format?

Anu-Ra-g avatar Jun 17 '24 14:06 Anu-Ra-g

will this tutorial be related to the zarr data format

It would show how to use engine="kerchunk" for a simple example reference set.

martindurant avatar Jun 17 '24 15:06 martindurant

In https://docs.xarray.dev/en/stable/user-guide/io.html I would just add a kerchunk section and maybe link here. We just want people to know that it's possible and to have a place to go for more information.

dcherian avatar Jun 17 '24 15:06 dcherian

I've made this pull request, https://github.com/pydata/xarray/pull/9146

Anu-Ra-g avatar Jun 20 '24 10:06 Anu-Ra-g

I've done some performance testing of Kerchunk against something the NCAS-CMS team have developed called CFA, and I've discovered something interesting about the Xarray kerchunk engine compared to the old get_mapper/open_zarr method and to CFA. It looks like, for the kerchunk engine, the data loading happens at the point of slicing rather than at compute time?

[screenshot: performance timing comparison]

dwest77a avatar Jul 23 '24 09:07 dwest77a

Can I see some minimal code to get this behaviour, please?

martindurant avatar Jul 23 '24 13:07 martindurant

For reference, the kerchunk backend essentially does

    m = fsspec.get_mapper("reference://", fo=filename_or_obj, **storage_options)
    return xr.open_dataset(m, engine="zarr", consolidated=False, **open_dataset_options)

Are you passing chunks= at all?

In general, I am happy to help try and get the best performance out of kerchunk and your implementation too - we have the same overall goal.
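
As a hedged sketch of how those keywords flow through (the file name is hypothetical, and the storage_options keys are whatever fsspec's ReferenceFileSystem accepts, e.g. remote_protocol):

    import xarray as xr

    # Keyword arguments are forwarded as shown above, so chunks= and
    # storage_options= both take effect when opening a reference set.
    ds = xr.open_dataset(
        "refs.json",        # hypothetical reference file
        engine="kerchunk",
        chunks={},          # back variables with dask arrays
        storage_options={"remote_protocol": "https"},  # assumption: chunks served over https
    )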

martindurant avatar Jul 23 '24 13:07 martindurant

I'm running individual cells in a Jupyter notebook to obtain timings for the different steps:

%%time 
import xarray as xr
kfile = 'example_CMIP_file.json'
ds = xr.open_dataset(kfile, engine='kerchunk')

Wall time: 7.81 s

%%time
h1 = ds['huss'].sel(lat=slice(-60,0), lon=slice(80,180)).isel(time=slice(10000,12000)).mean(dim='time')

Wall time: 2.75 s

%%time
h2 = h1.compute()

Wall time: 0.8 ms

%%time
h2.plot()

Wall time: 984 ms
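
One quick way to check where the load happens is to inspect h1.data right after the slice/mean cell:

    # A plain numpy.ndarray here means the data was already read during
    # sel/isel/mean; a dask (or xarray lazy-array) type means it is still lazy.
    print(type(h1.data))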

dwest77a avatar Jul 23 '24 13:07 dwest77a

Can I have the "example_CMIP_file.json" ?

martindurant avatar Jul 23 '24 13:07 martindurant

example_CMIP_file.json.zip

Uncompressed this file is 64MB, I can probably make a smaller version if needed.

This file has https:// links now so you should have no issues with accessing the data.

dwest77a avatar Jul 23 '24 13:07 dwest77a

For reference, the kerchunk backend essentially does [...] Are you passing chunks= at all?

I will test whether using open_dataset with the zarr engine gives a different result. For the second method (get_mapper/open_zarr) I've used:

    m = fsspec.get_mapper("reference://", fo=filename_or_obj, **storage_options)
    ds = xr.open_zarr(m, consolidated=False)

dwest77a avatar Jul 23 '24 13:07 dwest77a

Ah, I see these are local files - so there's not much testing I can do on that.

~~Two~~ Three recommendations:

  • the height, lat and lon arrays are very small and should be inlined. The time array is bigger, but it is always loaded eagerly, so maybe it should be inlined too. This will greatly improve the dataset open time.
  • the data chunksize is <100kB, so you would want to set chunks={...}, increasing the size by a decent integer factor along the axis you intend to analyse on, which will decrease the read latency. This is essential for remote (cloud) storage, though it might not make any difference for local files.
  • the JSON file isn't huge, but reading and decoding the JSON does take some time, so parquet reference storage might be better. This is only a micro-optimisation at this scale. A sketch covering all three recommendations follows below.
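
A minimal sketch of the three recommendations together (file names are hypothetical, and exact kerchunk signatures may vary between versions):

    import fsspec
    import kerchunk.df
    import kerchunk.hdf
    import xarray as xr

    src = "example_CMIP_file.nc"  # hypothetical source file

    # 1. Inline small arrays: chunks below inline_threshold bytes are embedded
    #    directly in the references, so coordinates need no extra reads on open.
    with fsspec.open(src) as f:
        refs = kerchunk.hdf.SingleHdf5ToZarr(f, src, inline_threshold=500).translate()

    # 2. Store the references as parquet rather than JSON, avoiding the JSON
    #    decode cost for large reference sets.
    kerchunk.df.refs_to_dataframe(refs, "example_refs.parq")

    # 3. Open with dask chunks that are an integer multiple of the stored chunk
    #    size along the axis being analysed, reducing per-read latency.
    ds = xr.open_dataset(
        "example_refs.parq",
        engine="kerchunk",
        chunks={"time": 2000},  # hypothetical: tune to the access pattern
    )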

martindurant avatar Jul 23 '24 13:07 martindurant

For reference, the CFA conventions/specification can be found here: https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md and my developing implementation is here: https://github.com/dwest77a/CFAPyX

My module is just the CFA reader backend for Xarray. It reads CFA-netCDF files like normal files, but with additional 'decoding' of aggregated variables into a wrapper object that gets passed to dask, so the data is only fetched when the dask array is indexed. This all uses the netCDF4 library, so it will only work with local files or, in the future, with files on S3 object store.
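
Not CFAPyX's actual code, but a minimal sketch of that general pattern: an array-like wrapper whose __getitem__ does the reading, so dask only touches the file when a chunk is actually indexed.

    import dask.array as da
    import numpy as np

    class LazyNetCDFArray:
        """Illustrative only: all reading is deferred to __getitem__."""

        def __init__(self, path, varname, shape, dtype):
            self.path, self.varname = path, varname
            self.shape, self.dtype = shape, np.dtype(dtype)
            self.ndim = len(shape)

        def __getitem__(self, key):
            import netCDF4
            # The file is opened and read only when dask requests this slice.
            with netCDF4.Dataset(self.path) as nc:
                return np.asarray(nc[self.varname][key])

    # Hypothetical variable metadata; dask calls __getitem__ once per chunk
    # at compute time, never before.
    wrapper = LazyNetCDFArray("example_CMIP_file.nc", "huss", (12000, 180, 360), "f4")
    arr = da.from_array(wrapper, chunks=(2000, 180, 360))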

dwest77a avatar Jul 23 '24 13:07 dwest77a

FYI: the variable h1, which I used to store the result of the sel/slice/mean operations, includes a data variable containing the sliced data. I may be wrong in my thinking, but shouldn't that not happen until I've run either h1.compute() or h1.plot()? I'll look at the other Kerchunk access methods to see what happens with those.

dwest77a avatar Jul 23 '24 14:07 dwest77a

Correct, the data should not be touched (except the coordinates) until you ask for it - whether with xarray's built-in laziness or backed by dask.

martindurant avatar Jul 23 '24 14:07 martindurant

With get_mapper/open_zarr instead, the object h1.data is an uncomputed dask.array<mean_agg-aggregate ...>:

[screenshot: h1.data repr showing an uncomputed dask array]

Using the Kerchunk Engine with the exact same slice/mean operation I get this:

[screenshot: h1.data repr showing loaded (computed) values]

Edit: interestingly, the zarr engine (xr.open_dataset(m, engine='zarr', ...)) shows the same as the kerchunk engine, so it looks like both load the data at the point of performing the mean().

dwest77a avatar Jul 23 '24 14:07 dwest77a

As you can see in this notebook (at output 5) from our recent blog, in that case, the object definitely stays as an uncomputed dask array after a sel() call.

martindurant avatar Jul 23 '24 14:07 martindurant

It looks like at the moment it's either the mean() operation or isel() that triggers the data loading. Strangely, I can show an object after sel/isel that has a dask array representation, but if I then compute the mean() and assign the result to a different object, the original object is also affected. See the screenshot below:

[screenshot: h2 repr showing loaded array values after the mean was computed on a derived object]

That second look at h2, where we can see the array values, does not happen when you use the old method without open_dataset. I think there must be something different in the Xarray backend for both the kerchunk and zarr engines, where the mean is applied but the result is still tied to the other objects, which is why h2 is also affected.

dwest77a avatar Jul 23 '24 14:07 dwest77a

@norlandrhagen , @TomNicholas knowing something about xarray engines, have you any idea how to explain what is described above? Why would calling mean() (no compute or access to .values) cause a compute of the underlying dask array?

martindurant avatar Jul 23 '24 14:07 martindurant

Why would calling mean() (no compute or access to .values) cause a compute of the underlying dask array?

It would not for a dask array. However, what you have is not a dask array but one of Xarray's internal lazy arrays. These will compute for any operation but indexing. I think you're missing a chunks={} in the open_dataset call.
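
A sketch of the difference, reusing the hypothetical reference file from above:

    import dask.array
    import xarray as xr

    # Without chunks=, variables are wrapped in Xarray's internal lazy arrays,
    # which compute on any operation other than indexing (e.g. mean()).
    ds_lazy = xr.open_dataset("example_CMIP_file.json", engine="kerchunk")

    # With chunks={}, variables are backed by dask arrays, so reductions stay
    # lazy until .compute(), .plot() or .values.
    ds_dask = xr.open_dataset("example_CMIP_file.json", engine="kerchunk", chunks={})

    h1 = ds_dask["huss"].sel(lat=slice(-60, 0)).mean(dim="time")
    print(isinstance(h1.data, dask.array.Array))  # True: nothing has been read yet
    h2 = h1.compute()  # the data is only loaded here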

dcherian avatar Jul 23 '24 16:07 dcherian

I think you're missing a chunks={}

I mentioned chunks= above too, but some of the embedded array objects seemed to be dask? Maybe not; I may have been mistaken.

martindurant avatar Jul 23 '24 16:07 martindurant

Was this closed by the merging of https://github.com/pydata/xarray/pull/9163?

TomNicholas avatar Aug 07 '24 20:08 TomNicholas

It appears so, yes.

martindurant avatar Aug 08 '24 02:08 martindurant