VirtualiZarr Use xarray's encode_cf / decode_cf functions to handle CF conventions

We might be able to replace some logic (especially that implemented in #156) with a call to xarray.decode_cf. This function can accept a Dataset and return a new dataset with different attributes and encoding etc. I think it's used internally in some of xarray's backends.

Also there is an idea to expose a corresponding xarray.encode_cf function in xarray, which we might also be able to use (see https://github.com/pydata/xarray/issues/4412).

Jun 24 '24 20:06 TomNicholas

I've been looking at this more, in the context of fixing #189.

I think we could probably almost use xarray.decode_cf out of the box, with a few subtleties:

Most of the actual operations of decode_cf can be turned off via flags, which is what we want, except for the BooleanCoder usage here
decode_cf includes a bunch of more complicated logic for deciding which variables are coords, and we really need that to be the same as in xarray to guard against issues like #189.
We don't want the wrapping by LazilyIndexedArray here

Aligning what we're doing in virtualizarr with this would be another reason to implement a virtualizarr xarray backend engine, see #35.

cc @ayushnag

Aug 17 '24 23:08 TomNicholas

I think the idea of an virtualizarr backend is appealing. One idea for how it can be implemented is by loading the actual data file and then also creating and storing the byte references in the virtual dataset. This way the dataset structure creation and data loading is handled by xarray, then the bytes are just an add on. This all hinges on if data loading and reference creation both doesn't take much extra time compared to doing just one.

As a side note, this might make the reference creation process (https://github.com/zarr-developers/VirtualiZarr/pull/87) simpler as well. Instead of searching for attrs, encoding, dimension names, etc the "chunk reader" only needs to create a low level chunk manifest (bytes, offset, path). The rest of the information is retrieved from the netcdf by xarray. Not sure if that is an actually time consuming part of reference creation however.

It would also add the ability to easily inline data (https://github.com/zarr-developers/VirtualiZarr/issues/62)

Basically the idea is that xr.open_dataset("data.nc", engine="virtualizarr") loads the netcdf file normally but then also reads byte ranges and creates ManifestArrays. Since I don't think it's possible to have two data arrays within one variable, perhaps all the data arrays will be replaced with ManifestArrays unless the top level params ask for data such as loadable_variables and cftime_variables

Aug 19 '24 20:08 ayushnag

Thanks @ayushnag . I think your suggestion is almost the opposite of what I'm suggesting doing in this issue, so I moved it to #221 to discuss there.

(Opposite in that I'm suggesting calling just part of xarray's open_dataset decoding logic from within open_virtual_dataset, whereas IIUC you are suggesting basically using all of xr.open_dataset and making it virtual as a final step.)

Aug 20 '24 17:08 TomNicholas

I think your suggestion makes sense and yes my suggestion was more related to the backend issue.

My only concern was that if the solution is to add this to open_virtual_dataset it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes. That's why I was suggesting opening the file and then adding references.

Although if decode_cf is the main portion left then the reimplementations are limited and your approach definitely makes more sense. Is the plan to add a modified copy of this function to virtualizarr?

Aug 20 '24 18:08 ayushnag

it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes.

I completely agree, which is why I've been looking into which parts of xarray we need. I think it's just decode_cf though (and actually only very limited parts of that).

Is the plan to add a modified copy of this function to virtualizarr?

Either that or make some small modifications to xarray upstream so that we can import and use xr.decode_cf directly. See also https://github.com/pydata/xarray/issues/4490, which talks about making xarray's interface for decoding clearer.

Aug 20 '24 18:08 TomNicholas