How to handle encoding
Encoding is the part of the virtualization problem that I have the least clear idea of how to handle using this library.
The discussion in #42 shows that currently, round-tripping an xarray dataset to disk as kerchunk references and then opening it with fsspec requires ignoring the encoding of the time variable. (In general, time variables are going to be the most common offenders for encoding mistakes, as different files in the same dataset can easily end up with different encoding options for times.)
Part of the problem is that encoding is kind of a second-class citizen in xarray's data model. It happens magically behind the scenes in the backend file opener system (see https://github.com/pydata/xarray/issues/8548 for a great overview), and then is attached at the Variable level as var.encoding, but there aren't general rules for propagating it through operations (see https://github.com/pydata/xarray/issues/1614 and https://github.com/pydata/xarray/issues/6323). xr.concat will just keep the encoding of the first object passed, which for us could lead to an incorrect result.
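A minimal illustration of that pitfall (behavior as of recent xarray versions; not VirtualiZarr code):

```python
import xarray as xr

a = xr.DataArray([1.0], dims="t", name="v")
b = xr.DataArray([2.0], dims="t", name="v")
a.encoding = {"scale_factor": 0.01}
b.encoding = {"scale_factor": 0.1}  # conflicting encoding

combined = xr.concat([a, b], dim="t")
print(combined.encoding)  # {'scale_factor': 0.01} - b's encoding is silently dropped
```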
There is also some ambiguity around whether certain types of encoding should be handled by xarray or by zarr. For example, scale and offset exist in the CF conventions and so are normally handled by xarray, but they could also be handled by a Zarr codec. If all encoding were represented at the zarr level then virtual array concatenation (#5) could potentially solve this.
One practical workaround for a lot of cases would be to load the encoded variables into memory (i.e. "inline" them) - see #62. Then xarray has effectively decoded them, you do the concatenation in memory, and they get re-encoded when you save back out to the new zarr store. For dimension coordinates you will often want to do this anyway.
> One practical workaround for a lot of cases would be to load the encoded variables into memory (i.e. "inline" them)
See also https://github.com/TomNicholas/VirtualiZarr/pull/73#issuecomment-2042915164, which implies that kerchunk / fsspec doesn't actually do any decoding of times at all.
I think a lot of encoding challenges can be solved by pushing encoding down the stack, into Zarr and out of Xarray. scale_factor / add_offset is a perfect minimal example of this. This can be implemented either as a user attribute, which triggers Xarray encoding / decoding, or as a Zarr codec, in which case Zarr handles the encoding / decoding.
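As a concrete illustration of the two layers (a minimal sketch, not VirtualiZarr code; note numcodecs' FixedScaleOffset uses the convention decoded = encoded / scale + offset, i.e. its scale is the reciprocal of CF's scale_factor):

```python
import numcodecs
import numpy as np

data = np.array([10.0, 10.5, 11.0])

# Xarray level: CF attrs stored in the variable's encoding drive
# encoding / decoding at IO time.
cf_encoding = {"scale_factor": 0.5, "add_offset": 10.0, "dtype": "int16"}

# Zarr level: the equivalent numcodecs filter, applied chunk-by-chunk by zarr
# itself, so plain Zarr readers see the same decoded floats as Xarray users.
codec = numcodecs.FixedScaleOffset(offset=10.0, scale=2, dtype="<f8", astype="<i2")
packed = codec.encode(data)  # array([0, 1, 2], dtype=int16) on disk
np.testing.assert_allclose(codec.decode(packed), data)
```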
There are two obvious advantages to doing this at the Zarr level:
- You can easily read and write data directly to the Zarr array, bypassing Xarray, if you wish. Zarr users see the same data as Xarray users.
- Concatenating data with different encoding becomes identical to concatenating data with different codecs. Is this supported by VirtualiZarr or not? It requires a different abstraction than the one currently implemented by kerchunk.
> Concatenating data with different encoding becomes identical to concatenating data with different codecs. Is this supported by VirtualiZarr or not? It requires a different abstraction than the one currently implemented by kerchunk.
Concatenating two zarr arrays with different codecs is not yet supported - it will currently raise an error, see https://github.com/TomNicholas/VirtualiZarr/issues/5#issuecomment-2048351956. It could be supported in future, but would require the virtual concatenation ZEP in zarr in order to serialize the resultant references (see #5).
It would be nice to factor out this encoding issue to be entirely handled at the level of virtual (/ zarr) array concatenation.
@TomNicholas Thanks for these links, this really helps with understanding the encoding handling and propagation landscape in xarray. I like @rabernat's rationale for pushing encoding considerations further down the stack into Zarr. I'm still new to what the path forward for xarray will be, but given the discussions around reducing encoding's surface area, it seems we might want to focus on pushing encoding down the stack for maximum flexibility in the longer term.
Though most of this is totally reliant on upstream Zarr work, I do think there are some practical decisions we could make now that might simplify things in the future. Thinking forward to https://github.com/TomNicholas/VirtualiZarr/issues/5, should we consider including encoding information at the ChunkEntry level rather than the array level? This way, if a VirtualZarrArray had mixed codecs, the zarr-python store would handle decoding at the individual chunk level using the codec appropriate for that chunk.
This might be premature, but was something I was considering as we think about how to handle encoding in the prototype hdf reader.
> Thinking forward to https://github.com/TomNicholas/VirtualiZarr/issues/5, should we consider including encoding information at the `ChunkEntry` level rather than the array level?
If we put encoding in the `ChunkEntry`s, then serializing the ManifestArray would require us to extend the manifest.json format in the storage manifest transformer ZEP (https://github.com/zarr-developers/zarr-specs/issues/287) to store the encoding for each chunk.
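Concretely, that would mean something like this (purely hypothetical - the current chunk manifest ZEP stores only path / offset / length per chunk):

```python
# Hypothetical per-chunk-codec manifest entries - NOT part of any spec.
manifest = {
    "0.0": {"path": "a.nc", "offset": 100, "length": 100,
            "codec": {"id": "fixedscaleoffset", "scale": 100, "offset": 0}},
    "0.1": {"path": "b.nc", "offset": 200, "length": 100,
            "codec": {"id": "fixedscaleoffset", "scale": 10, "offset": 5}},
}
```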
The suggestions above to move encoding into the Zarr layer would have xarray encoding => zarr codecs, and therefore one set of encoding per xr.Variable/zarr.Array, not per-chunk.
> This way, if a `VirtualZarrArray` had mixed codecs, the zarr-python store would handle decoding at the individual chunk level using the codec appropriate for that chunk.
The proposal in https://github.com/zarr-developers/zarr-specs/issues/288 is to handle mixed codecs by still having one codec per zarr array, but introduce a higher-level array type (VirtualZarrArray) that points back to multiple zarr arrays.
> might simplify things in the future
So in this roadmap the future-proof thing to do is avoid hacking mixed encodings into one Manifest.
Apologies if I'm explaining what you already know here @sharkinsspatial
Got it. I was totally misunderstanding the level of the VirtualZarrArray abstraction. 👍
To go back to the original issue, let me summarize where we are at with encoding.
Opening encoded variables as ManifestArrays works in the sense that they can be round-tripped, unless they have some kind of CF time encoding, which is why the round-tripping tests I added in #42 only work if you remove the time encoding first (see https://github.com/TomNicholas/VirtualiZarr/pull/42#discussion_r1536040324).
The solution to this is meant to be to load these encoded time variables and serialize them as bytes (see https://github.com/TomNicholas/VirtualiZarr/pull/42#issuecomment-2005070434). Whilst #69 implemented the ability to load selected variables as numpy arrays, and #73 implemented the ability to serialize them as bytes into the kerchunk references on-disk, this inlining does not yet work on variables that have any encoding (in the sense that it doesn't reproduce the same inlined bytes that calling kerchunk's SingleHdf5ToZarr would).
That means we currently:
a) Don't have support for inlining non-time variables that have encoding like scale_factor and add_offset,
b) Don't have any way at all to virtualize time-encoded variables.
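For context, the loading mechanism from #69 looks like this in use (a sketch, assuming the loadable_variables kwarg):

```python
from virtualizarr import open_virtual_dataset

# "time" is read and decoded eagerly by xarray rather than virtualized,
# sidestepping its CF time encoding entirely.
vds = open_virtual_dataset("file.nc", loadable_variables=["time"])
```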
cc @dcherian
@TomNicholas I'm at a bit of a decision point on decoding via numcodecs vs an Xarray CF-convention approach. I pushed https://github.com/TomNicholas/VirtualiZarr/pull/87/commits/7f1c1897dcad92cb988ea7e14a165d63fe23dad6, which handles translating CF scale_factor and add_offset into a FixedScaleOffset codec (would love feedback from more knowledgeable Xarray folks on whether my logic is correct here). This requires removing the CF convention attrs and altering the Zarray's dtype, so it is a fairly direct change. I haven't included robust testing yet until we reach a decision, but note that our roundtrip Xarray tests fail on assert_equal while passing with assert_allclose's default tolerances.
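Roughly, the translation being described looks like this (a hypothetical helper for illustration, not the actual commit's code):

```python
import numcodecs

def cf_attrs_to_codec(attrs: dict, packed_dtype: str) -> numcodecs.FixedScaleOffset:
    # Strip the CF attrs so xarray won't try to decode a second time.
    scale_factor = attrs.pop("scale_factor", 1.0)
    add_offset = attrs.pop("add_offset", 0.0)
    return numcodecs.FixedScaleOffset(
        offset=add_offset,
        scale=1.0 / scale_factor,  # numcodecs: decoded = encoded / scale + offset
        dtype="<f8",               # decoded dtype (cf. xarray's _choose_float_dtype)
        astype=packed_dtype,       # the on-disk dtype from the Zarray
    )
```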
Do we have an idea which direction we'd like to go here? I also noticed @dcherian's timely comment https://github.com/TomNicholas/VirtualiZarr/pull/122#discussion_r1609015818 around treating cftime as a codec as well. I think we can strike a balance, using all of Xarray's excellent decoding intelligence (I'm taking direct advantage of xarray's _choose_float_dtype) to drive our numcodecs configuration at the reader level. This way, as @rabernat pointed out, we can provide a seamless experience to both Zarr and Xarray users.
I'll defer to you all here given that I'm really new to the wonderfully painful world of encoding :], but I'll probably press on with the pure numcodecs approach for the time being for the HDF reader.
Thanks for the update @sharkinsspatial !
I don't know a huge amount about the best way to do this either.
> I think a lot of encoding challenges can be solved by pushing encoding down the stack, into Zarr and out of Xarray.
@rabernat is there actually an xarray issue tracking this suggestion? If there is I can't find it. And what would this mean in practice? Adding a bunch of codecs to zarr-python (or numcodecs or somewhere else) that xarray then imports to use for the decoding step in its backends?
> I'll probably press on with the pure numcodecs approach for the time being for the HDF reader.
I think it depends on whether your priority is to just get something working as an alternative to kerchunk's hdf5 reader, or to work towards a longer-term but more separated and modular solution. Any changes to xarray tend to happen more slowly but allow bigger impact.
Part of this story is to delegate handling of CF conventions to xarray - see #157.
@sharkinsspatial suggested an interesting idea to aim for today: can we open data from a netCDF file using a zarr reader and roundtrip it? This would require the chunk manifest on-disk to point to the netCDF file, but also the necessary codecs to live somewhere that they can be utilized by the zarr reader. Perhaps that implies lifting them out of xarray and into a separate package?
> @sharkinsspatial suggested an interesting idea to aim for today: can we open data from a netCDF file using a zarr reader and roundtrip it?
I just raised https://github.com/zarr-developers/zarr-specs/issues/303 to suggest this.
The rationale for this was the eventual goal of having xarray's CF decoding mechanisms represented completely as codecs, so that non-xarray clients using VirtualiZarr stores could rely directly on zarr-python's codec pipelines for decoding, without the need for additional logic. We've had to temporarily pause work on this while some upstream changes around codecs land in Zarr 3, but we'll pick this up again in the near future (you can watch this space: https://github.com/xarray-contrib/cf-codecs).
Originally posted by @sharkinsspatial in #670
+1. I hope that scale and offset can be pushed into a Zarr codec like it was in VirtualiZarr 1.0.
I came across this in this PR: https://github.com/zarrs/zarrs_icechunk/pull/5/files/92a091ed6102ff5ed38591b6e13538ee15e61771#r2220453003
Hi all, not sure if this is the right place to put this, or if I should start a separate issue.
There may be an issue with CF encoding for variables that have both the _FillValue attribute and scale / offset attributes. I'm finding that when creating virtual datasets for these, the scale / offset seems to be applied "permanently", in the sense that even if I open my VDS in xarray with decode_cf=False, I don't retrieve the scale / offset attributes, nor are the original packed values (in my case, sea surface temperature) retrieved. This is further causing issues because the _FillValue is now not being applied correctly (it seems like it was supposed to be applied before the scale / offset?).
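(For reference, CF conventions specify that _FillValue matches the packed values, so masking is indeed applied before scale / offset - a minimal sketch:)

```python
import numpy as np

packed = np.array([0, 150, -32768], dtype="int16")  # -32768 is the _FillValue
fill, scale_factor, add_offset = -32768, 0.01, 20.0

# Mask first, while values are still packed integers...
masked = np.where(packed == fill, np.nan, packed.astype("float64"))
# ...then unpack; the fill stays NaN instead of becoming a bogus value.
sst = masked * scale_factor + add_offset
```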
I couldn't find any sort of decode_cf kwarg to pass to open_virtual_dataset(), is there a similar functionality?
@DeanHenze Can you report what version of virtualizarr you are using? Under the hood, open_virtual_dataset constructs the xarray objects somewhat manually. In the case of decode_cf not affecting opened values, I suspect that we are simply not passing the decode_cf arg through to our internal xr.open_zarr call: https://github.com/TomNicholas/VirtualiZarr/blob/main/virtualizarr/xarray.py#L347. I'll try to make an MVE to verify this 👍.
As for the scale and offset not being applied correctly when decode_cf is not passed: we previously had a bug (https://github.com/zarr-developers/VirtualiZarr/issues/670) with HDF parsing that has been fixed in the version 2 release. If you are using a previous release, can you update to version 2 and report back your results 🙏.
Thanks @sharkinsspatial, I'm using version 1.2.0. I can also try with 1.3.x if you think that would help.
v2.0.0 @DeanHenze ! Not 1. anything 😆
> v2.0.0 @DeanHenze ! Not 1. anything 😆
Oh wow you all have been busy lately! So used to seeing 1.x.x that it didn't even register for me, woohoo! OK I'll give it a try. And if it works my bad I can delete this comment and all the other ones if you like, to not crowd up this thread. Thanks @TomNicholas and @sharkinsspatial
Better to just raise another issue instead of deleting anything here @DeanHenze