Manipulation of coordinates does not materialize to kerchunk refs
@norlandrhagen and I just came across what we believe is a bug when I manually set variables as coordinates on a virtual dataset.
To reproduce, I take a single CMIP6 output file and virtualize it:
from virtualizarr import open_virtual_dataset
url = 's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-186012.nc'
vds = open_virtual_dataset(url, indexes={}, reader_options={'storage_options':{'anon':True}})
vds
This works great, but some coordinates are declared as data variables (maybe this is related to #189?). Either way, if I try to correct this on the virtualized dataset, everything seems fine:
vds_modified = vds.set_coords(['latitude'])
vds_modified
Now I expected that these modifications would be saved when I materialize and reload the dataset:
import xarray as xr
vds_modified.virtualize.to_kerchunk(
'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
'testing.parquet',
engine='kerchunk',
backend_kwargs={
'storage_options':{"remote_options":{'anon':True}}
}
)
ds_reopened
but somehow I am getting a different variable as a coordinate? Note that 'longitude' is now a coordinate all of a sudden...
Note that this is my attempt to simplify a more complex multi-file situation, where we set all variables other than 'uo' as coordinates and the round-tripped xarray dataset did not reflect this at all. I am pretty confused about what is going on above, but I hope that investigating this curious issue will clear up the bug entirely.
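To make the multi-file case concrete, here is a minimal sketch of what "set everything except 'uo' as a coordinate" looks like. It uses a small in-memory toy dataset rather than the real virtual dataset, so the variable names and shapes are only stand-ins:

```python
import numpy as np
import xarray as xr

# Toy stand-in for the virtual dataset (names borrowed from the CMIP6 file,
# shapes made up for illustration).
data = np.zeros((2, 3))
toy = xr.Dataset(
    {
        "uo": (("j", "i"), data),
        "latitude": (("j", "i"), data),
        "longitude": (("j", "i"), data),
    }
)

# Promote every data variable except 'uo' to a coordinate.
non_data_vars = [name for name in toy.data_vars if name != "uo"]
toy_modified = toy.set_coords(non_data_vars)

print(sorted(toy_modified.coords))
print(list(toy_modified.data_vars))
```

In the real workflow the same set_coords call is applied to the virtual dataset before writing the references.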
~~My suspicion here is that there is some logic that acts on variables that have identical dimensions? longitude and latitude do so.~~
Testing the same as above but modifying another variable:
vds_modified = vds.set_coords(['vertices_longitude'])
vds_modified.virtualize.to_kerchunk(
'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
'testing.parquet',
engine='kerchunk',
backend_kwargs={
'storage_options':{"remote_options":{'anon':True}}
}
)
ds_reopened
and one more time
vds_modified = vds.set_coords(['lev_bnds'])
vds_modified.virtualize.to_kerchunk(
'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
'testing.parquet',
engine='kerchunk',
backend_kwargs={
'storage_options':{"remote_options":{'anon':True}}
}
)
ds_reopened
Both give the same output as above. It is also the same output if I do not modify the coordinates at all!
vds_modified = vds
vds_modified.virtualize.to_kerchunk(
'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
'testing.parquet',
engine='kerchunk',
backend_kwargs={
'storage_options':{"remote_options":{'anon':True}}
}
)
ds_reopened
So I think this might be a combination of #189 and a broken correspondence between the data_variables/coordinates order of the virtual dataset in memory and the ref on disk (or the way xarray is reading that back in).
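A quick way to make the mismatch explicit is to compare the coordinate names before and after the round trip. The helper below is a hypothetical convenience function (coord_diff is not part of xarray or VirtualiZarr), shown on toy in-memory datasets that only mimic the behaviour I am seeing:

```python
import numpy as np
import xarray as xr


def coord_diff(before: xr.Dataset, after: xr.Dataset) -> dict:
    """Return the coordinate names lost and gained between two datasets."""
    names_before = set(before.coords)
    names_after = set(after.coords)
    return {
        "lost": names_before - names_after,
        "gained": names_after - names_before,
    }


# Toy stand-ins: 'before' mimics vds_modified, 'after' mimics ds_reopened.
data = np.zeros((2, 3))
before = xr.Dataset(
    {
        "uo": (("j", "i"), data),
        "latitude": (("j", "i"), data),
        "longitude": (("j", "i"), data),
    }
).set_coords(["latitude"])
after = before.reset_coords(["latitude"]).set_coords(["longitude"])

diff = coord_diff(before, after)
print(diff)
```

With the real objects this would be called as coord_diff(vds_modified, ds_reopened).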
Coordinates don't exist in Zarr's data model, so when Xarray opens a Zarr store (or a kerchunk references representation of one), my understanding of how it determines which Zarr arrays should be set as coordinates is that:
1. It makes any 1D variable with the same name as its only dimension into a coordinate.
2. It looks at a 'coordinates' attribute in the metadata, which is deleted upon opening and re-added when saving using .to_zarr.
3. CF decoding can state that additional variables should be set as coordinates.
(would be great if you could confirm this @dcherian)
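As a rough illustration of the first two rules, here is a sketch using a plain in-memory dataset and xr.decode_cf. It does not involve Zarr, kerchunk, or VirtualiZarr at all, and the variable names are just borrowed from the example above, so treat it as a demonstration of how I understand Xarray's decoding rather than of what happens in the failing round trip:

```python
import numpy as np
import xarray as xr

data = np.zeros((2, 3))
raw = xr.Dataset(
    {
        # 'uo' carries a CF-style 'coordinates' attribute naming two variables.
        "uo": (("j", "i"), data, {"coordinates": "latitude longitude"}),
        "latitude": (("j", "i"), data),
        "longitude": (("j", "i"), data),
        # 1D variable whose name matches its only dimension.
        "lev": (("lev",), np.arange(4.0)),
    }
)

# Rule 1: 'lev' is already a coordinate, purely because of its name.
print(sorted(raw.coords))

# Rule 2: decoding promotes the variables named in the 'coordinates'
# attribute and moves that attribute from .attrs into .encoding.
decoded = xr.decode_cf(raw)
print(sorted(decoded.coords))
print(decoded["uo"].attrs)
print(decoded["uo"].encoding)
```

If this picture is right, whatever ends up in the 'coordinates' attribute of the written references decides which variables come back as coordinates, regardless of what set_coords did in memory.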
I believe that right now VirtualiZarr handles (1) correctly, has a bug in (2) (#189), and doesn't even attempt (3) yet.
Ayush's PR solves only (2), but it was never finished because it has no tests.
I tried to solve both (2) and (3) together in my PR by calling the same logic that Xarray uses when it does CF decoding. This is a bit of a rabbit hole though, and it would probably be better to just fix one thing at a time.
It would be great if one of you could pick up Ayush's (small!) PR and see if that solves your issue.
This is not a burning priority for the meeting as far as I can tell right now.
Def struggling to get stuff sorted for the ESGF meeting next week, but please ping me after if there is still a need!