Manipulation of coordinates does not materialize to kerchunk refs

Open jbusecke opened this issue 1 year ago • 3 comments

@norlandrhagen and I just came across what we believe is a bug when I manually set variables as coordinates on a virtual dataset.

To reproduce, I take a single CMIP6 output file and virtualize it:

from virtualizarr import open_virtual_dataset

url = 's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-186012.nc'

vds = open_virtual_dataset(url, indexes={}, reader_options={'storage_options':{'anon':True}})
vds
[image: repr of vds]

This works great, but some coordinates are declared as data variables (maybe this is related to #189?). Either way, if I try to correct this on the virtualized dataset, everything seems fine:

vds_modified = vds.set_coords(['latitude'])
vds_modified
[image: repr of vds_modified]

I expected these modifications to be saved when I materialize and reload the dataset:

import xarray as xr

vds_modified.virtualize.to_kerchunk(
    'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
    'testing.parquet',
    engine='kerchunk',
    backend_kwargs={
        'storage_options':{"remote_options":{'anon':True}}
    }
)
ds_reopened

but somehow I am getting a different variable as a coordinate. Note that 'longitude' is now a coordinate all of a sudden...

[image: repr of ds_reopened]

Note this is my attempt to simplify a more complex multi-file situation where we set all variables !='uo' as coordinates and the roundtripped xarray dataset did not reflect this at all. I am pretty confused about what is going on above, but hope that investigating this curious issue will clear up this bug entirely.

jbusecke avatar Oct 29 '24 22:10 jbusecke

~~My suspicion here is that there is some logic that acts on variables that have identical dimensions? longitude and latitude do so.~~

Testing the same as above but modifying another variable:

vds_modified = vds.set_coords(['vertices_longitude'])
vds_modified.virtualize.to_kerchunk(
    'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
    'testing.parquet',
    engine='kerchunk',
    backend_kwargs={
        'storage_options':{"remote_options":{'anon':True}}
    }
)
ds_reopened
[image: repr of ds_reopened]

and one more time

vds_modified = vds.set_coords(['lev_bnds'])
vds_modified.virtualize.to_kerchunk(
    'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
    'testing.parquet',
    engine='kerchunk',
    backend_kwargs={
        'storage_options':{"remote_options":{'anon':True}}
    }
)
ds_reopened

Both give

[image: repr of ds_reopened]

which is the same output as above. It is also the same output if I do not modify the coordinates at all!

vds_modified = vds
vds_modified.virtualize.to_kerchunk(
    'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
    'testing.parquet',
    engine='kerchunk',
    backend_kwargs={
        'storage_options':{"remote_options":{'anon':True}}
    }
)
ds_reopened

So I think this might be a combination of #189 and a broken correspondence between the data_variables/coordinates order of the virtual dataset in memory and the ref on disk (or the way xarray is reading that back in).
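A rough way to check this hypothesis, sketched under assumptions: if the references follow kerchunk's JSON layout (a `refs` mapping whose `.zattrs` entries hold JSON-encoded attributes), one can list every `coordinates` attribute that actually ended up in them. `find_coordinates_attrs` and the sample dictionary are made up for illustration; neither is part of VirtualiZarr or kerchunk, and the sample is not the output of the file above.

```python
import json


def find_coordinates_attrs(refs: dict) -> dict:
    """Map each array name ("" for the root group) to its 'coordinates' attribute."""
    # Kerchunk JSON nests the keys under "refs"; accept a bare mapping too.
    entries = refs.get("refs", refs)
    found = {}
    for key, value in entries.items():
        if key != ".zattrs" and not key.endswith("/.zattrs"):
            continue
        # Metadata entries are usually JSON strings, sometimes already dicts.
        attrs = json.loads(value) if isinstance(value, str) else value
        if "coordinates" in attrs:
            name = key.rpartition("/")[0]  # "" for the root group
            found[name] = attrs["coordinates"]
    return found


# Hypothetical, hand-written references for illustration only.
sample_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        ".zattrs": '{"coordinates": "latitude"}',
        "uo/.zattrs": '{"_ARRAY_DIMENSIONS": ["time", "lev", "j", "i"]}',
        "longitude/.zattrs": '{"_ARRAY_DIMENSIONS": ["j", "i"]}',
    },
}

print(find_coordinates_attrs(sample_refs))
```

If I read the API correctly, calling `to_kerchunk` with `format="dict"` instead of `format="parquet"` should return a dictionary that can be passed to this helper, which would show whether the `set_coords` call changed anything in the written metadata.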

jbusecke avatar Oct 29 '24 22:10 jbusecke

Coordinates don't exist in Zarr's data model, so when Xarray opens a Zarr store (or a kerchunk references representation of one), it has to decide which Zarr arrays should be set as coordinates. My understanding is that it:

  1. makes any 1D variable with the same name as its only dimension into a coordinate,
  2. looks at a 'coordinates' attribute in the metadata, which is deleted upon opening and re-added when saving using .to_zarr,
  3. lets CF decoding state that additional variables should be set as coordinates.

(would be great if you could confirm this @dcherian)

I believe right now VirtualiZarr handles (1) correctly, has a bug in (2) (#189), and doesn't even try to do (3) yet.

Ayush's PR solves only (2), but it was never finished because it has no tests.

I tried to solve both (2) and (3) together in my PR by calling the same logic that Xarray uses when it does CF decoding. This is a bit of a rabbit hole though, and it would probably be better to just fix one thing at a time.

It would be great if one of you could pick up Ayush's (small!) PR and see if that solves your issue.

TomNicholas avatar Oct 29 '24 23:10 TomNicholas

This is not a burning priority for the meeting as far as I can tell right now.

Def struggling to get stuff sorted for the ESGF meeting next week, but please ping me after if there is still a need!

jbusecke avatar Oct 30 '24 01:10 jbusecke