kerchunk icon indicating copy to clipboard operation
kerchunk copied to clipboard

Support RAMS Model Output

Open lsterzinger opened this issue 3 years ago • 6 comments

So I'm going to start trying to put together a way to get reference maker to work with the output from the model I use (https://github.com/RAMSmodel/rams). This is probably a very fringe usecase, but if I get this working on my research data that means I can spend a bit more time working on reference maker and justify it as research work :wink:

There are a few things that might make this tricky.

phony dims

I added a patch to add phony dims if they weren't present already #45 (which fixed #41) but this seems to break on this dataset. Namely, the RAMS output has different dimensions for different variables. This wouldn't be a problem if 2D variables had dimension order ('x', 'y') and 3D had ('x', 'y', 'z'), meaning that via #45 'x' would always be assigned phony_dim_0, etc.

But the reverse is true for RAMS -- 2D variables are ordered ('x', 'y') and 3D variables are ordered ('z', 'x', 'y'), so in the current method phony_dim_0 is assigned to 'y' for a 2D variable and would be assigned to 'z' for a 3D variable. This wonky ordering has caused many headaches regarding compatibility with other software.

Reading the file header with ncdump, it's clear that netcdf is able to figure this out no problem:

netcdf ccn0-newnudge-A-2017-05-12-030000-g1 {
dimensions:
        phony_dim_0 = 96 ;
        phony_dim_1 = 96 ;
        phony_dim_2 = 200 ;
        phony_dim_3 = 2 ;
        phony_dim_4 = 1 ;
        phony_dim_5 = 5 ;
variables:
        float ACCPA(phony_dim_0, phony_dim_1) ;
        float ACCPD(phony_dim_0, phony_dim_1) ;
        float ACCPG(phony_dim_0, phony_dim_1) ;
        float ACCPH(phony_dim_0, phony_dim_1) ;
        float ACCPP(phony_dim_0, phony_dim_1) ;
        float ACCPR(phony_dim_0, phony_dim_1) ;
        float ACCPS(phony_dim_0, phony_dim_1) ;
        float AGGREGATET(phony_dim_2, phony_dim_0, phony_dim_1) ;
        float AGGRPRISSNOWT(phony_dim_2, phony_dim_0, phony_dim_1) ;
        float AGGRSELFPRIST(phony_dim_2, phony_dim_0, phony_dim_1) ;
        float AGGRSELFSNOWT(phony_dim_2, phony_dim_0, phony_dim_1) ;

And so I'm curious if there's a good way to replicate this behaviour in reference maker.

Time var/dimension

There is no time variable or dimension in RAMS output, the only place where the date/time is recorded is in the filename. I have created a hacky solution to this by running a file list through a glob pattern match and return a list of np.datetime64 -- https://github.com/lsterzinger/ramslibs/blob/4ec5af45e31a217399e9175f1c9c434cf71942ad/ramslibs/data_tools.py#L247-L282 .

I have created another function to add this to a correctly named dimension in an xarray dataset via https://github.com/lsterzinger/ramslibs/blob/4ec5af45e31a217399e9175f1c9c434cf71942ad/ramslibs/data_tools.py#L285-L346

Current error with fsspec-reference-maker

I have included a small sample data file data.zip (initial t=0 file, so most variables are 0). When I try and run this through reference maker, I get the following:

from fsspec_reference_maker.hdf import SingleHdf5ToZarr
import fsspec
import xarray as xr

f = './ccn0-newnudge-A-2017-05-12-030000-g1.h5'

with fsspec.open(f, 'rb') as inf:
    ref = SingleHdf5ToZarr(inf, f, inline_threshold=300).translate()

fs = fsspec.filesystem('reference', fo=ref, remote_protocol='file')
ds = xr.open_dataset(fs.get_mapper(""), engine='zarr')
ValueError: conflicting sizes for dimension 'phony_dim_0': length 200 on 'AGGREGATET' and length 96 on {'phony_dim_0': 'ACCPA', 'phony_dim_1': 'ACCPA'}

Which is due to the ordering of phony dims in #45

lsterzinger avatar Oct 06 '21 21:10 lsterzinger

I pushed some scratch code that I've been working on to https://github.com/intake/fsspec-reference-maker/pull/83 It includes a partially-completed reworking of the combine code that would address your problem, I think. Please have a look.

martindurant avatar Oct 07 '21 14:10 martindurant

#83 looks really cool and I'm interested in looking it over in more detail, but the problem I'm having isn't with the combine code, it's how HDF5 unnamed (phony) dims are being assigned. The way we're doing it currently is that dim[0] is 'phony_dim_0', dim[1] is 'phony_dim_1', etc.

RAMS output breaks this by prepending the vertical dimension, not appending, so what is called "phony_dim_0" by our naming scheme is actually a different dimension in different variables within the same file.

The NetCDF API is able properly recognize this so one variable might be interpreted as

float ACCPS(phony_dim_0, phony_dim_1) ;

and another

float AGGREGATET(phony_dim_2, phony_dim_0, phony_dim_1) ;

Whereas our method would name those dims 0, 1, 2 in that order for each variable

lsterzinger avatar Oct 07 '21 15:10 lsterzinger

Have a look at the attributes of the h5py variable object for each variable: probably the coordinate index values are in a special tag there (by index rather than name).

martindurant avatar Oct 07 '21 15:10 martindurant

Maybe I'm doing this wrong, but

f = h5py.File('./ccn0-newnudge-A-2017-05-12-030000-g1.h5')
var = f['AGGREGATET']
print(var.attrs.keys())

Returns an empty list

<KeysViewHDF5 []>

lsterzinger avatar Oct 07 '21 15:10 lsterzinger

Hm..., the details for netCDF are in there somewhere, it's just a case of finding them!

martindurant avatar Oct 07 '21 15:10 martindurant

I'll do some digging, thanks for pointing me in the right direction!

lsterzinger avatar Oct 07 '21 15:10 lsterzinger