pydata/xarray #2697

read ncml files to create multifile datasets

Open rabernat opened this issue 5 years ago • 13 comments

This issue was motivated by a recent conversation with @jdha regarding how they are preparing inputs for regional ocean models. They are currently using ncml with netcdf-java to consolidate and homogenize diverse data sources. But this approach doesn't play well with the xarray / dask stack.

NcML is a standard developed by Unidata for use with their netCDF-Java library:

NcML is an XML representation of netCDF metadata, (approximately) the header information one gets from a netCDF file with the "ncdump -h" command.

In addition to describing individual netCDF files, ncml can be used to annotate modifications to netCDF metadata (attributes, dimension names, etc.) and also to aggregate multiple files into a single logical dataset. This is what such an aggregation over an existing dimension looks like in ncml:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="jan.nc" />
    <netcdf location="feb.nc" />
  </aggregation>
</netcdf>

Obviously this maps very well to xarray's concat operation. Similar aggregations can be defined that map to merge operations.

I think it would be great if we could support the ncml spec in xarray, allowing us to write code like

ds = xr.open_ncml('file.ncml')
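
A minimal sketch of what such a function might do for the joinExisting example above, assuming only a single top-level aggregation. The helper names parse_join_existing and open_ncml are hypothetical, and relative locations are not resolved against the NcML file's directory. Only the standard library is needed for the parsing step; xarray is imported lazily in the wrapper.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the NcML snippet above.
NCML_NS = {"nc": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}


def parse_join_existing(ncml_text):
    """Return (dim_name, [locations]) for a simple joinExisting aggregation."""
    root = ET.fromstring(ncml_text)
    agg = root.find("nc:aggregation", NCML_NS)
    if agg is None or agg.get("type") != "joinExisting":
        raise ValueError("only a top-level joinExisting aggregation is handled")
    locations = [el.get("location") for el in agg.findall("nc:netcdf", NCML_NS)]
    return agg.get("dimName"), locations


def open_ncml(path):
    """Hypothetical wrapper: concatenate the listed files along dimName."""
    import xarray as xr  # imported here so the parser works without xarray

    with open(path) as f:
        dim, files = parse_join_existing(f.read())
    # joinExisting corresponds to concatenation along an existing dimension.
    return xr.open_mfdataset(files, combine="nested", concat_dim=dim)


EXAMPLE = """<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="jan.nc" />
    <netcdf location="feb.nc" />
  </aggregation>
</netcdf>"""

dim, files = parse_join_existing(EXAMPLE)
print(dim, files)
```

A real implementation would also have to cover the other aggregation types, directory scans and nested netcdf elements, none of which this sketch attempts.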

This idea has been discussed before in #893. Perhaps its time has finally come.

rabernat avatar Jan 22 '19 17:01 rabernat

+1 for adding this to xarray. to_ncml would also be nice to have.

shoyer avatar Jan 22 '19 18:01 shoyer

Any updates regarding this?

A while ago @rabernat mentioned that @dopplershift was potentially interested in working on implementing this feature in xarray in https://github.com/pangeo-data/esgf2xarray/issues/1#issuecomment-470707112

I am interested in helping to get this feature into xarray. I looked for Python tools that provide NcML functionality, and the ones I found, namely

  • ncml: https://github.com/ioos/ncml
  • pyncml: https://github.com/axiom-data-science/pyncml

seem to be outdated and unmaintained.

In the meantime, I've been experimenting with some basics of NcML: https://nbviewer.jupyter.org/github/NCAR/xncml/blob/master/docs/source/tutorial.ipynb

With guidance, input and feedback on what the API is expected to look like in xarray, I'd be more than happy to work on this moving forward.

andersy005 avatar Apr 17 '19 19:04 andersy005

I haven't had any time to start on this (and I'm a few more weeks out), so feel free to take a cut. I'm not sure what @shoyer or @rabernat have in mind for API.

dopplershift avatar Apr 19 '19 03:04 dopplershift

I have not thought much about APIs yet.

shoyer avatar Apr 19 '19 04:04 shoyer

I'd like to revive this issue. We're increasingly using NcML aggregations within our THREDDS server to create "logical" datasets. This allows us to fix some non-CF-conforming metadata fields without changing files on disk (which would break syncing with ESGF nodes). More importantly, by aggregating multiple time periods, variables and realizations, we're able to create catalog entries for simulations instead of files, which we expect will greatly facilitate parsing catalog search results. We'd like to offer the same aggregation functionality outside of the THREDDS server. Ideally, this would be supported right from the netcdf-c library (see https://github.com/Unidata/netcdf-c/issues/1478), but an xarray NcML backend is the second-best option. I also imagine that NcML files could be used as a clean mechanism to create Zarr/NCZarr objects, i.e.: *.nc -> open_ncml -> xr.Dataset -> to_zarr -> Zarr store
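
To illustrate the metadata-fixing use case, here is a rough sketch of how attribute overrides declared in NcML could be collected with the standard library. The function name collect_attribute_overrides, the file name sim.nc, the variable tas and the attribute values are all made up for the example. The NcML type attribute is ignored, so every value comes back as a string.

```python
import xml.etree.ElementTree as ET

NCML_NS = {"nc": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}


def collect_attribute_overrides(ncml_text):
    """Return (global_attrs, {variable_name: attrs}) declared in an NcML document."""
    root = ET.fromstring(ncml_text)
    # <attribute> elements directly under the root apply to the dataset itself.
    global_attrs = {
        a.get("name"): a.get("value") for a in root.findall("nc:attribute", NCML_NS)
    }
    # <attribute> elements nested in a <variable> apply to that variable.
    variable_attrs = {}
    for var in root.findall("nc:variable", NCML_NS):
        attrs = {
            a.get("name"): a.get("value") for a in var.findall("nc:attribute", NCML_NS)
        }
        if attrs:
            variable_attrs[var.get("name")] = attrs
    return global_attrs, variable_attrs


# Illustrative document only: names and values are assumptions.
EXAMPLE_NCML = """<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="sim.nc">
  <attribute name="Conventions" value="CF-1.8" />
  <variable name="tas">
    <attribute name="units" value="K" />
  </variable>
</netcdf>"""

global_attrs, variable_attrs = collect_attribute_overrides(EXAMPLE_NCML)
print(global_attrs, variable_attrs)
```

On an already opened xarray.Dataset, the result could then be applied in memory with ds.attrs.update(global_attrs) and ds[name].attrs.update(attrs), leaving the files on disk untouched.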

@andersy005 In terms of API, I think the need is not so much to create or modify NcML files, but rather to return an xarray.Dataset from an NcML description. My understanding is that open_ncml would be a wrapper around open_mfdataset. My hope is that NcML-based xarray.Dataset objects would behave similarly whether they are created from files on disk through xarray.open_ncml('sim.ncml') or xarray.open_dataset('https://.../thredds/sim.ncml').

The THREDDS repo contains a number of unit tests that could be emulated to steer the Python implementation. My understanding is that getting this done could involve a fair amount of work, so I'd like to see who's interested in collaborating on this and maybe schedule a meeting to plan work for this year or the next.

huard avatar Sep 03 '20 14:09 huard

Thanks for reviving this @huard!

FWIW, I think it's best for this sort of utility to live in its own small standalone package, which I have referred to as "xarray-mergetool" in the past. NcML could be one special case of the things it could do. It would also be very useful for intake-esm.

We have also discussed this in https://github.com/NCAR/esm-collection-spec/issues/12

We should have some bandwidth to work on this over the next year via the pangeo-forge project.

rabernat avatar Sep 03 '20 14:09 rabernat

This just popped up in my inbox and reminded me of the conversation I had with @rabernat a few years back at a DRAKKAR meeting in France.

I haven't really kept up with things since then, but 6+ years ago we modified one of our Python tools to abstract the IO method from the user by using NcML files as input. Then either mfdataset or the Unidata netCDF-Java library was used to access local or remote data (single file, directory or aggregation). As there wasn't any native NcML parser in Python, and we had limited time, we ended up using pyjnius (https://github.com/kivy/pyjnius) to call the netCDF-Java classes from Python, which gave us access to the directory scan, aggregation functions, etc. from the Java library. It's probably not the most efficient way, but we've been using it ever since. I don't have a huge amount of time (or expertise), but I'm happy to get involved if I can.


jdha avatar Sep 03 '20 15:09 jdha

It's worth pointing out that you can create ReferenceFileSystem JSON (fsspec) to accomplish many of the tasks we used to use NcML for:

  • create a single virtual dataset that points to a collection of files
  • modify dataset and variable attributes

It also has the nice feature that it makes your dataset faster to work with on the cloud because the map to the data is loaded in one shot!
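
As a rough sketch of the second point, assuming the "version 1" reference layout in which metadata entries such as tas/.zattrs are stored as JSON strings and chunk entries as [url, offset, length] lists: the file names, byte offsets, variable name and the helper update_attrs below are invented for illustration, and the reference set is deliberately incomplete (no .zarray entries), so it shows the editing pattern only.

```python
import json

# Hypothetical, incomplete reference set: two chunks of one variable,
# each pointing at a byte range in a different file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        ".zattrs": json.dumps({"title": "raw title"}),
        "tas/.zattrs": json.dumps({"units": "kelvin", "_ARRAY_DIMENSIONS": ["time"]}),
        "tas/0": ["jan.nc", 1000, 400],
        "tas/1": ["feb.nc", 1000, 400],
    },
}


def update_attrs(reference_set, key, **changes):
    """Rewrite the JSON-encoded attributes stored under `key` in place."""
    attrs = json.loads(reference_set["refs"][key])
    attrs.update(changes)
    reference_set["refs"][key] = json.dumps(attrs)


# Fix a variable attribute and a dataset attribute without touching the data files.
update_attrs(refs, "tas/.zattrs", units="K")
update_attrs(refs, ".zattrs", title="corrected title")

print(json.dumps(refs, indent=2))
```

Only the small JSON document changes; the chunk entries still point at the original bytes in jan.nc and feb.nc.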

rsignell-usgs avatar May 05 '21 15:05 rsignell-usgs

I've got a first draft that parses an NcML document and spits out an xarray.Dataset. It does not cover all the NcML syntax, but the essential elements are there.

It uses xsdata (https://xsdata.readthedocs.io/en/latest/) to parse the XML, using a data model automatically generated from the NcML 2-2 schema. I've scraped test files from the netcdf-java repo (https://github.com/Unidata/netcdf-java) to create a test suite.

What would be the best place to host the code, tests and test data so others can give it a spin?

huard avatar Jul 06 '22 21:07 huard

Maybe a separate project in xarray-contrib would make sense?

I would be reluctant to add this into Xarray proper if we need a new external dependency for reading XML files.

shoyer avatar Jul 06 '22 22:07 shoyer

OK, another option would be to add that to xncml.

@andersy005 What do you think?

huard avatar Jul 06 '22 23:07 huard

@huard, I haven't touched the codebase in that repo for three years 😃, so I'm happy to transfer the xncml repo to the xarray-contrib org and give access to you and anyone else who wants it.

andersy005 avatar Jul 06 '22 23:07 andersy005

@andersy005 Sounds good!

huard avatar Jul 07 '22 12:07 huard

Hi everyone, I've hit a problem where I need to read ncml to xarray, which brought me here... Just wondering if there are any updates regarding this?

P.S. xncml is broken at the moment.

Thank you.

vietnguyengit avatar Nov 24 '22 13:11 vietnguyengit

I'd assume that xncml has never been released (there's an issue suggesting the release of version 0.1), so there's no package on PyPI. You can try installing from GitHub:

pip install git+https://github.com/xarray-contrib/xncml.git

to see if that gives you something to work with. Otherwise I'd wait for one of the devs to get back to you (most likely in the issue you opened on the xncml repo).

keewis avatar Nov 24 '22 13:11 keewis

Thanks @keewis, that's right. It looks like they are still working on the docs, which was confusing.

vietnguyengit avatar Nov 24 '22 13:11 vietnguyengit

That's right. I just did a quick 0.1 release of xncml, which is most likely rough around the edges. Give it a spin. PRs are most welcome.

@rabernat If you're happy with it, this issue can probably be closed.

huard avatar Nov 24 '22 14:11 huard

Closing, since anything still missing should be filed as a feature request for xncml.

keewis avatar May 29 '23 13:05 keewis