xarray
read ncml files to create multifile datasets
This issue was motivated by a recent conversation with @jdha regarding how they are preparing inputs for regional ocean models. They are currently using ncml with netcdf-java to consolidate and homogenize diverse data sources. But this approach doesn't play well with the xarray / dask stack.
ncml is a standard developed by Unidata for use with their netCDF-java library:
NcML is an XML representation of netCDF metadata, (approximately) the header information one gets from a netCDF file with the "ncdump -h" command.
In addition to describing individual netCDF files, ncml can be used to annotate modifications to netCDF metadata (attributes, dimension names, etc.) and also to aggregate multiple files into a single logical dataset. This is what such an aggregation over an existing dimension looks like in ncml:
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="jan.nc" />
    <netcdf location="feb.nc" />
  </aggregation>
</netcdf>
Obviously this maps very well to xarray's concat operation. Similar aggregations can be defined that map to merge operations.
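To make the mapping concrete, here is a minimal sketch that reads the aggregation above with the Python standard library and reports which xarray operation it would correspond to. The function name `parse_aggregation` and the `AGGREGATION_TO_XARRAY` table are hypothetical, chosen for illustration only; they are not part of xarray or any NcML library, and only the simplest aggregation layout is handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by NcML 2.2 documents (taken from the example above).
NCML_NS = {"ncml": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}

# Hypothetical lookup table: NcML aggregation type -> closest xarray operation.
AGGREGATION_TO_XARRAY = {
    "joinExisting": "concat",  # concatenate along an existing dimension
    "joinNew": "concat",       # concatenate along a newly created dimension
    "union": "merge",          # combine the variables of several files
}

EXAMPLE_NCML = """
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="jan.nc" />
    <netcdf location="feb.nc" />
  </aggregation>
</netcdf>
"""


def parse_aggregation(ncml_text):
    """Extract the aggregation type, dimension and file list from NcML text."""
    root = ET.fromstring(ncml_text)
    agg = root.find("ncml:aggregation", NCML_NS)
    if agg is None:
        raise ValueError("no <aggregation> element found")
    agg_type = agg.get("type")
    return {
        "type": agg_type,
        "dim": agg.get("dimName"),
        "operation": AGGREGATION_TO_XARRAY.get(agg_type),
        "locations": [el.get("location") for el in agg.findall("ncml:netcdf", NCML_NS)],
    }


info = parse_aggregation(EXAMPLE_NCML)
print(info)
```

With xarray installed, the extracted values could then be handed to something like `xr.open_mfdataset(info["locations"], combine="nested", concat_dim=info["dim"])`.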
I think it would be great if we could support the ncml spec in xarray, allowing us to write code like
ds = xr.open_ncml('file.ncml')
This idea has been discussed before in #893. Perhaps its time has finally come.
+1 for adding this to xarray. to_ncml would also be nice to have.
Any updates regarding this?
A while ago @rabernat mentioned that @dopplershift was potentially interested in working on implementing this feature in xarray in https://github.com/pangeo-data/esgf2xarray/issues/1#issuecomment-470707112
I am interested in helping out with getting this feature in xarray. I tried to find Python tools that provide NcML functionality, and the ones I found, namely:
- ncml: https://github.com/ioos/ncml
- pyncml: https://github.com/axiom-data-science/pyncml
seem to be outdated and unmaintained.
In the meantime, I've been experimenting with some basics of NcML: https://nbviewer.jupyter.org/github/NCAR/xncml/blob/master/docs/source/tutorial.ipynb
With guidance, input and feedback on what the API is expected to look like in xarray, I'd be more than happy to work on this moving forward.
I haven't had any time to start on this (and I'm a few more weeks out), so feel free to take a cut. I'm not sure what @shoyer or @rabernat have in mind for API.
I have not thought much about APIs yet.
I'd like to revive this issue.
We're increasingly using NcML aggregations within our THREDDS server to create "logical" datasets. This allows us to fix some non-CF-conforming metadata fields without changing files on disk (which would break syncing with ESGF nodes). More importantly, by aggregating multiple time periods, variables and realizations, we're able to create catalog entries for simulations instead of files, which we expect will greatly facilitate parsing catalog search results. We'd like to offer the same aggregation functionality outside of the THREDDS server.
Ideally, this would be supported right from the netcdf-c library (see https://github.com/Unidata/netcdf-c/issues/1478), but an xarray NcML backend is the second best option. I also imagine that NcML files could be used as a clean mechanism to create Zarr/NCZarr objects, i.e.:
*.nc -> open_ncml -> xr.Dataset -> to_zarr -> Zarr store
@andersy005 In terms of API, I think the need is not so much to create or modify NcML files, but rather to return an xarray.Dataset from an NcML description. My understanding is that open_ncml would be a wrapper around open_mfdataset. My hope is that NcML-based xarray.Dataset objects would behave similarly whether they are created from files on disk through xarray.open_ncml('sim.ncml') or xarray.open_dataset('https://.../thredds/sim.ncml').
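As a rough illustration of the "wrapper around open_mfdataset" idea, the sketch below turns a joinExisting aggregation into arguments for `xr.open_mfdataset`. Everything here is an assumption made for discussion: `open_ncml` does not exist in xarray, `mfdataset_arguments` is a hypothetical helper, relative locations are assumed to resolve against the NcML file's directory, and attribute edits, directory scans and other aggregation types are ignored.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Clark-notation prefix for the NcML 2.2 namespace.
NS = "{http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2}"


def mfdataset_arguments(ncml_text, base_dir="."):
    """Translate a joinExisting aggregation into open_mfdataset arguments."""
    root = ET.fromstring(ncml_text)
    agg = root.find(f"{NS}aggregation")
    if agg is None or agg.get("type") != "joinExisting":
        raise NotImplementedError("this sketch only handles joinExisting")
    # Assumption: relative locations are resolved against base_dir.
    paths = [str(Path(base_dir) / el.get("location")) for el in agg.findall(f"{NS}netcdf")]
    return {"paths": paths, "combine": "nested", "concat_dim": agg.get("dimName")}


def open_ncml(ncml_path, **kwargs):
    """Hypothetical open_ncml: parse the NcML file, then defer to xarray."""
    import xarray as xr  # imported lazily so the parsing helper works without xarray

    ncml_path = Path(ncml_path)
    args = mfdataset_arguments(ncml_path.read_text(), base_dir=ncml_path.parent)
    paths = args.pop("paths")
    # User-supplied keyword arguments take precedence over the derived ones.
    return xr.open_mfdataset(paths, **{**args, **kwargs})
```

A dataset opened this way could then be written out with `ds.to_zarr(...)`, which is the pipeline suggested above. Opening an NcML document served remotely by THREDDS would need a different code path and is not covered by this sketch.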
The THREDDS repo contains a number of unit tests that could be emulated to steer the Python implementation. My understanding is that getting this done could involve a fair amount of work, so I'd like to see who's interested in collaborating on this and maybe schedule a meeting to plan work for this year or the next.
Thanks for reviving this @huard!
FWIW, I think it's best for this sort of utility to live in its own small standalone package, which I have referred to as "xarray-mergetool" in the past. NCML could be one special case of the things it could do. It would also be very useful for intake-esm.
We have also discussed this in https://github.com/NCAR/esm-collection-spec/issues/12
We should have some bandwidth to work on this over the next year via the pangeo-forge project.
This just popped up in my inbox and reminded me of the conversation I had with @rabernat a few years back at a DRAKKAR meeting in France.
I haven't really kept up with things since then, but 6+ years ago we modified one of our Python tools to abstract the IO method from the user by using NCML files as input. Then either the mfdataset or the Unidata netCDF-Java library was used to access local or remote data (single file, directory or aggregation). As there wasn't any native NCML parser in Python, and we had limited time, we ended up using pyjnius (https://github.com/kivy/pyjnius) to call the netCDF-Java class from Python, which gave us access to the directory scan, aggregation functions, etc. from the Java library. It's probably not the most efficient way, but we've been using it ever since. I don't have a huge amount of time (or expertise), but I'm happy to get involved if I can.
It's worth pointing out that you can create FileReferenceSystem JSON to accomplish many of the tasks we used to use NcML for:
- create a single virtual dataset that points to a collection of files
- modify dataset and variable attributes
It also has the nice feature that it makes your dataset faster to work with on the cloud because the map to the data is loaded in one shot!
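For readers unfamiliar with the approach, here is a small sketch of what editing attributes in such a reference JSON might look like. It assumes the comment refers to the version-1 reference format read by fsspec's ReferenceFileSystem (as produced by tools such as kerchunk), where Zarr metadata keys hold JSON strings and chunk keys hold `[url, offset, length]` triples. The helper `set_variable_attrs`, the variable name `temp` and the byte offsets are all made up for illustration, and the reference set is deliberately incomplete (a real one also carries `.zarray` entries).

```python
import json


def set_variable_attrs(reference_set, variable, **new_attrs):
    """Return a copy of a reference set with updated attributes for one variable."""
    refs = dict(reference_set["refs"])
    key = f"{variable}/.zattrs"
    attrs = json.loads(refs.get(key, "{}"))  # metadata is stored as a JSON string
    attrs.update(new_attrs)
    refs[key] = json.dumps(attrs)
    return {**reference_set, "refs": refs}


# Hypothetical, incomplete reference set pointing at two files.
reference_set = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/.zattrs": json.dumps({"units": "degC", "_ARRAY_DIMENSIONS": ["time"]}),
        # Chunk keys map to [url, byte offset, byte length]; the numbers are invented.
        "temp/0": ["jan.nc", 4096, 1024],
        "temp/1": ["feb.nc", 4096, 1024],
    },
}

# Fix a metadata field without touching the files on disk.
patched = set_variable_attrs(reference_set, "temp", units="degree_Celsius")
print(json.loads(patched["refs"]["temp/.zattrs"]))
```

The original files and the original reference set are left untouched; only the in-memory copy of the metadata changes.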
I've got a first draft that parses an NcML document and spits out an xarray.Dataset. It does not cover all the NcML syntax, but the essential elements are there.
It uses xsdata to parse the XML, using a data model automatically generated from the NcML 2-2 schema. I've scraped test files from the netcdf-java repo to create a test suite.
Wondering what's the best place to host the code, tests and test data so others may give it a spin?
Maybe a separate project in xarray-contrib would make sense?
I would be reluctant to add this into Xarray proper if we need a new external dependency for reading XML files.
OK, another option would be to add that to xncml.
@andersy005 What do you think?
@huard, I haven't touched the codebase in that repo for three years 😃... So, I'm happy to transfer the xncml repo to xarray-contrib org and give you and anyone who wants access to it
@andersy005 Sounds good!
Hi everyone, I've hit a problem where I need to read ncml to xarray, which brought me here... Just wondering if there are any updates regarding this?
P.S. xncml is broken at the moment.
Thank you.
I'd assume that xncml has never been released (there's an issue suggesting the release of version 0.1), so obviously there's no package on PyPI. You can try installing from GitHub:
pip install git+https://github.com/xarray-contrib/xncml.git
to see if that gives you something to work with, otherwise I'd wait for any of the devs to get back to you (most likely in the issue you opened on the xncml repo).
Thanks @keewis, that's right. It looks like they are still working on the docs, which was confusing.
That's right. I just did a quick 0.1 release of xncml, most likely rough around the edges. Give it a spin. PRs most welcome.
@rabernat If you're happy with it, this issue can probably be closed.
closing, since anything still missing should be feature requests for xncml