xcdat
[Doc]: Are there some xCDAT test files (that can be pre-downloaded)?
Describe your documentation update
I wonder if there are xCDAT (or xarray) test files that can be (pre)downloaded and used for:
- testing xCDAT with known (and local) files
- examples and tutorials
- having local data files that you can use when you have no network or low bandwidth
I'm thinking of (something like) the cdms2/vcs test data
I think these files are the ones listed in CDMS Sample Dataset and they are still online!
I like this idea, but I'm wondering how this could be implemented in a way that is easy to maintain. Perhaps we could add some functionality to directly download (e.g., from ESGF) example netCDF files (e.g., xcdat.get_test_data())?
I was curious about what xarray does – it seems like they generate toy data rather than providing data.
Should this be a discussion item?
This is the up-to-date link for toy data you mentioned, but I'd rather have data coming from actual netCDF files than toy data generated in memory!
Some not-too-big test data files could come from ESGF, the way I've done it in #284, but we also need a way to get other static/known test data files:
- subsets (e.g. a few time steps) of real ESGF data, because you don't want huge files with all the time steps or vertical levels when there are lots of them. A script using xcdat to download and then save a subset of ESGF data (e.g. the first 10 time steps, and just a few pressure or depth levels of the Northern Hemisphere) would be a useful example anyway
- data with some known errors (e.g. #284, or incorrectly masked data, or incorrect metadata, ...) that you want to be sure xcdat can handle, and also provide example scripts to show how to correct the files and save corrected files
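To make the subsetting idea concrete, here is a minimal sketch using plain xarray indexing. The function name make_small_sample and the default coordinate names (time, plev, lat) are assumptions for illustration, not part of xCDAT, and the demo runs on an in-memory dataset so that it needs no network access.

```python
import numpy as np
import xarray as xr


def make_small_sample(ds, n_times=10, levels=None, lat_name="lat", lev_name="plev"):
    """Keep the first n_times steps, the Northern Hemisphere, and (optionally) a few levels.

    Hypothetical helper: the name and the default coordinate names are assumptions.
    """
    subset = ds.isel(time=slice(0, n_times))
    # Select by mask so this works whether latitudes run south-to-north or north-to-south.
    north = np.flatnonzero(subset[lat_name].values >= 0)
    subset = subset.isel({lat_name: north})
    if levels is not None and lev_name in subset.sizes:
        # Exact-match selection: the requested levels must exist in the file.
        subset = subset.sel({lev_name: levels})
    return subset


# Stand-in for a real file: 24 time steps, 3 pressure levels, 4 latitudes, 2 longitudes.
ds = xr.Dataset(
    {"ta": (("time", "plev", "lat", "lon"), np.zeros((24, 3, 4, 2)))},
    coords={
        "time": np.arange(24),
        "plev": [100000.0, 85000.0, 50000.0],
        "lat": [-45.0, -15.0, 15.0, 45.0],
        "lon": [0.0, 180.0],
    },
)
small = make_small_sample(ds, n_times=10, levels=[85000.0, 50000.0])
```

With real data, ds would instead come from xcdat.open_dataset() on a downloaded file, and small.to_netcdf("some_name.nc") would write the reduced file to disk.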
I have just checked that cartopy mostly generates toy data on the fly for its examples, but iris uses a directory with actual data files (the way vcs and cdms2 did)
>>> import iris
>>> help(iris.sample_data_path)
sample_data_path(*path_to_join)
Given the sample data resource, returns the full path to the file.
.. note::
This function is only for locating files in the iris sample data
collection (installed separately from iris). It is not needed or
appropriate for general file access.
>>> iris.sample_data_path("E1_north_america.nc")
'/home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/iris_sample_data/sample_data/E1_north_america.nc'
ls -lh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/iris_sample_data/sample_data/
total 24M
-rw-rw-r-- 2 jypeter lsce 110K Jun 25 2020 A1B.2098.pp
-rw-rw-r-- 2 jypeter lsce 1.8M Jun 25 2020 A1B_north_america.nc
-rw-rw-r-- 2 jypeter lsce 28K Jun 25 2020 air_temp.pp
-rw-rw-r-- 2 jypeter lsce 34K Jun 25 2020 atlantic_profiles.nc
-rw-rw-r-- 2 jypeter lsce 3.5M Jun 25 2020 colpex.pp
-rw-rw-r-- 2 jypeter lsce 110K Jun 25 2020 E1.2098.pp
-rw-rw-r-- 2 jypeter lsce 1.8M Jun 25 2020 E1_north_america.nc
drwxr-xr-x 2 jypeter lsce 4.0K Sep 10 2021 GloSea4/
-rw-rw-r-- 2 jypeter lsce 662K Jun 25 2020 hybrid_height.nc
-rw-rw-r-- 2 jypeter lsce 7.5M Jun 25 2020 NAME_output.txt
drwxr-xr-x 2 jypeter lsce 4.0K Sep 10 2021 NEMO/
-rw-rw-r-- 2 jypeter lsce 2.0M Jun 25 2020 orca2_votemper.nc
-rw-rw-r-- 2 jypeter lsce 1.7M Jun 25 2020 ostia_monthly.nc
-rw-rw-r-- 2 jypeter lsce 26K Jun 25 2020 polar_stereo.grib2
-rw-rw-r-- 2 jypeter lsce 110K Jun 25 2020 pre-industrial.pp
-rw-rw-r-- 2 jypeter lsce 19K Jun 25 2020 rotated_pole.nc
-rw-rw-r-- 2 jypeter lsce 163K Jun 25 2020 SOI_Darwin.nc
-rw-rw-r-- 2 jypeter lsce 243K Jun 25 2020 space_weather.nc
-rw-rw-r-- 2 jypeter lsce 514K Jun 25 2020 toa_brightness_stereographic.nc
-rw-rw-r-- 2 jypeter lsce 3.3M Jun 25 2020 uk_hires.pp
drwxr-xr-x 2 jypeter lsce 12K Sep 10 2021 UM/
-rw-rw-r-- 2 jypeter lsce 2.4K Jun 25 2020 wind_speed_lake_victoria.pp
Thanks for this, @jypeter. This has been discussed and was in mind, although a GH issue was not opened for it.
I explored a possible implementation similar to xarray. xarray uses a GH repo (https://github.com/pydata/xarray-data) to host test datasets, and provides xarray.tutorial methods to open up the test datasets using a package called pooch.
- https://github.com/fatiando/pooch
- xarray.tutorial.open_dataset()
- Example usage: https://docs.xarray.dev/en/stable/examples/monthly-means.html#Open-the-Dataset
We didn't pursue this idea since xarray supports direct download of data using OPeNDAP. However, I think this idea is worthwhile because it standardizes and streamlines the testing process with easy access to the same real-world datasets.
Hmmm, I had a quick look at the pooch GH page. It looks really nice and fancy but:
- it may be overkill for our purpose, from the end user's point of view. But xCDAT could indeed use it behind the scenes! Or possibly just use requests
- specifying the input files seems a bit complicated, but it's OK if it only happens behind the scenes. The end user should only have to specify a file name, and some xCDAT function should provide the path (either the directory where the file is located, or a full path)
- you have to be careful where the data files are located! I'm not too sure about a cache that usually depends on the user login or something. When, like me, you install a python distribution for multiple users (where the person installing can write, but other users can't), it's convenient to have files installed in a fixed sub directory of the distribution's lib directory. And I hate default cache locations in hidden sub-directories of the users' home dir. We have nightly backups of the home dirs at LSCE, and we archive the interns' home dir when they are finished. I don't want to have backups of hidden test files!
- See also the ongoing https://github.com/SciTools/cartopy/issues/1325 issue about file location and cache problems
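As a rough illustration of the "file name in, full path out" behaviour described above, here is a standard-library-only sketch. Everything named here is hypothetical: the XCDAT_SAMPLE_DATA_DIR environment variable, the sample_data_path function, and the share/xcdat_sample_data location under the Python installation are assumptions for the sake of the example, not existing xCDAT features.

```python
import os
import shutil
import sys
import urllib.request
from pathlib import Path

# Hypothetical environment variable that lets an administrator pick the location.
ENV_VAR = "XCDAT_SAMPLE_DATA_DIR"


def sample_data_dir():
    """Return the directory holding the sample files.

    Uses the environment variable if set, otherwise a fixed sub-directory of the
    Python installation (not a hidden cache in the user's home directory).
    """
    override = os.environ.get(ENV_VAR)
    if override:
        return Path(override)
    return Path(sys.prefix) / "share" / "xcdat_sample_data"


def sample_data_path(name, base_url=None):
    """Return the full path of a sample file, fetching it once if it is missing."""
    target = sample_data_dir() / name
    if target.exists():
        return target
    if base_url is None:
        raise FileNotFoundError(f"{name} not found in {target.parent} and no base_url given")
    target.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first so an interrupted transfer never
    # leaves a truncated file under the final name.
    partial = target.with_name(target.name + ".part")
    url = f"{base_url.rstrip('/')}/{name}"
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target
```

The person installing the distribution would call sample_data_path() once per file with a base_url (while they have write access), and other users would then only need the file name. A library such as pooch could replace the download step without changing what the end user sees.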
Having a dedicated python package with just the data could also be an easy solution: e.g. basemap-data-hires
Another data sample example from xoa
>>> import xoa
>>> xoa.show_data_samples()
gdp-6203641.csv hycom.gdp.u.nc hycom.gdp.v.nc hycom.gdp.h.nc croco.south-africa.surf.nc hycom.cfg croco.cfg gdp.cfg mercator.cfg argo.cfg croco.south-africa.zonal.nc croco.south-africa.meridional.nc ibi-argo-7900573.nc argo-7900573.nc
>>> xoa.get_data_sample('hycom.gdp.u.nc')
'/home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples/hycom.gdp.u.nc'
> du -sh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
1.1M /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
>ls -lh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
total 1.1M
-rw-rw-r-- 2 jypeter lsce 92K Feb 25 09:56 argo-7900573.nc
-rw-rw-r-- 2 jypeter lsce 305 Feb 25 09:56 argo.cfg
-rw-rw-r-- 2 jypeter lsce 714 Feb 25 09:56 croco.cfg
-rw-rw-r-- 2 jypeter lsce 61K Feb 25 09:56 croco.south-africa.meridional.nc
-rw-rw-r-- 2 jypeter lsce 190K Feb 25 09:56 croco.south-africa.surf.nc
-rw-rw-r-- 2 jypeter lsce 61K Feb 25 09:56 croco.south-africa.zonal.nc
-rw-rw-r-- 2 jypeter lsce 43K Feb 25 09:56 gdp-6203641.csv
-rw-rw-r-- 2 jypeter lsce 73 Feb 25 09:56 gdp.cfg
-rw-rw-r-- 2 jypeter lsce 487 Feb 25 09:56 hycom.cfg
-rw-rw-r-- 2 jypeter lsce 174K Feb 25 09:56 hycom.gdp.h.nc
-rw-rw-r-- 2 jypeter lsce 173K Feb 25 09:56 hycom.gdp.u.nc
-rw-rw-r-- 2 jypeter lsce 173K Feb 25 09:56 hycom.gdp.v.nc
-rw-rw-r-- 2 jypeter lsce 71K Feb 25 09:56 ibi-argo-7900573.nc
-rw-rw-r-- 2 jypeter lsce 195 Feb 25 09:56 mercator.cfg
@tomvothecoder was there a plan to have a test suite with just the kind of (few timesteps) data that @jypeter was describing? It seems that CDAT was using the sample_data subdir which enabled testing in the CI envs, similar to what iris appears to do (https://github.com/xCDAT/xcdat/issues/277#issuecomment-1199068571 above)
Note: see example usage of vcs.sample_data + '/tas_mo.nc' in https://github.com/xCDAT/xcdat/issues/310#issuecomment-1212866276
I have added an Easy to use datasets section to my python page, with test/tutorials datasets from several packages
@tomvothecoder It seems that xarray uses xarray.tutorial.load_dataset. Maybe xcdat could have a similar xcdat.tutorial.load_dataset pointing to some useful sample CMIP6 data (and possibly the equivalent CMIP5 data, if somebody wants to make a CMIP5/CMIP6 comparison example)
We need to revisit this issue because we use OPeNDAP for E3SM data in our gallery notebooks. OPeNDAP might not be supported anymore by ESGF2 in the near future, specifically at ANL where E3SM data will be hosted.
Options:
- xarray-data and xarray.tutorial.load_dataset
- PCMDI Metrics Package demo data from webserver