[Doc]: are there some xcdat test files (that can be predownloaded) ?

Open jypeter opened this issue 2 years ago • 9 comments

Describe your documentation update

I wonder if there are xCDAT (or xarray) test files that can be (pre)downloaded and used for:

  • testing xCDAT with known (and local) files
  • examples and tutorials
  • having local data files that you can use when you have no network or low bandwidth

I'm thinking of (something like) the cdms2/vcs test data

I think these files are the ones listed in CDMS Sample Dataset and they are still online!

jypeter avatar Jul 22 '22 14:07 jypeter

I like this idea, but I'm wondering how this could be implemented in a way that is easy to maintain. Perhaps we could add some functionality to directly download example netCDF files (e.g., from ESGF), something like xcdat.get_test_data()?

I was curious about what xarray does – it seems like they generate toy data rather than providing data.

Should this be a discussion item?

pochedls avatar Jul 28 '22 21:07 pochedls

This is the up-to-date link for toy data you mentioned, but I'd rather have data coming from actual netCDF files than toy data generated in memory!

Some not-too-big test data files could come from ESGF, the way I've done it in #284, but we also need a way to get other static/known test data files:

  • subsets (e.g. a few time steps) of real ESGF data, because you don't want huge files when the data has many time steps or vertical levels. A script using xcdat to download and then save a subset of ESGF data (e.g. the first 10 time steps, and just a few pressure or depth levels of the Northern Hemisphere) would be a useful example anyway
  • data with some known errors (e.g. #284, or incorrectly masked data, or incorrect metadata, ...) that you want to be sure xcdat can handle, along with example scripts showing how to correct the files and save the corrected versions
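
For the first kind of file, the subsetting step itself is short. Here is a minimal sketch using plain xarray (not an existing xcdat helper). The function name `subset_for_testing`, the dimension names `time`/`lat`/`plev` and the default values are all assumptions for illustration; a real script would first open a file or URL with `xr.open_dataset`.

```python
import numpy as np
import xarray as xr


def subset_for_testing(ds, n_time=10, lat_min=0.0, levels=None,
                       time_dim="time", lat_dim="lat", lev_dim="plev"):
    """Return a small subset of ds (hypothetical helper, not part of xcdat).

    Keeps the first n_time time steps, the latitudes >= lat_min and,
    if requested, only the listed vertical levels.
    """
    # First n_time steps along the time dimension
    sub = ds.isel({time_dim: slice(0, n_time)})

    # Integer positions of the wanted latitudes; this works whether the
    # latitude axis is stored south-to-north or north-to-south
    keep = np.flatnonzero(sub[lat_dim].values >= lat_min)
    sub = sub.isel({lat_dim: keep})

    # Optional level selection; the values must match the coordinate exactly
    if levels is not None and lev_dim in sub.dims:
        sub = sub.sel({lev_dim: levels})
    return sub
```

The result could then be written with `sub.to_netcdf("some_small_file.nc")`, where the file name is just a placeholder.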

I have just checked that cartopy mostly generates toy data on the fly for its examples, but iris uses a directory with actual data files (the way vcs and cdms2 did)

>>> import iris
>>> help(iris.sample_data_path)
sample_data_path(*path_to_join)
    Given the sample data resource, returns the full path to the file.

    .. note::

        This function is only for locating files in the iris sample data
        collection (installed separately from iris). It is not needed or
        appropriate for general file access.

>>> iris.sample_data_path("E1_north_america.nc")
'/home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/iris_sample_data/sample_data/E1_north_america.nc'

ls -lh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/iris_sample_data/sample_data/
total 24M
-rw-rw-r-- 2 jypeter lsce 110K Jun 25  2020 A1B.2098.pp
-rw-rw-r-- 2 jypeter lsce 1.8M Jun 25  2020 A1B_north_america.nc
-rw-rw-r-- 2 jypeter lsce  28K Jun 25  2020 air_temp.pp
-rw-rw-r-- 2 jypeter lsce  34K Jun 25  2020 atlantic_profiles.nc
-rw-rw-r-- 2 jypeter lsce 3.5M Jun 25  2020 colpex.pp
-rw-rw-r-- 2 jypeter lsce 110K Jun 25  2020 E1.2098.pp
-rw-rw-r-- 2 jypeter lsce 1.8M Jun 25  2020 E1_north_america.nc
drwxr-xr-x 2 jypeter lsce 4.0K Sep 10  2021 GloSea4/
-rw-rw-r-- 2 jypeter lsce 662K Jun 25  2020 hybrid_height.nc
-rw-rw-r-- 2 jypeter lsce 7.5M Jun 25  2020 NAME_output.txt
drwxr-xr-x 2 jypeter lsce 4.0K Sep 10  2021 NEMO/
-rw-rw-r-- 2 jypeter lsce 2.0M Jun 25  2020 orca2_votemper.nc
-rw-rw-r-- 2 jypeter lsce 1.7M Jun 25  2020 ostia_monthly.nc
-rw-rw-r-- 2 jypeter lsce  26K Jun 25  2020 polar_stereo.grib2
-rw-rw-r-- 2 jypeter lsce 110K Jun 25  2020 pre-industrial.pp
-rw-rw-r-- 2 jypeter lsce  19K Jun 25  2020 rotated_pole.nc
-rw-rw-r-- 2 jypeter lsce 163K Jun 25  2020 SOI_Darwin.nc
-rw-rw-r-- 2 jypeter lsce 243K Jun 25  2020 space_weather.nc
-rw-rw-r-- 2 jypeter lsce 514K Jun 25  2020 toa_brightness_stereographic.nc
-rw-rw-r-- 2 jypeter lsce 3.3M Jun 25  2020 uk_hires.pp
drwxr-xr-x 2 jypeter lsce  12K Sep 10  2021 UM/
-rw-rw-r-- 2 jypeter lsce 2.4K Jun 25  2020 wind_speed_lake_victoria.pp

jypeter avatar Jul 29 '22 09:07 jypeter

Thanks for this @jypeter. This has been discussed and was on our minds, although no GH issue was opened for it.

I explored a possible implementation similar to xarray. xarray uses a GH repo (https://github.com/pydata/xarray-data) to host test datasets, and provides xarray.tutorial methods to open up the test datasets using a package called pooch.

  • https://github.com/fatiando/pooch
  • xarray.tutorial.open_dataset()
    • Example usage: https://docs.xarray.dev/en/stable/examples/monthly-means.html#Open-the-Dataset

We didn't pursue this idea since xarray supports direct access to remote data using OPeNDAP. However, I think this idea is worthwhile because it standardizes and streamlines the testing process with easy access to the same real-world datasets.
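
The pattern that pooch automates (a registry of file names and checksums, plus a local cache that is only refreshed when needed) can be sketched with the standard library alone. This is a stand-in to show the idea, not pooch's API; `fetch`, `sha256_of` and their arguments are hypothetical names.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(name, registry, base_url, cache_dir):
    """Return a local path to `name`, downloading it only if needed.

    registry maps file names to expected SHA-256 digests; base_url must
    end with "/". Both are supplied by the caller (assumed layout).
    """
    if name not in registry:
        raise KeyError(f"unknown sample file: {name!r}")
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name

    # Reuse the cached copy when it exists and its checksum matches
    if target.exists() and sha256_of(target) == registry[name]:
        return target

    # Otherwise download, then verify before handing the path back
    urllib.request.urlretrieve(base_url + name, str(target))
    if sha256_of(target) != registry[name]:
        target.unlink()
        raise ValueError(f"checksum mismatch for {name!r}")
    return target
```

The end user would only ever pass a file name; the registry, base URL and cache directory would live inside the package.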

tomvothecoder avatar Aug 01 '22 23:08 tomvothecoder

Hmmm, I had a quick look at the pooch GH page. It looks really nice and fancy but:

  • it may be overkill for our purpose, from the end user's point of view. But xCDAT could indeed use it behind the scenes! Or possibly just use requests
  • specifying the input files seems a bit complicated, but that's OK if it only happens behind the scenes. The end user should only have to specify a file name, and some xCDAT function should provide the path (either the directory where the file is located, or a full path)
  • you have to be careful where the data files are located! I'm not too sure about a cache that usually depends on the user login or something. When, like me, you install a python distribution for multiple users (where the person installing can write, but other users can't), it's convenient to have files installed in a fixed sub-directory of the distribution's lib directory. And I hate default cache locations in hidden sub-directories of the users' home dir. We have nightly backups of the home dirs at LSCE, and we archive the interns' home dirs when they are finished. I don't want to have backups of hidden test files!
  • See also the https://github.com/SciTools/cartopy/issues/1325 ongoing issue about file location and cache problems
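
One way to address the location concern is to resolve the sample data directory in a fixed order, so that a shared install only falls back to a per-user cache as a last resort. A minimal sketch follows; the `XCDAT_DATA_DIR` variable, the `share/xcdat/sample_data` layout and the function name are assumptions, not anything xCDAT defines.

```python
import os
import sys
from pathlib import Path


def sample_data_dir(env_var="XCDAT_DATA_DIR", package="xcdat"):
    """Pick the directory holding sample files (hypothetical helper).

    Order: explicit environment override, then a fixed directory inside
    the Python distribution, then a per-user cache as a last resort.
    """
    # 1. An administrator or user can force a location
    override = os.environ.get(env_var)
    if override:
        return Path(override)

    # 2. A directory shipped inside the (possibly read-only) distribution
    shared = Path(sys.prefix) / "share" / package / "sample_data"
    if shared.is_dir():
        return shared

    # 3. Fallback: per-user cache directory
    return Path.home() / ".cache" / package
```

With this order, whoever installs the distribution can pre-populate the shared directory once, and the other users never write anything into their home dirs.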

Having a dedicated python package with just the data could also be an easy solution: e.g. basemap-data-hires

jypeter avatar Aug 02 '22 08:08 jypeter

Another data sample example from xoa

>>> import xoa

>>> xoa.show_data_samples()
gdp-6203641.csv hycom.gdp.u.nc hycom.gdp.v.nc hycom.gdp.h.nc croco.south-africa.surf.nc hycom.cfg croco.cfg gdp.cfg mercator.cfg argo.cfg croco.south-africa.zonal.nc croco.south-africa.meridional.nc ibi-argo-7900573.nc argo-7900573.nc

>>> xoa.get_data_sample('hycom.gdp.u.nc')
'/home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples/hycom.gdp.u.nc'

> du -sh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
1.1M    /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples

> ls -lh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
total 1.1M
-rw-rw-r-- 2 jypeter lsce  92K Feb 25 09:56 argo-7900573.nc
-rw-rw-r-- 2 jypeter lsce  305 Feb 25 09:56 argo.cfg
-rw-rw-r-- 2 jypeter lsce  714 Feb 25 09:56 croco.cfg
-rw-rw-r-- 2 jypeter lsce  61K Feb 25 09:56 croco.south-africa.meridional.nc
-rw-rw-r-- 2 jypeter lsce 190K Feb 25 09:56 croco.south-africa.surf.nc
-rw-rw-r-- 2 jypeter lsce  61K Feb 25 09:56 croco.south-africa.zonal.nc
-rw-rw-r-- 2 jypeter lsce  43K Feb 25 09:56 gdp-6203641.csv
-rw-rw-r-- 2 jypeter lsce   73 Feb 25 09:56 gdp.cfg
-rw-rw-r-- 2 jypeter lsce  487 Feb 25 09:56 hycom.cfg
-rw-rw-r-- 2 jypeter lsce 174K Feb 25 09:56 hycom.gdp.h.nc
-rw-rw-r-- 2 jypeter lsce 173K Feb 25 09:56 hycom.gdp.u.nc
-rw-rw-r-- 2 jypeter lsce 173K Feb 25 09:56 hycom.gdp.v.nc
-rw-rw-r-- 2 jypeter lsce  71K Feb 25 09:56 ibi-argo-7900573.nc
-rw-rw-r-- 2 jypeter lsce  195 Feb 25 09:56 mercator.cfg

jypeter avatar Aug 02 '22 09:08 jypeter

@tomvothecoder was there a plan to have a test suite with just the kind of (few timesteps) data that @jypeter was describing? It seems that CDAT was using the sample_data subdir which enabled testing in the CI envs, similar to what iris appears to do (https://github.com/xCDAT/xcdat/issues/277#issuecomment-1199068571 above)

durack1 avatar Aug 10 '22 00:08 durack1

Note: see example usage of vcs.sample_data + '/tas_mo.nc' in https://github.com/xCDAT/xcdat/issues/310#issuecomment-1212866276

jypeter avatar Aug 12 '22 08:08 jypeter

I have added an Easy to use datasets section to my python page, with test/tutorials datasets from several packages

@tomvothecoder It seems that xarray uses xarray.tutorial.load_dataset. Maybe xcdat could have a similar xcdat.tutorial.load_dataset pointing to some useful sample CMIP6 data (and possibly the equivalent CMIP5 data, if somebody wants to make a CMIP5/CMIP6 comparison example)

jypeter avatar Dec 14 '23 16:12 jypeter

We need to revisit this issue because we use OPeNDAP for E3SM data in our gallery notebooks. OPeNDAP might not be supported anymore by ESGF2 in the near future, specifically at ANL where E3SM data will be hosted.

Options:

tomvothecoder avatar Sep 25 '24 18:09 tomvothecoder