Example pipeline for SWOT-Xover
Source Dataset
SWOT-Xover is a set of regional subsets of several basin-scale model outputs at ~1/50° resolution, comprising hourly surface and daily interior data. The subsets cover the cross-over regions of the SWOT fast-sampling phase.
- Project description is given here
- File format: zarr
- Organization of files: one file each for six months of surface data and six months of interior data (i.e. two files per model per region).
- File access: automating the zarrification of datasets pulled from FTP servers.
Transformation / Alignment / Merging
Files should be concatenated along the time dimension.
Output Dataset
The zarrification of the data should be automated via the Pangeo Forge pipeline, following a pangeo-forge recipe. To facilitate the automation, we would ask each modelling group to provide their outputs in netCDF4 format and make them available via an FTP server.
A single monthly file of daily-averaged 3D data of u, v, w, T & S in one region is ~30 GB. With the four regions, six months, and five models, this sums up to ~3.6 TB in total on cloud storage. The chunks of the Zarr dataset will be on the order of {'time': 30, 'z': 5, 'y': 100, 'x': 100}.
For the surface, a single monthly file of hourly-averaged data of SST, SSS, SSH, wind stress & buoyancy fluxes in one region is ~380 MB. With the four regions, six months, and five models, this sums up to ~45 GB. The chunks of the Zarr dataset will be on the order of {'time': 100, 'y': 100, 'x': 100}.
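For concreteness, a rough sketch of what such a recipe might look like with pangeo-forge-recipes is given below; the URL template, region, and date range are placeholders rather than any group's actual server layout.

import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Hypothetical URL template -- each modelling group's FTP server and filenames will differ.
def make_url(time):
    return (
        "ftp://example-model-center.org/swot_adac/"
        f"Region01-surface-hourly_{time:%Y-%m}.nc"
    )

# One monthly input file per key, concatenated along the time dimension.
months = pd.date_range("2010-02-01", "2010-07-01", freq="MS")
pattern = FilePattern(make_url, ConcatDim("time", months))

recipe = XarrayZarrRecipe(
    pattern,
    target_chunks={"time": 100, "y": 100, "x": 100},  # surface chunking from above
)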
Can you provide more details about the input files? How big are they? What URLs will we use to download them?
I was hoping that each modelling group could upload their zarrified data to the Wasabi cloud storage...
Then this is not a Pangeo Forge pipeline. The point of Pangeo Forge is to automatically put together the Zarr in the cloud.
What you propose is fine--it's just not part of Pangeo Forge. Let's leave this open for now as we figure out the best path forward.
I think we're going to try the Pangeo Forge pipeline for the eNATL60 data. Depending on how this goes, we may recommend that other modelling centers follow the pipeline.
Great! To move forward, we need some more details about exactly where to find the data and how it is formatted. Please edit your original issue to conform to the template (https://github.com/pangeo-forge/staged-recipes/issues).
Yes, I'm still working on extracting the cross-over regions (which surprisingly takes time dealing with massive netCDF files) but I will update the details as soon as I get this hashed out.
(which surprisingly takes time dealing with massive netCDF files)
If only there were a better format! 🤣 😉
This is getting a bit ahead of ourselves, but if we ask the modelling groups to provide their data via FTP or OPeNDAP links for the pangeo-forge pipeline, would the "computation" costs of uploading them to the cloud come out of the payments we'll be making to 2i2c? I'm only asking because I think it would be best if we could reduce the amount of hassle each modelling group goes through. The idea I had in mind was to develop the pipeline on the SWOT-AdAC JupyterHub.
@rabernat I added a bit more detail in the output data section. Is this sufficient?
Is this sufficient?
Can you provide an actual working FTP link to one of the datasets?
would the "computation" costs to upload them to the cloud come out from the payments we'll be making to 2i2c?
No, they will be supported by pangeo forge and our NSF grant. 2i2c is for the jupyterhub.
A single monthly file of daily-averaged 3D data of u, v, w, T & S in one region is ~30 GB.
This will require https://github.com/pangeo-forge/pangeo-forge/issues/49, a feature that is not yet implemented. We are working on it.
Can you provide an actual working FTP link to one of the datasets?
Sorry for the delayed response. Here is a working link: https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/catalog/meomopendap/extract/SWOT-Adac/Interior/eNATL60/catalog.html
Here is another working download link, for INALT60: https://data.geomar.de/downloads/20.500.12085/0e95d316-f1ba-47e3-b667-fc800afafe22/data/
Ok thanks for these. Will have a look soon.
I talked with @lesommer, and we decided to try putting this data in OSN for now.
The eNATL60 regional outputs for regions 1-3 are now all available here: https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/catalog/meomopendap/extract/SWOT-Adac/catalog.html
FYI, that server is giving SSL certificate errors.
$ curl -I https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/meomopendap/extract/SWOT-Adac/Surface/eNATL60/Region03-surface-hourly_2010-04.nc
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html
Would it be possible to get this fixed?
@auraoupa Do you know why this is happening...?
Yes, it is a known issue with our OPeNDAP server (expired certificate); we get around it by adding --no-check-certificate to our wget commands. It would be --insecure for curl (I did not try it), but it may be more efficient (and cleaner) to have it fixed... I'll try to make it happen!
We need to get the files via fsspec and unfortunately I don't (yet) know how to work around the certificate error...but there must be a way! I'll try to dig deeper on my end too.
Now it looks like the server https://ige-meom-opendap.univ-grenoble-alpes.fr/ is down completely? This is making it hard to develop the recipe.
Sorry about that, it should be OK now. About the certificate, the University says they should be fixing it soon! I'll keep you posted.
The certificate is now valid; I hope it helps with the development of the recipe!
Success!
import fsspec
import xarray as xr

url = 'https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/meomopendap/extract/SWOT-Adac/Surface/eNATL60/Region03-surface-hourly_2010-04.nc'
with fsspec.open(url) as fp:
    ds = xr.open_dataset(fp)
    display(ds)
<xarray.Dataset>
Dimensions: (time_counter: 720, x: 574, y: 675)
Coordinates:
nav_lon (y, x) float32 ...
nav_lat (y, x) float32 ...
time_centered (time_counter) datetime64[ns] 2010-04-01T00:30:00 ... 2010...
* time_counter (time_counter) datetime64[ns] 2010-04-01T00:30:00 ... 2010...
depth (y, x) float32 ...
lat (y, x) float32 ...
lon (y, x) float32 ...
e1t (y, x) float64 ...
e2t (y, x) float64 ...
e1f (y, x) float64 ...
e2f (y, x) float64 ...
e1u (y, x) float64 ...
e2u (y, x) float64 ...
e1v (y, x) float64 ...
e2v (y, x) float64 ...
Dimensions without coordinates: x, y
Data variables:
sossheig (time_counter, y, x) float32 ...
sozocrtx (time_counter, y, x) float32 ...
somecrty (time_counter, y, x) float32 ...
sosstsst (time_counter, y, x) float32 ...
sosaline (time_counter, y, x) float32 ...
sozotaux (time_counter, y, x) float32 ...
sometauy (time_counter, y, x) float32 ...
qt_oce (time_counter, y, x) float32 ...
sowaflup (time_counter, y, x) float32 ...
tmask (y, x) int8 ...
umask (y, x) int8 ...
vmask (y, x) int8 ...
fmask (y, x) int8 ...
Sorry, I missed this. This is great news! Could you let us know what the status is regarding the data storage on OSN, @rabernat?
The status is that I'm still working on it. I hope to be able to start ingesting data soon (next week). I'm deeply sorry for the delays and I thank you for your patience.
thanks for all your work with this @rabernat !
I started a PR #24 for the recipe.
@rabernat Could we prioritize pushing the surface data to the cloud for all available models (in #26, #27, #29) before the interior 3D data? Since we have a few different models ready to push, I think there are already a few inter-model analyses that could be done with just the surface data :)
@rabernat @cisaacstern I've started analyzing the SWOT-AdAC data (#24, #26, #29) on a Google Cloud-based JupyterHub, but does the OSN storage also support storing of analysis data?
does the OSN storage also support storing of analysis data?
No, we cannot provide write access to OSN.
Can you explain more about the use case you have in mind? How much data do you imagine needing to write? Does it need to be shared across users?
For writing data, you have a few options:
- Store data in your jupyter home directory (suitable for smallish data; not accessible from dask workers)
- Ask 2i2c to set up a shared NFS storage volume that is accessible from the dask workers (suitable for medium data)
- Ask 2i2c to set up a pangeo-style scratch bucket in Google Cloud Storage (suitable for big data; see the sketch below)
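To make the scratch-bucket option concrete, here is a minimal sketch of writing an analysis result to a hypothetical Pangeo-style scratch bucket; the bucket name is a placeholder (2i2c would provision the real one, and the hub would normally have write credentials configured already).

import gcsfs
import xarray as xr

# Hypothetical scratch bucket -- replace with whatever 2i2c provisions.
fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("gs://my-scratch-bucket/swot_adac/analysis_example.zarr")

ds = xr.Dataset({"example": ("x", [1.0, 2.0, 3.0])})  # stand-in for an analysis result
ds.to_zarr(store, mode="w")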
I believe all of the surface datasets are now on OSN. Returning to this main thread to provide a high-level "flyover" of how it's organized. Note that below, `fs_osn` and `swot` are always defined as:
import s3fs
endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url},)
swot = "Pangeo/pangeo-forge/swot_adac"
💺 Fasten your seatbelt, this will be a long one!
INALT60 #26
fs_osn.ls(f"{swot}/INALT60")
['Pangeo/pangeo-forge/swot_adac/INALT60/grid.zarr',
'Pangeo/pangeo-forge/swot_adac/INALT60/surf_flux_1d.zarr',
'Pangeo/pangeo-forge/swot_adac/INALT60/surf_ocean_4h.zarr',
'Pangeo/pangeo-forge/swot_adac/INALT60/surf_ocean_5d.zarr']
We currently have a single Zarr store for each surface dataset. The time dimension of these data is non-contiguous, as seen in the recipe here. If it's useful, I can separate each of these surface datasets into separate seasonal stores, as demonstrated in the other recipes below.
GIGATL #27
fs_osn.ls(f"{swot}/GIGATL")
['Pangeo/pangeo-forge/swot_adac/GIGATL/Region01',
'Pangeo/pangeo-forge/swot_adac/GIGATL/Region02',
'Pangeo/pangeo-forge/swot_adac/GIGATL/surf_reg_01.zarr']
@roxyboy, unless you need it for something, I will delete `surf_reg_01.zarr`, which is missing the input for Jan 28, as you identified in https://github.com/pangeo-forge/staged-recipes/pull/27#issuecomment-853104775.
For each region's surface data, there are both `aso` (Aug, Sep, Oct) and `fma` (Feb, Mar, Apr) stores:
fs_osn.ls(f"{swot}/GIGATL/Region01/surf")
['Pangeo/pangeo-forge/swot_adac/GIGATL/Region01/surf/aso.zarr',
'Pangeo/pangeo-forge/swot_adac/GIGATL/Region01/surf/fma.zarr']
fs_osn.ls(f"{swot}/GIGATL/Region02/surf")
['Pangeo/pangeo-forge/swot_adac/GIGATL/Region02/surf/aso.zarr',
'Pangeo/pangeo-forge/swot_adac/GIGATL/Region02/surf/fma.zarr']
The `fma` stores should both contain the previously missing Jan 28 data. (h/t @rabernat for showing me how to amend and reuse the existing cache.)
HYCOM50 #29
fs_osn.ls(f"{swot}/HYCOM50")
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region01_GS',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region02_GE',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region03_MD',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/grid_01.zarr',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/grid_02.zarr',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/grid_03.zarr',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_01.zarr',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_02.zarr',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_03.zarr']
For each region defined in the recipe, there are both `aso` and `fma` stores:
fs_osn.ls(f"{swot}/HYCOM50/Region01_GS/surf")
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region01_GS/surf/aso.zarr',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region01_GS/surf/fma.zarr']
fs_osn.ls(f"{swot}/HYCOM50/Region02_GE/surf")
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region02_GE/surf/aso.zarr',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region02_GE/surf/fma.zarr']
fs_osn.ls(f"{swot}/HYCOM50/Region03_MD/surf")
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region03_MD/surf/aso.zarr',
'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region03_MD/surf/fma.zarr']
@roxyboy, `surf_01.zarr`, `surf_02.zarr`, and `surf_03.zarr` are the earlier drafts where non-contiguous data is concatenated together. Do you have any use for them now that the seasonal stores are up? If not, I'll delete.
eNATL60 #24
fs_osn.ls(f"{swot}/eNATL60")
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region01',
'Pangeo/pangeo-forge/swot_adac/eNATL60/Region02',
'Pangeo/pangeo-forge/swot_adac/eNATL60/Region03']
For each of the regions, `aso` and `fma` stores are provided for the `surface_hourly` data:
fs_osn.ls(f"{swot}/eNATL60/Region01/surface_hourly")
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region01/surface_hourly/aso.zarr',
'Pangeo/pangeo-forge/swot_adac/eNATL60/Region01/surface_hourly/fma.zarr']
fs_osn.ls(f"{swot}/eNATL60/Region02/surface_hourly")
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region02/surface_hourly/aso.zarr',
'Pangeo/pangeo-forge/swot_adac/eNATL60/Region02/surface_hourly/fma.zarr']
fs_osn.ls(f"{swot}/eNATL60/Region03/surface_hourly")
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region03/surface_hourly/aso.zarr',
'Pangeo/pangeo-forge/swot_adac/eNATL60/Region03/surface_hourly/fma.zarr']
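As a usage note, any of the stores above can be opened lazily with xarray, reusing `fs_osn` and `swot` as defined at the top of this comment (drop `consolidated=True` if a given store turns out not to have consolidated metadata):

import xarray as xr

# Open one of the eNATL60 seasonal surface stores listed above.
store = fs_osn.get_mapper(f"{swot}/eNATL60/Region01/surface_hourly/fma.zarr")
ds = xr.open_zarr(store, consolidated=True)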
Next steps
@roxyboy, please let me know if you run into any issues with any of the above. Also, what should we work on next? Adding the interior data?