Proposed Recipes for Antarctic ice sheet paleo PISM ensemble
Source Dataset
Simulations of the Antarctic ice sheet over the last 20ka performed by @talbrecht using the Parallel Ice Sheet Model (PISM)
Albrecht, Torsten (2019): PISM parameter ensemble analysis of Antarctic Ice Sheet glacial cycle simulations. PANGAEA, https://doi.pangaea.de/10.1594/PANGAEA.909728
- File format: NetCDF
- How are the source files organized? One directory per ensemble member; each directory contains multiple NetCDF files corresponding to different snapshots of the model state (e.g., snapshots_-15000.000.nc) and one file (timeseries.nc) containing time series of aggregated model-state quantities.
- How are the source files accessed (e.g., FTP)? A zip file freely downloadable from https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip.
Transformation / Alignment / Merging
All ensemble members and time snapshots should be combined into one xarray Dataset with dimensions corresponding to x, y, time, and the four model parameters.
Also, all 'timeseries.nc' files (each corresponding to one ensemble member) should be collated into another single xarray Dataset.
This involves an unstack step to give each of the four parameters its own dimension in the xarray, as discussed here.
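To make the unstack step concrete, here is a minimal, runnable sketch with toy data; the parameter names p1-p4 and their values are placeholders (the real values come from the ensemble's parameter table discussed later in this thread):

import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for per-member snapshot data: two ensemble members on a tiny grid.
members = [
    xr.Dataset({"thk": (("time", "y", "x"), np.random.rand(3, 4, 5))})
    for _ in range(2)
]

# Placeholder parameter names (p1-p4) and values; the real values come from
# the ensemble's parameter table (a CSV), not from the NetCDF files.
params = pd.DataFrame(
    {"p1": [1.0, 2.0], "p2": [0.5, 0.75], "p3": [0.1, 0.2], "p4": [5.0, 10.0]}
)

# Concatenate along a new "run" dimension, attach each parameter as a
# coordinate on that dimension, build a MultiIndex, then unstack so that
# every parameter becomes its own dimension.
ds = xr.concat(members, dim="run")
ds = ds.assign_coords({k: ("run", params[k].values) for k in params.columns})
ds = ds.set_index(run=["p1", "p2", "p3", "p4"]).unstack("run")
# ds now has dims: time, y, x, p1, p2, p3, p4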
Output Dataset
One Zarr store for each of the xarrays described above (two in total).
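A minimal sketch of writing one such Dataset to Zarr (toy data and a local path; for GCS the path would be a gs:// URL, with gcsfs installed and write credentials configured):

import numpy as np
import xarray as xr

# Illustrative stand-in for one of the collated Datasets described above.
ds = xr.Dataset({"thk": (("time", "y", "x"), np.zeros((2, 3, 4)))})

# Write to a Zarr store; swap the path for a gs:// URL to write to a bucket.
ds.to_zarr("snapshots.zarr", mode="w", consolidated=True)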
Progress so far
Much of this work has been done using a larger version of the model output (with more time slices, one every 1 kyr instead of one every 5 kyr):
- all the time slices and ensemble members were collated and unstacked into the correctly shaped xarray, then uploaded to GCS: https://github.com/ldeo-glaciology/pangeo-pismpaleo/blob/main/pism_paleo_nc_to_zarr.ipynb (note that this was done on the University of Potsdam's HPC and did NOT start with the zip file linked above).
- then we made an intake catalog, [here](https://github.com/ldeo-glaciology/pangeo-pismpaleo/blob/48b16dca56d3b736b6f05acdb63ca83744c4f8d4/intake_catalog_setup.ipynb)
As described here, these data are now accessible from a Google Cloud Storage bucket, e.g.
import intake

cat = intake.open_catalog('https://raw.githubusercontent.com/ldeo-glaciology/pangeo-pismpaleo/main/paleopism.yaml')
snapshots1ka = cat["snapshots1ka"].to_dask()
mask_score_time_series = cat["mask_score_time_series"].to_dask()
These two zarrs are the result of collating all the snapshots_*.nc and the timeseries.nc files, respectively (as described above). Additionally we have
vels5ka = cat["vels5ka"].to_dask()
present = cat["present"].to_dask()
which contain just the velocities at 5 kyr resolution and the present day state (t = 0 kyr BP) of the model, respectively.
Here is a notebook showing how to access these data in pangeo.
Question for @talbrecht and @rabernat: should we make this recipe with the smaller dataset contained in the zip, or do we want to use the larger dataset? I like the larger dataset because it is large enough to start really needing clusters and it is more useful for comparing to data when you have that higher time resolution. What do you think?
Thanks @jkingslake for formulating this recipe. Yes, let's do this with the larger dataset, as the community seems to be interested in different variables (e.g., velocities) and different periods than are available in the subset I published at PANGAEA, which only contains the data necessary for the plots in the related journal publications.
@jkingslake, thanks for submitting this request. Assuming we do go with the larger dataset, where do the source files live for that? Apologies if I missed that in your initial comment; it looked to me as if the info you've provided under Source Dataset above applies to the smaller dataset only?
Also please note Pangeo Forge currently does not support unzipping of source files. Source files must be individually accessible over HTTP, FTP, etc.
Looking forward to supporting you in making this recipe a reality!
The (larger) source dataset has not been published yet; it is stored on a high-performance computer in Germany. As a temporary option, I could produce an FTP link with a password, which could be used to convert individual NetCDF data files to Zarr?
I could produce an FTP link with a password
This would work. We've done something similar before.
Out of curiosity, how large is the (larger) Source dataset? In terms of number of files and number of (giga)bytes.
It will be on the order of 50 GB and about 500 individual files (an ensemble of 256 members, each with one time-series file of spatially aggregated variables (t) and one output file containing the 2D variables over time (x, y, t))... Yes, I could prepare the FTP link...
Great. This sounds quite manageable.
Please let me know when the FTP link is available and we can begin the recipe development.
@cisaacstern, thanks for the engagement in this.
@talbrecht, you mention that people have been interested in velocities at higher time resolution than the 5 kyr we have currently. So, does this mean we should aim for 1 kyr of thickness, bed elevation, etc. (as we had before) plus the two components of velocity?
Yes, in the README you will find that for the whole ensemble I have the following variables available every 1000 years: 'thk', 'mask', 'topg', 'usurf', 'velbar_mag', 'dbdt', 'bmelt', while the two velocity components 'u_ssa', 'v_ssa' are only available every 5000 years. However, for the reference simulation (6165c) I reran the simulation with velocity output every 1000 years, so this could be a separate subset?
Also please note Pangeo Forge currently does not support unzipping of source files. Source files must be individually accessible over HTTP, FTP, etc.
This is not actually true! Fsspec can see inside zip files!
import xarray as xr
from fsspec.implementations.zip import ZipFileSystem

url = "https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip"
fs = ZipFileSystem(url)
fs.ls("datapub")  # -> list the files

with fs.open('datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc') as fp:
    ds = xr.open_dataset(fp)
    ds.load()
ds.thk.plot()
We just need to figure out how to encode the compound URL: https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip + datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc
into a single URL that fsspec understands. @martindurant will know how to do that.
Should be
import fsspec
import xarray as xr

of = fsspec.open("zip://datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc::https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip")
with of as f:
    ds = xr.open_dataset(f)
    ...
Hi, I have uploaded 35 GB of data (not zipped), which can be seen via this temporary link:
rsync rsync://rsync.pik-potsdam.de/paleo_ensemble/
or downloaded with
rsync -r rsync://rsync.pik-potsdam.de/paleo_ensemble model_data
The velocity snapshots are concatenated into one NetCDF file for each ensemble member (5 ka), while all other data can be found in the extra files (1 ka). I added a simulation, 6165c, equivalent to the reference simulation (6165), which has velocity snapshots every 1 ka.
Great! A description of the recipe development process is given here:
https://pangeo-forge.readthedocs.io/en/latest/intro_tutorial.html
(This documentation is still quite fresh, so we definitely welcome feedback on it!)
As you will see, the first step is forking this repo and creating a new subdirectory for your recipe within it. Once that happens, you don't have to wait until the recipe is complete before opening a PR. I encourage you to open a PR against this repo with an early draft, that way we can all provide feedback and support you along the development process.
Just noting that we do not necessarily need an unzipped copy of the data. As demonstrated above (https://github.com/pangeo-forge/staged-recipes/issues/90#issuecomment-932760875) we can open and download the data directly from a zip file over HTTP. It would be better to use the "official" source of the data (via Pangaea) than to create a "temporary link" because the former is more likely to be persistent.
When creating the recipe, the FilePattern formatting function could return paths of the form zip://datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc::https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip. This would eliminate the need for a temporary mirror of the unzipped data.
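As a rough illustration (not a finished recipe), a FilePattern along those lines might look like the following; the member IDs and snapshot times listed are illustrative only, and how the ensemble-member dimension is combined may need adjusting for the recipe class used:

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

ZIP_URL = (
    "https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/"
    "parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip"
)

def make_url(time, member):
    # Compound fsspec URL: a NetCDF file inside the zip archive over HTTP.
    return (
        f"zip://datapub/model_data/pism1.0_paleo06_{member}/"
        f"snapshots_{time:.3f}.nc::{ZIP_URL}"
    )

# Illustrative keys only; the full ensemble has 256 members and many more snapshots.
time_dim = ConcatDim("time", keys=[-15000.0, -10000.0, -5000.0])
member_dim = ConcatDim("member", keys=["6255", "6256"])
pattern = FilePattern(make_url, time_dim, member_dim)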
Based on https://github.com/pangeo-forge/staged-recipes/issues/90#issuecomment-932584463, it seems the high resolution dataset of interest is not yet published. It would be preferable from a provenance and recipe standpoint to build the Zarr store from published sources (zipped or otherwise). @talbrecht, how long do we expect before the data is available in published form?
@talbrecht, I am happy to start the process of making the recipe, but I will wait until we have finalized which dataset we use.
I am guessing that you weren't actually planning on publishing the higher resolution version in pangaea.
OK, I forked the repo and put in the example meta.yaml and recipe.py files to get things started. https://github.com/ldeo-glaciology/staged-recipes/tree/paleo-pism
Thanks @jkingslake for starting the recipe process. Well, my experience is that publishing the data in PANGAEA takes a couple of weeks, and the dataset would then be limited to 15 GB. I can zip the data if this is preferred. As the data are already deflate-compressed, zipping would not help much, so I would try to split them into two publications or reduce precision. Yes, the rsync link is temporary. I thought we could convert it to zarr format and store (publish) it somewhere permanently (in the cloud)?
we could convert it to zarr format and store (publish) it somewhere permanently (in the cloud)?
Yes, with the exception of the permanent part. While any zarr store we write to the cloud is likely to persist for some time, the current design of Pangeo Forge does not allow for it to serve as a permanently published version.
In cases where publishing is impractical, we can write zarr stores from temporary sources, but it's best if we can do so from published/permanent sources, so that if the zarr store were ever to disappear in the future, it can be rebuilt from the same source. In addition, working from a permanent source means that downstream data users have the option of rebuilding the same zarr store with a different parametrization to suit their research objectives (e.g., to a different location, or with different chunking, etc.).
Yes, ok, makes sense. Then I will contact PANGAEA...
@talbrecht, how do we tell what parameter values are used for each ensemble member from the NetCDFs in the zip file? I started trying to collate them (just to see if I could, not to make the final recipe), but I realized I didn't know how to tell which parameters correspond to each one. In your notebook, the parameter values come from some .csv files, not the NetCDFs.
Yes, the csv file is located in the folder "aggregated_data", available in both the PANGAEA archive and the rsync link (which I just realized seems not to be complete yet). The two tables are attached here: pism1.0_paleo06_6000.csv, le_all06_16km.txt
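For anyone following along, a hypothetical sketch of reading that table with pandas; the actual column names (including which column identifies the ensemble member) should be taken from the CSV header:

import pandas as pd

# File name taken from the attachment above; no column names are assumed here.
params = pd.read_csv("pism1.0_paleo06_6000.csv")
print(params.columns)  # inspect which column identifies the ensemble member
print(params.head())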
In cases where publishing is impractical, we can write zarr stores from temporary sources, but it's best if we can do so from published/permanent sources, so that if the zarr store were ever to disappear in the future, it can be rebuilt from the same source.
I think we need to think through this scenario more carefully. Under what circumstances can we actually just publish the data? Not all contributors will have their data in an existing repository? I think we should support this somehow. As @talbrecht said, most existing repositories have very small limits on the size of their archives. Pangeo Forge can help get around that limitation.
Despite what I said above in https://github.com/pangeo-forge/staged-recipes/issues/90#issuecomment-938729722, at this [early, experimental] stage of development of Pangeo Forge, I don't think we should exclude data that is just stored temporarily on an FTP server, especially if its size exceeds what is possible in existing "official" repositories. Perhaps we should move ahead with the recipe outside of PANGAEA.
Under what circumstances can we actually just publish the data? Not all contributors will have their data in an existing repository?
Should we start an Issue in https://github.com/pangeo-forge/roadmap/issues to discuss a generalized "policy" for this?
(That discussion doesn't have to block work on this particular recipe, of course.)
I have submitted a new dataset to Pangaea, including ice thickness, bed topography, basal melt rates (about 15GB) and ice flow velocity components (about 7GB). Unfortunately, they are "currently facing a high rate of data submissions ... and thus the editorial process and minting of DOI names might take up to 12 weeks." I'll keep you updated...
Then let's go ahead and make the data submission via the FTP server.
OK, the rsync link mentioned above should now point to the two zipped datasets, which will hopefully be published in PANGAEA in a few weeks...
rsync rsync://rsync.pik-potsdam.de/paleo_ensemble/
or downloaded with
rsync -r rsync://rsync.pik-potsdam.de/paleo_ensemble model_data
Thanks a lot @talbrecht! I'm really excited about this.
I want to clarify that Pangeo Forge cannot use rsync protocol. (We only support protocols that fsspec has implemented.) So we need to be able to access the data via http / [s]ftp, etc. Is there any other protocol available? @martindurant - do you have any insight on this?
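For reference, here is a quick way to list the protocols fsspec knows about (a small sketch, nothing Pangeo Forge specific assumed):

import fsspec

# Protocols fsspec has implementations for; rsync is not among them,
# while http/https and ftp are.
print(fsspec.available_protocols())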
You could try this link from my personal website: http://www.pik-potsdam.de/~albrecht/pism_pangeo/ ?!
Did anyone try the http link?
Did anyone try the http link?
Hi @talbrecht, thanks for following up.
@jkingslake, is this recipe something you are interested in working on? Pangeo Forge is designed as a platform to support recipe contributions from the community, and we would love to have your participation. If so, the Introduction Tutorial is a great place to start, and I can happily respond to any questions you have. (If that Tutorial is unclear in any way, we can also use your feedback to improve it.)