
Example pipeline for AMPS output stored at NCAR

porterdf opened this issue 4 years ago • 19 comments

Source Dataset

Retrieving output from the AMPS archive on NCAR HPSS (a tape archive quickly nearing its end of life) for public storage on Google Cloud Storage and access via Pangeo and, more generally, xarray.

The Antarctic Mesoscale Prediction System (AMPS) is a real-time weather forecasting tool primarily in support of the NSF's United States Antarctic Program (USAP). It consists of the assimilation of surface and upper-air observations into the Weather Research and Forecasting (WRF) model, forced at its boundaries by the GFS model. There are two outer nested pan-Antarctic domains with several additional higher-resolution domains over areas of interest.

  • https://www2.mmm.ucar.edu/rt/amps/information/information.html
  • NetCDF, GRIB
  • One WRF time slice per history file. Generally there are two simulations per day (00Z and 12Z initializations) with varying model integration lengths; the history-file interval is generally 3 hours. Because AMPS is an operational product, the grid resolution, domain size/location, model code, parameterizations/physics, and setup have not been held consistent over the history of the project. This means the model output moved to Google Cloud Storage will be a balanced mix of these configurations, with priority given to regions or processes of immediate or planned interest.
  • The files reside on the HPSS tape archive, so the first task is to pull them from tape to local 'online' GLADE storage, where they sit temporarily while awaiting transfer off-site (a minimal pull sketch follows this list). Options for transfer include Globus, bbcp, and scp/sftp.
    • https://www2.mmm.ucar.edu/rt/amps/information/amps_archive_hpss_details.html
  • The data is on the NCAR HPC, so an account and appropriate credentials are required. Since there will be no computation involved, just data movement, we suggest requesting a "Small" allocation account.
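
For the HPSS-to-GLADE pull, here is a minimal sketch driving NCAR's hsi utility from Python; the HPSS and GLADE paths are placeholders, and this assumes it runs on an NCAR system with HPSS access:

```python
import subprocess

# Placeholder paths, for illustration only
hpss_path = "/AMPS/wrf/2019/wrfout_d03_2019-01-01_00:00:00"
glade_path = "/glade/scratch/username/AMPS/wrfout_d03_2019-01-01_00:00:00"

# hsi's get syntax is "get <local> : <hpss>"; hsi must be on PATH
subprocess.run(["hsi", f"get {glade_path} : {hpss_path}"], check=True)
```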

Transformation / Alignment / Merging

The raw WRF output files are in NetCDF (and GRIB) format and usable by many software packages; however, several common post-processing procedures can make the data both smaller and more usable (e.g. converting to pressure-level data, subsetting fields, and de-staggering winds).
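
The wrf-python package provides helpers for exactly these steps; a minimal sketch of wind de-staggering and pressure-level interpolation (the input filename is a placeholder):

```python
from netCDF4 import Dataset
from wrf import getvar, interplevel

# Open a single raw WRF history file (path is a placeholder)
ncfile = Dataset("wrfout_d03_2019-01-01_00:00:00")

# getvar returns diagnostics on the de-staggered mass grid
p = getvar(ncfile, "pressure")  # full model pressure, hPa
ua = getvar(ncfile, "ua")       # de-staggered u-wind component
va = getvar(ncfile, "va")       # de-staggered v-wind component

# Interpolate the winds to the 500 hPa pressure level
ua_500 = interplevel(ua, p, 500.0)
va_500 = interplevel(va, p, 500.0)
```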

Output Dataset

In both cases, these raw or post-processed NetCDF files should be converted to a Zarr store for optimization in cloud-based xarray routines. Ideally this conversion (e.g. using the xarray to_zarr method) would occur either on the NCAR HPC or within Pangeo or a similar cloud-computing environment.
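
A minimal sketch of that conversion, assuming a set of post-processed history files and a hypothetical GCS bucket:

```python
import xarray as xr

# Combine per-time-slice WRF history files along the (WRF-named) Time
# dimension; the glob pattern and bucket path are placeholders.
ds = xr.open_mfdataset("wrfout_d03_*.nc", combine="nested", concat_dim="Time")

# Writing to a gs:// URL requires gcsfs to be installed and GCS
# credentials configured in the environment.
ds.to_zarr("gs://my-bucket/amps/domain_03.zarr", consolidated=True)
```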

porterdf avatar Aug 22 '20 21:08 porterdf

This is an interesting example because it requires ssh access to the supercomputer. That sounds very hard to automate. For example, my NCAR login uses two-factor authentication, so it's impossible to use from a script.

We would want to consult with CISL about the best way to do this. The page on data transfers has lots of useful information.

One thing that might work is the following:

  • Create a cron job that runs on cheyenne to copy the data from HPSS to Glade
  • Pangeo forge uses Globus to download the data from Glade

rabernat avatar Aug 25 '20 13:08 rabernat

it requires ssh access to the supercomputer. That sounds very hard to automate

Could pexpect help?
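
Something like this, perhaps (a sketch only; the host, paths, and credential handling are hypothetical, and an interactive two-factor prompt would still defeat it):

```python
import pexpect

# Drive an scp transfer by answering the password prompt programmatically
child = pexpect.spawn("scp user@cheyenne.ucar.edu:/glade/scratch/user/file.nc .")
child.expect("password:")
child.sendline("not-a-real-password")  # would come from a secret store in practice
child.expect(pexpect.EOF)
```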

davidbrochart avatar Aug 25 '20 13:08 davidbrochart

Yes, I figured this use case was just different enough from the existing staged recipe, and potentially more broadly applicable, that its challenges are worth thinking about.

I am currently doing (most of) what the recipe describes, though much of it incrementally and by hand. Some scripts retrieve data from HPSS in a 'friendly' way, while others transfer it either directly to a Google Cloud bucket or to our group's Linux server.

This example pipeline was submitted at the suggestion of @tjcrone, who's working with Jonny Kingslake and me on some glaciology sub-projects.

porterdf avatar Aug 29 '20 22:08 porterdf

If we can get a bakery running somewhere inside the NCAR network, that should make this possible.

See https://github.com/pangeo-forge/pangeo-forge/issues/41 for GRIB inputs.

rabernat avatar Jan 24 '21 15:01 rabernat

This should be ready to go now, if we can figure out a way to automatically suck files out of HPSS / Glade. @mgrover1 mentioned that he might be able to work with the CISL folks to help figure this out.

Max, what kind of options do you see? Globus would probably be the default way to go. But Globus works pretty differently than our existing setup. I also imagine the authentication is a pain. It would be great if we could use one of the existing implemented fsspec protocols to pull data from NCAR.

rabernat avatar Apr 06 '21 01:04 rabernat

It seems like GLOBUS is the recommended way to go... do you have sample filepaths I can test with?

mgrover1 avatar Apr 06 '21 02:04 mgrover1

You would need to ask @porterdf, the creator of this example.

However, I have concerns about globus, as noted above. There are many obstacles to implementing globus-based transfer, most practically, the lack of an fsspec implementation for globus. I also find the globus API extremely confusing. Is there any way we could just use http, scp, ftp, anything else from the list?
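
For what it's worth, fsspec already ships an SFTP implementation (backed by paramiko), so in principle something like this sketch could pull files from GLADE; the host and paths are hypothetical, and two-factor login remains the obstacle:

```python
import fsspec

# fsspec's "sftp" protocol wraps paramiko; extra kwargs are passed
# through to SSHClient.connect.
fs = fsspec.filesystem("sftp", host="data-access.ucar.edu", username="myuser")
fs.get(
    "/glade/scratch/porterdf/AMPS/WRF_24/domain_03/somefile.nc",
    "somefile.nc",
)
```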

rabernat avatar Apr 06 '21 02:04 rabernat

Okay - I will reach out to him.

There are scp and sftp options, although the CISL documentation explicitly states "They are best suited for transferring small numbers of small files (for example, fewer than 1,000 files totaling less than 200 MB). For larger-scale transfers, we recommend using Globus."

mgrover1 avatar Apr 06 '21 02:04 mgrover1

I've been transferring files over in piecemeal fashion, mostly using the methodology sketched out in our repo's readme: pangeo-AMPS

However, there is still more AMPS output we will eventually be moving to GCP, so I'm happy to contribute any way I can and kick it down the road a bit further. I've created a directory on GLADE for some files to be transferred; perhaps they are ripe for testing a recipe?

/glade/scratch/porterdf/AMPS/WRF_24/domain_03
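
For reference, a recipe for files like these might look roughly as follows with the pangeo-forge-recipes FilePattern API, assuming the files were exposed at some HTTP(S) location; the URL template, filename convention, date range, and chunking below are all invented:

```python
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# One history file per 3-hourly time slice (date range is a placeholder)
times = pd.date_range("2019-01-01", "2019-01-31", freq="3H")

def make_url(time):
    # Hypothetical URL template; the files would first need to be
    # exposed over HTTP(S), which is the crux of this whole thread.
    return (
        "https://data.example.org/AMPS/WRF_24/domain_03/"
        f"wrfout_d03_{time:%Y-%m-%d_%H:%M:%S}"
    )

pattern = FilePattern(make_url, ConcatDim("time", times, nitems_per_file=1))
recipe = XarrayZarrRecipe(pattern, target_chunks={"Time": 8})
```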

porterdf avatar Apr 07 '21 19:04 porterdf

@rabernat would the bakery move all these files at the same time (such that if they were in small chunks, they would be under the recommended limit)? I am planning on testing this workflow with the CESM2-LE dataset as well... there is also an internal S3-like storage system (called Stratus) that I found out about this week (not sure if there would be a way to "stage" files on there?)

mgrover1 avatar Apr 10 '21 23:04 mgrover1

would the bakery move all these files at the same time

The default mode of operation is to just start up a couple of simultaneous HTTP transfers.

I think that we should find a way to use globus for these transfers from NCAR. The transfer would have to occur outside of the recipe itself. We will have to think about how to integrate pangeo forge with other "file transfer services".

there is also an internal s3-like storage system (called stratus) which I found out about this week (not sure if there would be a way to "stage" files on there)?

If it's on the public internet, yes, this would be useful.
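
If Stratus speaks the S3 API and is publicly reachable, fsspec's s3 backend should work by pointing it at a custom endpoint; a sketch in which the endpoint URL and bucket name are assumptions:

```python
import fsspec

# s3fs passes client_kwargs through to botocore, so a non-AWS
# S3-compatible endpoint works.
fs = fsspec.filesystem("s3", client_kwargs={"endpoint_url": "https://stratus.ucar.edu"})
print(fs.ls("amps-staging"))  # hypothetical bucket name
```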

rabernat avatar Apr 11 '21 22:04 rabernat

I had no problems using Globus to transfer WRF output from NCAR to a local server, but additional steps/features/premium options are needed to set up a GCP bucket as an endpoint (I hit this wall and reverted to parallel rsync'ing through gsutil).

porterdf avatar Apr 15 '21 12:04 porterdf

Just a note here to say that I have been looking at ISMIP6 data as another option for a pangeo-forge recipe and it is stored at ghub, who pointed me to Globus to access the data. @porterdf, I suspect I ran into the same issues you did when looking into setting up an endpoint in a GCS bucket. Did you find these instructions and then get lost, like I did?

jkingslake avatar Oct 05 '21 22:10 jkingslake

I think we need to talk to the Globus folks about the best way to have Globus work with Pangeo Forge.

rabernat avatar Oct 06 '21 21:10 rabernat

@rabernat any updates on pinging someone from Globus on this?

ricardobarroslourenco avatar Mar 09 '22 19:03 ricardobarroslourenco

@ricardobarroslourenco I invited some of the Globus folks to the meeting on Monday, and pointed them to #222 over in the Pangeo Forge Recipes repo

mgrover1 avatar Mar 09 '22 19:03 mgrover1

We made some progress on globus in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/222. The key trick is to create a Globus Guest Collection, which allows you to access the data over HTTP. So using this mechanism, we can access globus data today with Pangeo Forge.
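
Concretely, a file in a Guest Collection gets a plain HTTPS URL that fsspec/xarray can read directly; a sketch (the URL below is made up):

```python
import fsspec
import xarray as xr

# Guest Collections expose files over vanilla HTTPS; h5netcdf can read
# netCDF4 data from a file-like object.
url = "https://g-123456.0ed28.5898.data.globus.org/AMPS/wrfout_d03.nc"
with fsspec.open(url) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```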

However, I have learned from CISL support that NCAR does not have a Globus v5 subscription yet, which means it is not possible to create guest collections on Glade. Anyone reading this who wants to move the issue forward should reach out to NCAR and encourage them to upgrade their subscription.

We can also continue to pursue a Globus client for Pangeo Forge. That would allow us to use the existing recommended way to share NCAR data via Globus. Because that requires a Globus login (rather than vanilla HTTP access), it is more complicated on our end.
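
For reference, the Globus-native route via globus-sdk looks roughly like this; the endpoint UUIDs and paths are placeholders, and the interactive OAuth2 login is exactly the complication mentioned above:

```python
import globus_sdk

CLIENT_ID = "..."  # a registered Globus native-app client ID

# Interactive native-app login flow
client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow()
print("Log in at:", client.oauth2_get_authorize_url())
tokens = client.oauth2_exchange_code_for_tokens(input("Auth code: "))
token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# Submit an asynchronous endpoint-to-endpoint transfer
tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(token))
tdata = globus_sdk.TransferData(tc, "SRC-ENDPOINT-UUID", "DST-ENDPOINT-UUID")
tdata.add_item("/glade/scratch/porterdf/AMPS/somefile.nc", "/amps/somefile.nc")
print("Submitted task:", tc.submit_transfer(tdata)["task_id"])
```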

rabernat avatar Mar 29 '22 20:03 rabernat

@rabernat I was interested in fixing up a recipe for iTRACE output, but came up against an authentication issue. I don't have a nuanced understanding of what data lives where and is accessible by what means, but while hoping that a Globus-NCAR pipeline had sprung up recently, I stumbled across a headline suggesting that NCAR may have upgraded its subscription.

(Also, thanks for all the useful tidbits of information you seem to pepper this corner of the internet with. On multiple occasions, your breadcrumbs have either solved my problem or made it clear that it would require an entirely different approach.)

jordanplanders avatar Jan 27 '23 06:01 jordanplanders

However, I have learned from CISL support that NCAR does not have a Globus v5 subscription yet, which means it is not possible to create guest collections on Glade.

FYI I checked and NCAR CISL confirmed that they have since transitioned to Globus v5 (in November 2022). (cc @jbusecke)

TomNicholas avatar Oct 26 '23 21:10 TomNicholas