AWS Data Clients
Description Of Changes
This adds clients that make it possible to request a range of products over a time range, or the single product closest to a given time, based on a product ID and/or site. This applies to the `noaa-nexrad-level2`, `unidata-nexrad-level3`, and `noaa-goes1[6-8]` S3 buckets.
This is the first PR in recent times that adds remote data access capabilities to MetPy (again).
Checklist
- [ ] Tests added
- [ ] Fully documented
This is still very much a work in progress, with tests and docstrings still needed, but I want to see if anyone has API/interface thoughts. For example: is the class-instance-based interface confusing? Naming? How's the use of the `Product` class to wrap returns? Should `get_product()` be `get_nearest()`? Or some other names?
Todo:
- GOES needs the range query capability implemented
- Better examples for each of the archives
- One example that combines data from all 3 archives (nexrad 2, nexrad 3, and some GOES) would be sweet. An example with radar on top of satellite with maybe tornado detections comes to mind
- Need test infrastructure for these
- Are `parse()`, `download()`, and `file` all we need for `Product`? (See the sketch below.)
- How do we want to handle the reliance on `boto3`?
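For reference, here's a rough sketch of how I picture these pieces fitting together; the exact names and signatures are precisely what's up for discussion, so treat everything here as illustrative:
from datetime import datetime

from metpy.remote import GOES16Archive

# Illustrative only: the product ID and method signatures are placeholders
prod = GOES16Archive().get_product('ABI-L2-MCMIPC', datetime(2022, 1, 1, 6, 15))
prod.download()      # save the file locally
data = prod.parse()  # parse/open the data
print(prod.file)     # the underlying file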
Ping @kgoebber @deeplycloudy
I'll add my 2 cents based on my experience with goes2go...apologies for the long comment.
For comparison, the `goes2go` API is roughly this...
from goes2go import GOES
G = GOES(satellite=16, product="ABI-L2-MCMIP", domain='C')
# each of these downloads then reads the data with xarray
G.nearesttime('2022-01-01 6:15')
G.latest()
G.timerange(start='2022-06-01 00:00', end='2022-06-01 01:00')
G.timerange(recent='30min')
Main comment
Even though the GOES data is in different buckets for each satellite, it's practically a single data type (just as NEXRAD data is split across different sites, GOES data is split across different satellites).
Instead of
from metpy.remote import GOES16Archive, GOES17Archive, GOES18Archive
I would prefer a single import and then specify the satellite in my request. Something like...
from metpy.remote import GOESArchive
GOESArchive(satellite=16)
This would make it easier for a user to change the satellite they want without adding/changing an import.
Minor comments
One feature of `goes2go` is the ability to download files locally and then read the local file, if it exists, rather than going back to AWS for it. This has been popular with users who want to work with some data offline or reuse files a lot (e.g., for a case study). This might be out of scope for this PR.
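For illustration, that reuse logic boils down to checking a local download directory before touching S3; here's a minimal sketch (these names are made up, not goes2go's actual internals):
from pathlib import Path

def fetch(bucket, key, save_dir='~/data'):
    """Reuse a previously downloaded file if present; otherwise pull from S3."""
    local = Path(save_dir).expanduser() / key
    if local.exists():
        return local  # work offline; skip AWS entirely
    local.parent.mkdir(parents=True, exist_ok=True)
    bucket.download_file(key, str(local))  # boto3's Bucket.download_file
    return local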
From an ease-of-use perspective, I find it easier to write code when date inputs for an API like this can optionally be given as a datetime string (I've always used `pandas.to_datetime` to parse these strings because they can be formatted in different ways; maybe there are other ways).
# This is easier to write and read...
GOESArchive().get_product(dt="2021-01-01 06:00")
GOESArchive().get_product(dt="20210101T06")
# than this...
GOESArchive().get_product(dt=datetime(2021,1,1,6,0))
Not all of it is pretty, but I'd be happy to share other aspects of `goes2go` and why I did certain things if you're interested. When this is merged, I might update `goes2go` to use this instead; this PR seems more robust and future-proof.
@blaylockbk Thanks for the feedback! The point about using a single GOES client with different satellites is a really good suggestion, thanks.
I'd love to hear more considerations based on your experience with goes2go. I'm certainly staring at your comment above with the API, thinking over the strengths and weaknesses of the approach I took in comparison.
I'm mixed on the idea of accepting strings for date/time; on one hand, it does seem to make it easy; on the other, it seems really weird to couple this code to pandas functionality when it otherwise isn't using pandas at all. I don't think specifying one particular string format would be nearly as handy, though. (Given that pandas is already a MetPy dependency, this is probably something I just have to get over.) Are there some use cases for supporting the string input that aren't direct user input (i.e., pulling from another source)?
> I'm mixed on the idea of accepting strings for date/time; on one hand, it does seem to make it easy; on the other, it seems really weird to couple this code to pandas functionality when it otherwise isn't using pandas at all.
@dopplershift, yeah that's just a personal preference because I'm lazy. Anytime I write a function with date or datetime input I convert it using pandas (about 80% of the time I already imported pandas for something else); probably not the best practice for everyone...
import pandas as pd
from datetime import datetime, timedelta

def my_function(date, delta):
    date = pd.to_datetime(date)
    delta = pd.to_timedelta(delta)
    return date + delta

a = my_function("2023-01-01 06:00", "3H")
b = my_function(datetime(2023, 1, 1, 6, 0), timedelta(hours=3))
It's mainly a convenience for direct user input (notebooks), but often I have a list of ISO dates I need to loop over. Not a problem if you don't like it.
Another "feature" would be allowing an alias "east" or "west" that switches to the appropriate satellite depending on which was operational for the date requested.
GOESArchive(satellite="west") # could be 17 or 18, depending on the date requested
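Resolving the alias could be as simple as a date comparison; a minimal sketch (the cutover date is approximate, and this ignores the pre-GOES-R era):
from datetime import datetime

GOES18_WEST_CUTOVER = datetime(2023, 1, 4)  # approximate operational handover

def resolve_satellite(satellite, request_time):
    # Map 'east'/'west' aliases to a satellite number for the requested date
    if satellite == 'east':
        return 16
    if satellite == 'west':
        return 18 if request_time >= GOES18_WEST_CUTOVER else 17
    return satellite  # already a number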
> It's mainly a convenience for direct user input (notebooks), but often I have a list of ISO dates I need to loop over. Not a problem if you don't like it.
Eh, it doesn't have to match my personal preferences necessarily. It's about the engineering trade-offs. It's entirely possible the complexity/coupling is worth it to yield a better user experience. That's why I'm trying to figure out what the concrete benefits are.
To test this functionality I modified python-training#136 to plot GLM data. See my comment there for what that looks like. The `GOESArchive` client works well to pull the data.
Some thoughts on aspects of data use unique to GLM:
- GLM data come in 20 s bundles. That's usually too short a window to give a representative view of "lightning at this time," so one has to download multiple files and loop over many Datasets. Is that a user convenience worth adding? Operational use defaults to a 5 min aggregation. Is it worth trying to match one of the ABI cadences (full disk, CONUS, or mesoscale)?
- I can contribute code for concatenating GLM files into one Dataset, though it's a full screen of code.
- Subsetting the GLM LCFA files in a self-consistent way requires something like glmtools to handle the flash-group-event tree. If you just want to plot flashes and groups in whatever field of view you have, this step is not necessary, but I could see this question arising if someone wanted to do a more sophisticated data reduction.
- A better visualization solution for most users would be the GLM gridded imagery, which Unidata provides in real time through THREDDS; but since those are a pseudo-operational product, they are not on NOAA's S3. NASA has kindly added them as L3 products to their archive. They are on S3 but behind an EarthData login; if they were more open, I'd love to add them to the `GOESArchive` to abstract across data repositories.
- They come in 1 min files, and so are also usually aggregated before use. That's a pretty trivial operation in xarray (see the sketch below).
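For example, summing the 1 min flash extent density grids into 5 min accumulations is roughly this (the file pattern and variable name are illustrative):
import xarray as xr

# Concatenate the 1 min gridded GLM files along time, then sum into 5 min bins
ds = xr.open_mfdataset('OR_GLM-L3-GLMC-M3_G16_*.nc', combine='by_coords')
fed_5min = ds['flash_extent_density'].resample(time='5min').sum()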
I had occasion to think about model data, which is also now increasingly on S3. I wanted to document that here for further rumination about API design.
The docs for the GEFS at the link above are somewhat out of date. For yesterday's data, `atmos` is in the path and `0p50` in the filenames, for reasons that are unclear. Earlier years (e.g., 2017) have a different structure.
There are also many file types, but I needed geopotential height, which is in the "popular variables" file type. For one time for one member, the key looks like:
gefs.20231106/00/atmos/pgrb2ap5/gep01.t00z.pgrb2a.0p50.f120
Below is code for downloading and concatenating all the ensemble members for one time. It shows the parameters that need to be templated for this (admittedly narrow) use case.
import os
from datetime import datetime

import boto3
import botocore
import numpy as np
from botocore.client import Config

outpath = '/data/'
S3bucket = 'noaa-gefs-pds'
s3 = boto3.resource('s3', config=Config(signature_version=botocore.UNSIGNED,
                                        user_agent_extra='Resource'))
bucket = s3.Bucket(S3bucket)
s3ymd = datetime(2023, 11, 6).strftime('%Y%m%d')
s3hr = 0  # 06, 12, 18
s3members = np.arange(1, 20 + 1, 1)
s3fhour = 60  # arange(0, 384, 6)
for s3member in s3members:
    # Prefix for the run, then the per-member/forecast-hour file name
    S3baserun = f"gefs.{s3ymd}/{s3hr:02d}/atmos/pgrb2ap5/"
    S3grib = f"gep{s3member:02d}.t{s3hr:02d}z.pgrb2a.0p50.f{s3fhour:03d}"
    S3key = S3baserun + S3grib
    outfile = os.path.join(outpath, S3grib)
    with open(outfile, 'wb') as fileobj:
        bucket.download_fileobj(S3key, fileobj)
All the members for one time can then be concatenated with `cfgrib` as follows:
import os
from glob import glob

import xarray as xr

f060grib = glob(os.path.join(outpath, '*f060'))
gribs = xr.open_mfdataset(f060grib, engine="cfgrib",
                          combine='nested', concat_dim='ens',
                          backend_kwargs={'filter_by_keys':
                                          {'typeOfLevel': 'isobaricInhPa', 'shortName': 'gh'}})
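The resulting `gribs` Dataset stacks the 20 members along the new `ens` dimension, with geopotential height on isobaric levels as the lone variable.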