pyschism
`hrrr3.HRRR` fails when fetching 2 day forecast
The 6-hour cycles of HRRR (on either NOMADS or AWS) have 48-hour forecast data. However, when using `hrrr3.HRRR` in the following script, it fails to download the data:
```python
import pandas as pd
import numpy as np
from matplotlib.transforms import Bbox
from datetime import datetime, timedelta
from pyschism.forcing.nws.nws2 import hrrr3

box = Bbox([[-70, 44], [-69, 45]])
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)
end = last_cycle + timedelta(days=2)
rndelta = end - start
rnday = rndelta.total_seconds() / 86400.
hrrr3.HRRR(start_date=start, rnday=rnday, record=2, bbox=box)
```
with the following error:
```
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "LIBPATH/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "LIBPATH/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "REPOPATH/pyschism/forcing/nws/nws2/hrrr3.py", line 129, in gen_sflux
    inventory = AWSGrib2Inventory(date, record, pscr)
  File "REPOPATH/pyschism/forcing/nws/nws2/hrrr3.py", line 46, in __init__
    for obj in page['Contents']:
KeyError: 'Contents'
"""

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [59], line 1
----> 1 hrrr3.HRRR(start_date=start, rnday=rnday, record=2, bbox=box)

File REPOPATH/pyschism/forcing/nws/nws2/hrrr3.py:122, in HRRR.__init__(self, start_date, rnday, pscr, record, bbox)
    119 logger.info(f'npool is {npool}')
    120 pool = mp.Pool(npool)
--> 122 pool.starmap(self.gen_sflux, [(date, record, pscr) for date in datevector])
    123 # [self.gen_sflux(date, record, pscr) for date in datevector]
    125 pool.close()

File LIBPATH/python3.9/multiprocessing/pool.py:372, in Pool.starmap(self, func, iterable, chunksize)
    366 def starmap(self, func, iterable, chunksize=None):
    367     '''
    368     Like `map()` method but the elements of the `iterable` are expected to
    369     be iterables as well and will be unpacked as arguments. Hence
    370     `func` and (a, b) becomes func(a, b).
    371     '''
--> 372     return self._map_async(func, iterable, starmapstar, chunksize).get()

File LIBPATH/python3.9/multiprocessing/pool.py:771, in ApplyResult.get(self, timeout)
    769     return self._value
    770 else:
--> 771     raise self._value

KeyError: 'Contents'
```
Note: the line numbers might have shifted since I added some print calls to debug.
This does not fail if I try it with the old implementation, `hrrr.HRRR`. However, the limitation of NOMADS is that it only has archive data from the day before:
```python
...
from pyschism.forcing.nws import HRRR

hrrr = HRRR()
hrrr.write('RUNDIR_PATH', 2, start, rnday, True, True, True, box, True)
```
@SorooshMani-NOAA This error happens when the requested data is not available yet on aws.
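For context: the failing loop in `AWSGrib2Inventory` iterates `page['Contents']`, and an S3 `list_objects_v2` page omits the `Contents` key entirely when the requested prefix has no objects yet. A minimal sketch of a defensive pattern (hypothetical helper, not the actual pyschism code; the simulated pages stand in for a boto3 paginator):

```python
def keys_from_pages(pages):
    """Collect object keys, tolerating pages without a 'Contents' entry."""
    keys = []
    for page in pages:
        # list_objects_v2 omits 'Contents' when the prefix matches nothing,
        # which is what raises KeyError in the current code
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys

# Simulated paginator output: one populated page, one empty page
pages = [{'Contents': [{'Key': 'hrrr.t00z.wrfsfcf00.grib2'}]}, {}]
print(keys_from_pages(pages))  # ['hrrr.t00z.wrfsfcf00.grib2']
```

With `page.get('Contents', [])` the inventory would come back empty for an unavailable cycle instead of crashing the worker pool, and the caller could then report which dates are missing.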
@SorooshMani-NOAA I would suggest you use gfs2 and hrrr3 because there are issues with original gfs/hrrr data.
Below are plots for `hrrr` and `hrrr3`. As you can see, the original `hrrr` puts data from the curvilinear grid onto a rectilinear grid, which results in the white area of missing values (very large values). SCHISM cannot recognize missing values and takes them as input.
I cannot remember the exact issues with the original gfs. After comparing it with Dan's sflux, there were some big errors between the two datasets. That's the reason I switched to the original grib2 format.
@cuill, that's true, but the forecast data is actually available. For the 6-hour cycles, the data should be available on AWS about 2 hours after the actual cycle, just like on NOMADS. However, unlike the old `hrrr.HRRR` implementation, `hrrr3` doesn't look for the forecast stamps; it just tries to find files with a specific date & time stamp (as far as I understand).
So if I want a 2-day forecast, I need to look at a 6-hour cycle that has forecasts into the future. This is what I'm doing in the script above (snippet below):
```python
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)
end = last_cycle + timedelta(days=2)
```
To quote myself from the ticket you don't have access to:
...it seems that on cycles 00, 06, 12, 18 there's a 48 hour extended forecast (I didn't know this!). The same thing appears to be true in case of BDP S3 buckets for 6-hour cycles...
But the `hrrr3` implementation just looks at the date stamp (e.g. `20220930` and `t13z` in `noaa-hrrr-bdp-pds/hrrr.20220930/conus/hrrr.t13z.wrfsfcf01.grib2`), not the forecast stamp (e.g. `f01` in the same file). For HRRR hourly cycles, each time stamp (e.g. `20220930` and `t13z`) has 19 forecast files (`f00`, `f01`, ..., `f18`), while cycles 00, 06, 12, 18 (e.g. `20220930` and `t00z` or `t06z` or ...) have 49 (`f00`, `f01`, ..., `f48`). `hrrr.HRRR` correctly fetches this data, but `hrrr3` doesn't.
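To make the layout concrete, here is a small sketch (hypothetical helper; the file-name pattern is taken from the bucket path quoted above) that enumerates the forecast stamps one cycle should have:

```python
from datetime import datetime

def hrrr_forecast_keys(cycle: datetime):
    """Build the S3 keys for every forecast file of one HRRR cycle.

    Per the bucket layout described above: cycles 00/06/12/18Z extend
    to f48, while the other hourly cycles stop at f18.
    """
    max_fhr = 48 if cycle.hour % 6 == 0 else 18
    prefix = f"hrrr.{cycle:%Y%m%d}/conus"
    return [f"{prefix}/hrrr.t{cycle:%H}z.wrfsfcf{fhr:02d}.grib2"
            for fhr in range(max_fhr + 1)]

keys = hrrr_forecast_keys(datetime(2022, 9, 30, 6))
print(len(keys))   # 49
print(keys[0])     # hrrr.20220930/conus/hrrr.t06z.wrfsfcf00.grib2
```

An inventory that walks the forecast stamps like this, rather than only the date/time stamp, is essentially what `hrrr.HRRR` does and `hrrr3` currently does not.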
> I would suggest you use gfs2 and hrrr3 because there are issues with original gfs/hrrr data.
@cuill thanks for this information. I would really like to use `hrrr3`; that's the first thing I tried, but due to the issue I described in the previous comment, I cannot. I manually went to the S3 bucket and checked: the forecast file was there, but `hrrr3` wouldn't pick it up, and I assume that's because it's not looking at the forecast stamps.
Here it seems to be using just the datetime stamp: https://github.com/schism-dev/pyschism/blob/96e52fd54ff9beacc5e30e7fe2821989d0b67bd4/pyschism/forcing/nws/nws2/hrrr3.py#L36-L60
@SorooshMani-NOAA With your script, rnday is 3.5 days. HRRR has 49 forecast records.
hrrr3 considers both hindcast and forecast scenarios. For hindcast, rnday is the real run period; set record = 1 (day), which will generate rnday files, each with 24 timestamps. For forecast, if you start from today, set rnday=1 and record=2 (days), which will generate one file with 48 timestamps.
As suggested by Dan, hrrr3 can take the forecast at t00z, t06z, t12z, t18z, which has 49 points and covers a longer period than t01z, t02z, t03z, etc., which only have 19 points.
For example, if your nearest cycle is 2022093000, it will generate hrrr_2022093000.nc, likewise, if your nearest cycle is 2022093006, it will generate hrrr_2022093006.nc.
Here is the script I just used to generate `hrrr_2022093006.nc` (imports added for completeness):

```python
from time import time
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from pyschism.mesh import Hgrid
from pyschism.forcing.nws.nws2.hrrr3 import HRRR

t0 = time()
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
print(last_cycle)

rnday = 1
record = 2
hgrid = Hgrid.open('../../static/hgrid.gr3', crs='epsg:4326')
pscr = '/sciclone/pscr/lcui01/HRRR/'
hrrr = HRRR(start_date=last_cycle, rnday=rnday, pscr=pscr, record=record, bbox=hgrid.bbox)
print(f'It took {(time()-t0)/60} mins to process {record*24} files/day')
```
@cuill, thank you for the information. So are you saying that, to resolve my issue for a `rnday=3.5` run, I need to use `record=1` for the hindcast with `rnday=1.5`, then call `hrrr3` again with `record=2` for the forecast part with `rnday=1`, and it will take 2 days of forecast automatically because my cycle matches the 49-point forecast?

I'm OK with doing this in my scripts, but it is very implicit and opaque to a new user. From a usability perspective, I think it makes more sense for pyschism to take `rnday` into account internally, so that if I need 2 days of forecast I pass `rnday=2` instead of `1`. In my opinion, a user doesn't need to know the details of HRRR, GFS, etc.; they just need to know for what period they need data and how long their simulation is. Passing `1` instead of `2` for the forecast section implies that the user knows how HRRR data is organized, which defeats the purpose of having the pyschism abstraction.

Also, it might be beneficial if pyschism could automatically distinguish forecast vs. hindcast, so that I don't need to make multiple calls to `hrrr3.HRRR` to get both segments of the simulation.

Maybe this is not a high priority, since "this just works" once I know what to do with it, but I think it's an important improvement for new users down the line.
With your script, it generates the following:

```
last_cycle is 2022-09-30 12:00:00
start is 2022-09-29 00:00:00
rnday is 3.5
```
So you can set rnday = 2, because only two days (2022-09-29, 2022-09-30) are available during your cycle, and set record = 2; then it will generate `hrrr_2022092900.nc` and `hrrr_2022093000.nc`, both with 48 timestamps. The remaining 1.5 days will be covered by sflux1.
No, I would say the remaining 0.5 days will be covered by sflux1, 1 day covered by `hrrr_2022092900.nc`, and 2 days covered by `hrrr_2022093000.nc`.
I think there's a miscommunication. Let's say I want to do a simulation which is hindcast + forecast. Let's assume today is 9/30, the time is past 14 Zulu and I have:
- start is 2022-09-29 00:00:00
- rnday is 3.5
I'd like to set up the simulation using HRRR (from `hrrr3`) + GFS (either the `gfs` or `gfs2` implementation). How should I approach it? I thought the script at the top (ticket description) would be sufficient for setting up the HRRR component, but based on what you're saying I need to do something else. Can you please write a short snippet showing how to achieve this?
Thanks!
Here is the code to generate sflux1 from gfs2:
```python
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)

rnday = 2
record = 3
hgrid = Hgrid.open('../../static/hgrid.gr3', crs='epsg:4326')
pscr = '/sciclone/pscr/lcui01/GFS/'
GFS(start_date=start, rnday=rnday, pscr=pscr, record=record, bbox=hgrid.bbox)
```
This will generate `20220929/gfs_2022092900.nc` and `20220930/gfs_2022093000.nc`.

You'll link these two files into your sflux directory:

```
20220929/gfs_2022092900.nc -> sflux_air_1.0001.nc
20220929/gfs_2022092900.nc -> sflux_prc_1.0001.nc
20220929/gfs_2022092900.nc -> sflux_rad_1.0001.nc
20220930/gfs_2022093000.nc -> sflux_air_1.0002.nc
20220930/gfs_2022093000.nc -> sflux_prc_1.0002.nc
20220930/gfs_2022093000.nc -> sflux_rad_1.0002.nc
```
For sflux2 from hrrr3, the only change is `record`:

```python
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)

rnday = 2
record = 2
hgrid = Hgrid.open('../../static/hgrid.gr3', crs='epsg:4326')
pscr = '/sciclone/pscr/lcui01/HRRR/'
HRRR(start_date=start, rnday=rnday, pscr=pscr, record=record, bbox=hgrid.bbox)
```
This will generate `20220929/hrrr_2022092900.nc` and `20220930/hrrr_2022093000.nc`.

You'll link these two files into your sflux directory:

```
20220929/hrrr_2022092900.nc -> sflux_air_2.0001.nc
20220929/hrrr_2022092900.nc -> sflux_prc_2.0001.nc
20220929/hrrr_2022092900.nc -> sflux_rad_2.0001.nc
20220930/hrrr_2022093000.nc -> sflux_air_2.0002.nc
20220930/hrrr_2022093000.nc -> sflux_prc_2.0002.nc
20220930/hrrr_2022093000.nc -> sflux_rad_2.0002.nc
```
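The linking step can also be scripted. A minimal sketch (hypothetical helper, assuming the daily NetCDF files already exist on disk):

```python
from pathlib import Path

def link_sflux(files, rundir, rank):
    """Symlink daily NetCDF files into SCHISM's sflux naming scheme.

    `rank` is the sflux dataset index (1 for GFS, 2 for HRRR in the
    example above); files are linked in chronological order.
    """
    sflux = Path(rundir) / 'sflux'
    sflux.mkdir(parents=True, exist_ok=True)
    for i, f in enumerate(files, start=1):
        for var in ('air', 'prc', 'rad'):
            link = sflux / f'sflux_{var}_{rank}.{i:04d}.nc'
            if link.is_symlink():
                link.unlink()  # replace a stale link from a previous run
            link.symlink_to(Path(f).resolve())

# e.g. for the hrrr3 files above:
# link_sflux(['20220929/hrrr_2022092900.nc',
#             '20220930/hrrr_2022093000.nc'], 'RUNDIR_PATH', 2)
```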
Per the image below, SCHISM will read data as marked by the red color:
@cuill thank you for your detailed explanation.
I have some follow up questions:
- What is the role of the `record` vs. `rnday` arguments? If I remember correctly, `rnday` in SCHISM and other places in pyschism means the number of simulation days. However, here in your script it is specified as `2` as opposed to `3.5` (assuming I'm on the hour-12 cycle, so I'm doing a 1.5-day hindcast and a 2-day forecast).
- Is `2` the number of GFS or HRRR files needed for this simulation?
- Does `rnday` mean a different thing in the `hrrr3` and `gfs2` implementations vs. other places in pyschism?
- Also, does `record` in this case mean the number of simulation days in forecast mode?
Either way, as I said earlier, I think it's better to abstract this so that without the requirement for exact knowledge, the user can just do
```python
import pandas as pd
import numpy as np
from matplotlib.transforms import Bbox
from datetime import datetime, timedelta
from pyschism.forcing.nws.nws2 import gfs2, hrrr3

box = Bbox([[-70, 44], [-69, 45]])
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)
end = last_cycle + timedelta(days=2)
rndelta = end - start
rnday = rndelta.total_seconds() / 86400.

hrrr3.HRRR(start_date=start, rnday=rnday, bbox=box)
gfs2.GFS(start_date=start, rnday=rnday, bbox=box)
```
and get the correct files. In my opinion, having `rnday` and `start_date` should be sufficient to implement the logic inside pyschism to pick the right files for GFS and HRRR, without any additional information required from the user.

Ideally, I would also think about abstracting away the part where we need to link files to the actual sflux location.
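For what it's worth, a hypothetical sketch of the kind of internal planning being suggested (the function name and the 2-hour availability lag are assumptions, not pyschism API):

```python
from datetime import datetime, timedelta

def plan_sflux(start, rnday, now):
    """Hypothetical planner: derive hindcast days and the forecast cycle
    from start_date and rnday alone, so callers never pass `record`."""
    # Assume a 6-hourly cycle appears on AWS ~2 h after the cycle time
    last_cycle = (now - timedelta(hours=2)).replace(minute=0, second=0, microsecond=0)
    last_cycle -= timedelta(hours=last_cycle.hour % 6)
    end = start + timedelta(days=rnday)
    # Full days before the last available cycle come from daily hindcast files
    hindcast_days = []
    day = start
    while day + timedelta(days=1) <= min(last_cycle, end):
        hindcast_days.append(day)
        day += timedelta(days=1)
    # The remainder (up to 48 h) comes from last_cycle's extended forecast
    forecast_hours = max(0.0, (end - last_cycle).total_seconds() / 3600)
    return hindcast_days, last_cycle, forecast_hours

days, cycle, fhrs = plan_sflux(datetime(2022, 9, 29), 3.5,
                               datetime(2022, 9, 30, 14, 30))
print(days, cycle, fhrs)  # one hindcast day, then 48 h from the 12Z cycle
```

For the 3.5-day example in this thread, the planner yields one hindcast day (2022-09-29) plus 48 forecast hours from the 2022-09-30 12Z cycle, which fits within HRRR's 48-hour extended forecast.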
@SorooshMani-NOAA Thanks for your suggestion. I'll add it to the to-do list, but I don't have an ETA at the moment; I have other priorities. You are welcome to open a PR.
@cuill Sure, I understand there are other priorities. I just wanted to document our discussions and ideas. For now I don't have any immediate need for it; your scripts will help me with my use case. Thank you!