pyschism
`hrrr3.HRRR` fails when fetching 2 day forecast
The 6-hour cycles of HRRR (on either NOMADS or AWS) have 48-hour forecast data. However, when using `hrrr3.HRRR` in the following script, it fails to download the data:
```python
import pandas as pd
import numpy as np
from matplotlib.transforms import Bbox
from datetime import datetime, timedelta
from pyschism.forcing.nws.nws2 import hrrr3

box = Bbox([[-70, 44], [-69, 45]])
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)
end = last_cycle + timedelta(days=2)
rndelta = end - start
rnday = rndelta.total_seconds() / 86400.
hrrr3.HRRR(start_date=start, rnday=rnday, record=2, bbox=box)
```
with the following error:
```
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "LIBPATH/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "LIBPATH/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "REPOPATH/pyschism/forcing/nws/nws2/hrrr3.py", line 129, in gen_sflux
    inventory = AWSGrib2Inventory(date, record, pscr)
  File "REPOPATH/pyschism/forcing/nws/nws2/hrrr3.py", line 46, in __init__
    for obj in page['Contents']:
KeyError: 'Contents'
"""

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [59], line 1
----> 1 hrrr3.HRRR(start_date=start, rnday=rnday, record=2, bbox=box)

File REPOPATH/pyschism/forcing/nws/nws2/hrrr3.py:122, in HRRR.__init__(self, start_date, rnday, pscr, record, bbox)
    119 logger.info(f'npool is {npool}')
    120 pool = mp.Pool(npool)
--> 122 pool.starmap(self.gen_sflux, [(date, record, pscr) for date in datevector])
    123 # [self.gen_sflux(date, record, pscr) for date in datevector]
    125 pool.close()

File LIBPATH/python3.9/multiprocessing/pool.py:372, in Pool.starmap(self, func, iterable, chunksize)
    366 def starmap(self, func, iterable, chunksize=None):
    367     '''
    368     Like `map()` method but the elements of the `iterable` are expected to
    369     be iterables as well and will be unpacked as arguments. Hence
    370     `func` and (a, b) becomes func(a, b).
    371     '''
--> 372     return self._map_async(func, iterable, starmapstar, chunksize).get()

File LIBPATH/python3.9/multiprocessing/pool.py:771, in ApplyResult.get(self, timeout)
    769     return self._value
    770 else:
--> 771     raise self._value

KeyError: 'Contents'
```
Note: the line numbers might have shifted since I added some print calls to debug.
This does not fail if I try it with the old implementation, `hrrr.HRRR`. However, the limitation of NOMADS is that it only has archive data from the day before:
```python
...
from pyschism.forcing.nws import HRRR

hrrr = HRRR()
hrrr.write('RUNDIR_PATH', 2, start, rnday, True, True, True, box, True)
```
@SorooshMani-NOAA This error happens when the requested data is not available yet on aws.
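For context: the failing loop in `AWSGrib2Inventory` iterates `page['Contents']`, and an S3 `list_objects_v2` page omits the `Contents` key entirely when the requested prefix has no objects yet. A minimal sketch of a defensive pattern (hypothetical helper, not the actual pyschism code; the simulated pages stand in for a boto3 paginator):

```python
def keys_from_pages(pages):
    """Collect object keys, tolerating pages without a 'Contents' entry."""
    keys = []
    for page in pages:
        # list_objects_v2 omits 'Contents' when the prefix matches nothing,
        # which is what raises KeyError in the current code
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys

# Simulated paginator output: one populated page, one empty page
pages = [{'Contents': [{'Key': 'hrrr.t00z.wrfsfcf00.grib2'}]}, {}]
print(keys_from_pages(pages))  # ['hrrr.t00z.wrfsfcf00.grib2']
```

With `page.get('Contents', [])` the inventory would come back empty for an unavailable cycle instead of crashing the worker pool, and the caller could then report which dates are missing.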
@SorooshMani-NOAA I would suggest you use gfs2 and hrrr3 because there are issues with original gfs/hrrr data.
Below are plots for `hrrr` and `hrrr3`. As you can see, the original `hrrr` puts data from the curvilinear grid onto a rectilinear grid, which results in the white area of missing values (very large values). SCHISM cannot recognize missing values and takes them as input.
I cannot remember the exact issues with the original gfs. After comparing it with Dan's sflux, there were some big errors between the two datasets. That's the reason I switched to the original grib2 format.
@cuill, that's true, but the forecast data is actually available. For the 6-hour cycles, the data should be available on AWS about 2 hours after the actual cycle, just like on NOMADS. However, unlike the old `hrrr.HRRR` implementation, `hrrr3` doesn't look for the forecast stamps; it just tries to find files with a specific date & time stamp (as far as I understand).
So if I want a 2-day forecast, I need to look at a 6-hour cycle that has forecasts into the future. This is what I'm doing in the script above (snippet below):
```python
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)
end = last_cycle + timedelta(days=2)
```
To quote myself from the ticket you don't have access to:
...it seems that on cycles 00, 06, 12, 18 there's a 48 hour extended forecast (I didn't know this!). The same thing appears to be true in case of BDP S3 buckets for 6-hour cycles...
But the `hrrr3` implementation just looks at the date stamp (e.g. `20220930` and `t13z` in `noaa-hrrr-bdp-pds/hrrr.20220930/conus/hrrr.t13z.wrfsfcf01.grib2`), not the forecast stamp (e.g. `f01` in the same file). For HRRR hourly cycles, each time stamp (e.g. `20220930` and `t13z`) has 19 forecast files (`f00`, `f01`, ..., `f18`), while cycles 00, 06, 12, 18 (e.g. `20220930` and `t00z` or `t06z` or ...) have 49 (`f00`, `f01`, ..., `f48`). `hrrr.HRRR` correctly fetches this data, but `hrrr3` doesn't.
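To make the layout concrete, here is a small sketch (hypothetical helper; the file-name pattern is taken from the bucket path quoted above) that enumerates the forecast stamps one cycle should have:

```python
from datetime import datetime

def hrrr_forecast_keys(cycle: datetime):
    """Build the S3 keys for every forecast file of one HRRR cycle.

    Per the bucket layout described above: cycles 00/06/12/18Z extend
    to f48, while the other hourly cycles stop at f18.
    """
    max_fhr = 48 if cycle.hour % 6 == 0 else 18
    prefix = f"hrrr.{cycle:%Y%m%d}/conus"
    return [f"{prefix}/hrrr.t{cycle:%H}z.wrfsfcf{fhr:02d}.grib2"
            for fhr in range(max_fhr + 1)]

keys = hrrr_forecast_keys(datetime(2022, 9, 30, 6))
print(len(keys))   # 49
print(keys[0])     # hrrr.20220930/conus/hrrr.t06z.wrfsfcf00.grib2
```

An inventory that walks the forecast stamps like this, rather than only the date/time stamp, is essentially what `hrrr.HRRR` does and `hrrr3` currently does not.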
> I would suggest you use gfs2 and hrrr3 because there are issues with original gfs/hrrr data.
@cuill thanks for this information. I would really like to use `hrrr3`; that's the first thing I tried, but due to the issue I described in the previous comment, I cannot. I manually went to the S3 bucket and checked: the forecast file was there, but `hrrr3` wouldn't pick it up, and I assume that's because it's not looking at the forecast stamps.
Here it seems to be using just the datetime stamp: https://github.com/schism-dev/pyschism/blob/96e52fd54ff9beacc5e30e7fe2821989d0b67bd4/pyschism/forcing/nws/nws2/hrrr3.py#L36-L60
@SorooshMani-NOAA With your script, rnday is 3.5 days. HRRR has 49 forecast records.
hrrr3 considers both hindcast and forecast scenarios. For hindcast, rnday is the real run period; set record = 1 (day), which will generate rnday files, each with 24 timestamps. For forecast, if you start from today, set rnday=1 and record=2 (days), which will generate one file with 48 timestamps.
As suggested by Dan, hrrr3 can take the forecast at t00z, t06z, t12z, t18z, which has 49 points and covers a longer period than t01z, t02z, t03z, etc., which only have 19 points.
For example, if your nearest cycle is 2022093000, it will generate hrrr_2022093000.nc, likewise, if your nearest cycle is 2022093006, it will generate hrrr_2022093006.nc.
Here is the script I just used to generate `hrrr_2022093006.nc` (imports added for completeness):

```python
from time import time
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from pyschism.mesh import Hgrid
from pyschism.forcing.nws.nws2.hrrr3 import HRRR

t0 = time()
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
print(last_cycle)

rnday = 1
record = 2
hgrid = Hgrid.open('../../static/hgrid.gr3', crs='epsg:4326')
pscr = '/sciclone/pscr/lcui01/HRRR/'
hrrr = HRRR(start_date=last_cycle, rnday=rnday, pscr=pscr, record=record, bbox=hgrid.bbox)
print(f'It took {(time()-t0)/60} mins to process {record*24} files/day')
```
@cuill, thank you for the information. So are you saying that, to resolve my issue for a `rnday=3.5` run, I need to use `record=1` for the hindcast with `rnday=1.5`, then call `hrrr3` again with `record=2` for the forecast part with `rnday=1`, and it will take 2 days of forecast automatically because my cycle matches the 49-point forecast?

I'm OK with doing this in my scripts, but it is very implicit and opaque to a new user. From a usability perspective, I think it makes more sense for pyschism to take `rnday` into account internally, so that if I need 2 days of forecast I pass `rnday=2` instead of `1`. In my opinion, a user doesn't need to know the details of HRRR, GFS, etc.; they just need to know for what period they need data and how long their simulation is. Passing `1` instead of `2` for the forecast section implies that the user knows how HRRR data is organized, which defeats the purpose of having the pyschism abstraction.

Also, it might be beneficial if pyschism could automatically distinguish forecast vs. hindcast, so that I don't need to make multiple calls to `hrrr3.HRRR` to get both segments of the simulation.

Maybe this is not a high priority, since "this just works" once I know what to do with it, but I think it's an important improvement for new users down the line.
With your script, it generates the following:

```
last_cycle is 2022-09-30 12:00:00
start is 2022-09-29 00:00:00
rnday is 3.5
```
So you can set rnday = 2, because only two days (2022-09-29, 2022-09-30) are available during your cycle, and set record = 2; then it will generate `hrrr_2022092900.nc` and `hrrr_2022093000.nc`, both with 48 timestamps. The remaining 1.5 days will be covered by sflux1.
No, I would say the remaining 0.5 days will be covered by sflux1, 1 day covered by `hrrr_2022092900.nc`, and 2 days covered by `hrrr_2022093000.nc`.
I think there's a miscommunication. Let's say I want to do a simulation which is hindcast + forecast. Let's assume today is 9/30, the time is past 14 Zulu and I have:
- start is 2022-09-29 00:00:00
- rnday is 3.5
I'd like to set up the simulation using HRRR (from `hrrr3`) + GFS (either the `gfs` or `gfs2` implementation). How should I approach it? I thought the script at the top (ticket description) would be sufficient for setting up the HRRR component, but based on what you're saying I need to do something else. Can you please write a short snippet showing how to achieve this?
Thanks!
Here is the code to generate sflux1 from gfs2:
```python
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)

rnday = 2
record = 3
hgrid = Hgrid.open('../../static/hgrid.gr3', crs='epsg:4326')
pscr = '/sciclone/pscr/lcui01/GFS/'
GFS(start_date=start, rnday=rnday, pscr=pscr, record=record, bbox=hgrid.bbox)
```
This will generate `20220929/gfs_2022092900.nc` and `20220930/gfs_2022093000.nc`.

You'll link these two files into your sflux directory:

```
20220929/gfs_2022092900.nc -> sflux_air_1.0001.nc
20220929/gfs_2022092900.nc -> sflux_prc_1.0001.nc
20220929/gfs_2022092900.nc -> sflux_rad_1.0001.nc
20220930/gfs_2022093000.nc -> sflux_air_1.0002.nc
20220930/gfs_2022093000.nc -> sflux_prc_1.0002.nc
20220930/gfs_2022093000.nc -> sflux_rad_1.0002.nc
```
For sflux2 from hrrr3, the only change is `record`:

```python
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)

rnday = 2
record = 2
hgrid = Hgrid.open('../../static/hgrid.gr3', crs='epsg:4326')
pscr = '/sciclone/pscr/lcui01/HRRR/'
HRRR(start_date=start, rnday=rnday, pscr=pscr, record=record, bbox=hgrid.bbox)
```
This will generate `20220929/hrrr_2022092900.nc` and `20220930/hrrr_2022093000.nc`.

You'll link these two files into your sflux directory:

```
20220929/hrrr_2022092900.nc -> sflux_air_2.0001.nc
20220929/hrrr_2022092900.nc -> sflux_prc_2.0001.nc
20220929/hrrr_2022092900.nc -> sflux_rad_2.0001.nc
20220930/hrrr_2022093000.nc -> sflux_air_2.0002.nc
20220930/hrrr_2022093000.nc -> sflux_prc_2.0002.nc
20220930/hrrr_2022093000.nc -> sflux_rad_2.0002.nc
```
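The linking step can also be scripted. A minimal sketch (hypothetical helper, assuming the daily NetCDF files already exist on disk):

```python
from pathlib import Path

def link_sflux(files, rundir, rank):
    """Symlink daily NetCDF files into SCHISM's sflux naming scheme.

    `rank` is the sflux dataset index (1 for GFS, 2 for HRRR in the
    example above); files are linked in chronological order.
    """
    sflux = Path(rundir) / 'sflux'
    sflux.mkdir(parents=True, exist_ok=True)
    for i, f in enumerate(files, start=1):
        for var in ('air', 'prc', 'rad'):
            link = sflux / f'sflux_{var}_{rank}.{i:04d}.nc'
            if link.is_symlink():
                link.unlink()  # replace a stale link from a previous run
            link.symlink_to(Path(f).resolve())

# e.g. for the hrrr3 files above:
# link_sflux(['20220929/hrrr_2022092900.nc',
#             '20220930/hrrr_2022093000.nc'], 'RUNDIR_PATH', 2)
```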
Per the image below, SCHISM will read data as marked by the red color:
@cuill thank you for your detailed explanation.
I have some follow up questions:
- What is the role of the `record` vs. `rnday` arguments? If I remember correctly, `rnday` in SCHISM and other places in pyschism means the number of simulation days. However, here in your script it is specified as `2` as opposed to `3.5` (assuming I'm on the hour-12 cycle, so I'm doing a 1.5-day hindcast and a 2-day forecast).
- Is `2` the number of GFS or HRRR files needed for this simulation?
- Does `rnday` mean a different thing in the `hrrr3` and `gfs2` implementations vs. other places in pyschism?
- Also, does `record` in this case mean the number of simulation days in forecast mode?
Either way, as I said earlier, I think it's better to abstract this so that without the requirement for exact knowledge, the user can just do
```python
import pandas as pd
import numpy as np
from matplotlib.transforms import Bbox
from datetime import datetime, timedelta
from pyschism.forcing.nws.nws2 import gfs2, hrrr3

box = Bbox([[-70, 44], [-69, 45]])
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now - timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)
end = last_cycle + timedelta(days=2)
rndelta = end - start
rnday = rndelta.total_seconds() / 86400.

hrrr3.HRRR(start_date=start, rnday=rnday, bbox=box)
gfs2.GFS(start_date=start, rnday=rnday, bbox=box)
```
and get the correct files. In my opinion, having `rnday` and `start_date` should be sufficient to implement the logic inside pyschism to pick the right files for GFS and HRRR, without any additional information required from the user.

Ideally, I would also think about abstracting away the part where we need to link files to the actual sflux location.
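For what it's worth, a hypothetical sketch of the kind of internal planning being suggested (the function name and the 2-hour availability lag are assumptions, not pyschism API):

```python
from datetime import datetime, timedelta

def plan_sflux(start, rnday, now):
    """Hypothetical planner: derive hindcast days and the forecast cycle
    from start_date and rnday alone, so callers never pass `record`."""
    # Assume a 6-hourly cycle appears on AWS ~2 h after the cycle time
    last_cycle = (now - timedelta(hours=2)).replace(minute=0, second=0, microsecond=0)
    last_cycle -= timedelta(hours=last_cycle.hour % 6)
    end = start + timedelta(days=rnday)
    # Full days before the last available cycle come from daily hindcast files
    hindcast_days = []
    day = start
    while day + timedelta(days=1) <= min(last_cycle, end):
        hindcast_days.append(day)
        day += timedelta(days=1)
    # The remainder (up to 48 h) comes from last_cycle's extended forecast
    forecast_hours = max(0.0, (end - last_cycle).total_seconds() / 3600)
    return hindcast_days, last_cycle, forecast_hours

days, cycle, fhrs = plan_sflux(datetime(2022, 9, 29), 3.5,
                               datetime(2022, 9, 30, 14, 30))
print(days, cycle, fhrs)  # one hindcast day, then 48 h from the 12Z cycle
```

For the 3.5-day example in this thread, the planner yields one hindcast day (2022-09-29) plus 48 forecast hours from the 2022-09-30 12Z cycle, which fits within HRRR's 48-hour extended forecast.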
@SorooshMani-NOAA Thanks for your suggestion. I'll add it to the to-do list, but I don't have an ETA at the moment; I have other priorities. You are welcome to open a PR.
@cuill Sure, I understand there are other priorities. I just wanted to document our discussions and ideas. For now I don't have any immediate need for it; your scripts will help me with my use case. Thank you!