pangeo-forge-recipes
pangeo-forge-recipes copied to clipboard
PRISM dataset zip files containing BIL format files cannot be opened with xarray
The PRISM (https://prism.oregonstate.edu/) product provides monthly modeled climate observations for the continental United States on a 4-km grid spacing for 1895 to present. The source files are in the BIL format (see https://prism.oregonstate.edu/formats/) and are available via FTP.
I have a set of python scripts that I've used to periodically download the PRISM data and create a set of netCDF files. I thought it would useful to create a pangeo-forge recipe for this product.
The source data is organized in a directory structure by variable and year like the following:
ftp://prism.nacse.org/monthly/{variable}/{year}
Within each directory is a zip archive (PRISM_{variable}_stable_4kmM2_{year}_all_bil.zip
) containing 12 months of files in the BIL format. Each BIL is composed of up to seven files:
PRISM_ppt_stable_4kmM2_200004_bil.bil
PRISM_ppt_stable_4kmM2_200004_bil.bil.aux.xml
PRISM_ppt_stable_4kmM2_200004_bil.hdr
PRISM_ppt_stable_4kmM2_200004_bil.info.txt
PRISM_ppt_stable_4kmM2_200004_bil.prj
PRISM_ppt_stable_4kmM2_200004_bil.stn.csv
PRISM_ppt_stable_4kmM2_200004_bil.stx
Xarray is able to open BIL files using the rasterio engine with something like:
ds = xr.open_dataset('PRISM_ppt_stable_4kmM2_200004_bil.bil', engine='rasterio')
I came up with a basic function to to get the path of each BIL given a variable name and date. Each yearly zip file contains multiple months of BIL files so I return a URL into the zip file for a given month.
def make_full_path(variable, time):
url = 'prism.nacse.org'
yyyymm = time.strftime('%Y%m')
re_mask = re.compile(r'^PRISM_.*_stable_4kmM\d_\d\d\d\d_all_bil.zip')
file_glob = f'PRISM_ppt_stable_4kmM*_{yyyymm}*'
# Log into host
ftp = FTP(url)
ftp.login()
cdir = f'monthly/{variable}/{time.year}'
ftp.cwd(cdir)
# Find first file that matches re_mask
# - the _4kmM#_ part of the filename can change
for xx in ftp.nlst():
cfile = re_mask.search(xx)
if cfile is not None:
file_url = f'ftp://{url}/{cdir}/{cfile.group(0)}'
fs = ZipFileSystem(file_url)
filename = fs.glob(file_glob)[0]
return f'zip://{filename}::{file_url}'
For example:
make_full_path('ppt', datetime.datetime(1980, 1, 1))
returns:
zip://PRISM_ppt_stable_4kmM2_198001_bil.bil::ftp://prism.nacse.org/monthly/ppt/1980/PRISM_ppt_stable_4kmM2_1980_all_bil.zip
When I try opening the dataset with fsspec, I get a time out error:
of = fsspec.open(make_full_path('ppt', datetime.datetime(1980, 1, 1)))
ds = xr.open_dataset(of, engine='rasterio')
ds
I think this is happening because rasterio needs more than just the .bil file to open the BIL format. It would appear that pange-forge can't handle this situation without some hook to extract sets of files from the zip prior to opening a single file.
I understand this is outside of what pangeo-forge currently supports. I'm not sure if this type of situation should/could even be supported. At the very least it was quite educational go down this rabbit hole.