pangeo-forge-recipes icon indicating copy to clipboard operation
pangeo-forge-recipes copied to clipboard

PRISM dataset zip files containing BIL format files cannot be opened with xarray

Open pnorton-usgs opened this issue 2 years ago • 0 comments

The PRISM (https://prism.oregonstate.edu/) product provides monthly modeled climate observations for the continental United States on a 4-km grid spacing for 1895 to present. The source files are in the BIL format (see https://prism.oregonstate.edu/formats/) and are available via FTP.

I have a set of python scripts that I've used to periodically download the PRISM data and create a set of netCDF files. I thought it would useful to create a pangeo-forge recipe for this product.

The source data is organized in a directory structure by variable and year like the following:

ftp://prism.nacse.org/monthly/{variable}/{year}

Within each directory is a zip archive (PRISM_{variable}_stable_4kmM2_{year}_all_bil.zip) containing 12 months of files in the BIL format. Each BIL is composed of up to seven files:

PRISM_ppt_stable_4kmM2_200004_bil.bil
PRISM_ppt_stable_4kmM2_200004_bil.bil.aux.xml
PRISM_ppt_stable_4kmM2_200004_bil.hdr
PRISM_ppt_stable_4kmM2_200004_bil.info.txt
PRISM_ppt_stable_4kmM2_200004_bil.prj
PRISM_ppt_stable_4kmM2_200004_bil.stn.csv
PRISM_ppt_stable_4kmM2_200004_bil.stx

Xarray is able to open BIL files using the rasterio engine with something like:

ds = xr.open_dataset('PRISM_ppt_stable_4kmM2_200004_bil.bil', engine='rasterio')

I came up with a basic function to to get the path of each BIL given a variable name and date. Each yearly zip file contains multiple months of BIL files so I return a URL into the zip file for a given month.

def make_full_path(variable, time):
    url = 'prism.nacse.org'
    yyyymm = time.strftime('%Y%m')
    re_mask = re.compile(r'^PRISM_.*_stable_4kmM\d_\d\d\d\d_all_bil.zip')
    file_glob = f'PRISM_ppt_stable_4kmM*_{yyyymm}*'
    
    # Log into host
    ftp = FTP(url)
    ftp.login()

    cdir = f'monthly/{variable}/{time.year}'
    ftp.cwd(cdir)

    # Find first file that matches re_mask
    #    - the _4kmM#_ part of the filename can change
    for xx in ftp.nlst():
        cfile = re_mask.search(xx)
        
        if cfile is not None:
            file_url = f'ftp://{url}/{cdir}/{cfile.group(0)}'
            fs = ZipFileSystem(file_url)
            filename = fs.glob(file_glob)[0]
            return f'zip://{filename}::{file_url}'

For example:

make_full_path('ppt', datetime.datetime(1980, 1, 1))

returns: zip://PRISM_ppt_stable_4kmM2_198001_bil.bil::ftp://prism.nacse.org/monthly/ppt/1980/PRISM_ppt_stable_4kmM2_1980_all_bil.zip

When I try opening the dataset with fsspec, I get a time out error:

of = fsspec.open(make_full_path('ppt', datetime.datetime(1980, 1, 1)))

ds = xr.open_dataset(of, engine='rasterio')
ds

I think this is happening because rasterio needs more than just the .bil file to open the BIL format. It would appear that pange-forge can't handle this situation without some hook to extract sets of files from the zip prior to opening a single file.

I understand this is outside of what pangeo-forge currently supports. I'm not sure if this type of situation should/could even be supported. At the very least it was quite educational go down this rabbit hole.

pnorton-usgs avatar Jul 16 '22 21:07 pnorton-usgs