pyart icon indicating copy to clipboard operation
pyart copied to clipboard

BUG: Read_grid can't read output from write_grid

Open jaimeherriott opened this issue 1 year ago • 9 comments
trafficstars

  • Py-ART version: 1.16.0
  • Python version: 3.11.5
  • Operating System: MAC OS

Description

I’m having issues with the time dimension reading in improperly. Details below:

One of the NetCDF files being used can be found here: https://drive.google.com/file/d/1s1Bh2xXuyOXD3MztmC-xFnskBjQFq8Wn/view?usp=sharing

NetCDF files were created using the function pyart.core.Grid.write(), e.g., ncGrid = pyart.map.grid_from_radars(umassxpol, grid_shape=(41,241,241), grid_limits=((0,10000),(0,60000),(-30000,30000)), weighting_function=‘BARNES2’,gatefilters=gf_despeckeld) ncGrid.write(‘V08.nc’)

Running “ncdump -h V08.nc” shows that the time dimension is 1: ncdump -h V08.nc netcdf V08 { dimensions: time = UNLIMITED ; // (1 currently) z = 41 ; y = 241 ; x = 241 ; nradar = 1 ; nradar_str_length = 8 ; …}

However, when we try to read in the file V08.nc using pyart.io.read_grid(‘V08.nc’), the following error message is produced for every variable in the file, and no data are read in. UserWarning: Field ZD skipped due to incorrect shape (2875, 41, 241, 241)

Describe what you were trying to get done. I'm just trying to read in .nc files into a Jupyter Notebook.

Tell us what happened, what went wrong, and what you expected to happen. read_grid seems to think the time dimension is 2875, not 1. I have no idea where the value 2875 comes from, as it doesn’t correspond to any dimension in the original data.

The problem seems to be confined specifically to this data set. The workflow works fine on other data sets from the same radar.

It’s impractical to produce a minimum working example (MWE) for this issue because it would require uploading all the radar files somewhere, and they’re quite voluminous (1.7 G). Sharing a single example file that’s causing the problem (V08.nc, 14 MB) might be more practical.

The output of both ncGrid.time and ncGrid.radar_time only have 1 dimension (i.e. a single time.)

What I Did

Paste the command(s) you ran and the output.
# Where are the data?
CFpath = "/Volumes/depot/rtanama/share/jherrio/woodward_20160523/umassxpol/cfradial/20160524/"
#CFpath = "/Users/rtanama/Desktop/solodata/data/workstation/cases/20070504/radar/umassxpol/cfradial_from_swp_dissertation_clockfix/20070506/"
CFFiles = glob.glob(CFpath + "*.nc")
CFFiles.sort()
CFFiles

# What are the start times of each CF/Radial file?
CFStartTimes = []
for c in CFFiles:
    #print(c.split("/")[-1][-49:-43])
    s = c.split("/")[-1][-49:-43]
    CFStartTime = dt.datetime(2016, 5, 24, int(s[0:2]), int(s[2:4]), int(s[4:6]))
    #print(volStartTime)
    CFStartTimes.append(CFStartTime)
    del CFStartTime, s
CFStartTimes # Print the file start times

ds = xr.load_dataset(CFFiles[0])
ds
If there was a crash, please include the traceback here.

jaimeherriott avatar Jan 29 '24 16:01 jaimeherriott

@jaimeherriott I began to take a look, yeah the file you shared I can only open with xarray open dataset. With that and setting decode_times=False I can read in the data with xarray and see the time having 2875. I'm digging through py-art to see where maybe the time could have been set wrong.

You mentioned that the raw files etc are too large, but can you share what the time variable looks like in the raw file for the problem file?

Gridding with Py-ART creates a np.array of the first radar's time data and that first time in that file and outputs that using netcdf, the radar time variable saved is created using the function which is strange that it's still picking up 2875: https://github.com/ARM-DOE/pyart/blob/8a4fd837a615f0bbb34e028de7ba7a76db0fd666/pyart/map/grid_mapper.py#L207

the time i'm seeing for that is 0.875 so i'm wondering if something in the time units went wrong in that specific file. I'm still digging into the py-art code as well. To see how 2875 could have arised.

zssherman avatar Feb 02 '24 23:02 zssherman

Might be important: we're using the join_radar utility to merge several sweeps into a volume scan prior to gridding. I've excerpted the relevant bits of @jaimeherriott's code below:

    # volInds corresponds to the indices of CFFiles that contain data between the start time of the first volume and the start of the next one
    print(volStartTimes[volNum])
    volInds = np.where([ (c >= volStartTimes[volNum]) & (c < volStartTimes[volNum + 1]) for c in CFStartTimes])[0]
    print(volInds)   # Which indices correspond to these volumes?
    for i in volInds: # Loop over each sweep in the volume:
        print("Reading: ", CFFiles[i])
        if i == volInds.min():   # First sweep in a volume
            umassxpol = pyart.io.read_cfradial(CFFiles[i])
        else:
            tmp = pyart.io.read_cfradial(CFFiles[i])
            umassxpol = pyart.util.join_radar(umassxpol, tmp)
    # Have to set sweep numbers manually; otherwise they will all be 1
    umassxpol.sweep_number['data'] = np.ma.core.MaskedArray(np.arange(len(volInds)), dtype = 'int32')
    # Now, grid the data
    ncGrid = pyart.map.grid_from_radars(umassxpol, grid_shape=(41,241,241),
                            grid_limits=((0,10000),(0,60000),(-30000,30000)),
                            weighting_function='BARNES2',gatefilters=gf_despeckeld)
    ncGrid.write('V%02d.nc'%volNum)

You've probably already figured this out, but the number 2875 is the output of the netCDF4 module function _ncvar_to_dict, which is getting passed up the chain to lines 115-121 of grid_io.py.

rtanamachi avatar Feb 07 '24 16:02 rtanamachi

Along @rtanamachi 's point - are you joining any of these fields prior to building a grid? I suspect this is where the large increase in the time dimension is occurring.

mgrover1 avatar Feb 13 '24 13:02 mgrover1

I looked at the raw output of a few of the .nc files in terminal, and they were all showing time dimensions of 2875.

jaimeherriott avatar Feb 14 '24 15:02 jaimeherriott

@jaimeherriott - do you have the original radar data here? Not gridded?

mgrover1 avatar Feb 14 '24 16:02 mgrover1

@mgrover1 I have sent you an MWE separately via Google Drive so as not to expose the entire data set to the universe. I hope this is okay!

rtanamachi avatar Feb 15 '24 22:02 rtanamachi

Great! Thank you!!

mgrover1 avatar Feb 15 '24 22:02 mgrover1

I am digging into this and found a possible issue - it appears that the times are coming in as unmasked upon io. The fields are dropped, and the times are read in.

I tried the same workflow with existing sample data, and the grid io worked well - writing/reading worked.

More investigation is needed here, and we need better error checking of the input data before going to write out a grid.

My suggestion for now would be as follows:

ds = ncGrid.to_xarray()
ds.to_netcdf("your-gridded-field.nc")

The conversion to xarray works well - without running into the time issue mentioned here. This will still result in a netCDF file with the grid, and I wonder if moving toward this approach instead of netCDF4 directly would be a better option, allowing users to specify their desired output format (ex. Netcdf, Zarr, etc.).

Thanks for raising this issue @rtanamachi @jaimeherriott . I will follow up here once I make more progress debugging the IO. I am out on vacation next week, so it might be a couple weeks until I am able to revisit.

mgrover1 avatar Feb 16 '24 20:02 mgrover1

Thank you for the suggestion @mgrover1 ! I tried the above suggestion and now I have the following error:

KeyError                                  Traceback (most recent call last)
Cell In[19], line 1
----> 1 ncGrid = pyart.io.read_grid('V08.nc', include_fields = ['ZD_corr'])

File ~/anaconda3/envs/marching_cubes_20230919/lib/python3.11/site-packages/pyart/io/grid_io.py:82, in read_grid(filename, exclude_fields, include_fields, **kwargs)
     80 # required reserved variables
     81 time = _ncvar_to_dict(dset.variables["time"])
---> 82 origin_latitude = _ncvar_to_dict(dset.variables["origin_latitude"])
     83 origin_longitude = _ncvar_to_dict(dset.variables["origin_longitude"])
     84 origin_altitude = _ncvar_to_dict(dset.variables["origin_altitude"])

KeyError: 'origin_latitude' 

jaimeherriott avatar Mar 01 '24 19:03 jaimeherriott

Hi @mgrover1 just waking this back up. @rtanamachi says _ncvar_to_dict is part of the netcd4 reader for the error above. Why would it be able to read in time and not origin_latitude, etc?

jaimeherriott avatar Apr 05 '24 17:04 jaimeherriott

@jaimeherriott @rtanamachi - can you please provide the raw cfradial files being used here? This would help with reproducing the issue here

mgrover1 avatar Apr 30 '24 15:04 mgrover1

@mgrover1 I have sent this to your email with a google drive link for access. Let me know if you need anything further. Thanks!

jaimeherriott avatar Apr 30 '24 23:04 jaimeherriott

@jaimeherriott @rtanamachi - I tracked down the issue. The input files you have include latitude/longitude/altitude values at each gate, as opposed to just a single value for the radar.

These latitude/longitude values of the radar object are then used with the gridding to define the origin latitude and origin longitude. This causes issues when time is an unlimited dimension, and those longer arrays (on the order of ~1000) are written to the file, leading to a much longer time dimension.

The fix here would be make sure your latitude/longitude/altitude values for your radar of length 1 (adding the following to your workflow):

umassxpol.latitude["data"] = umassxpol.latitude["data"][:1]
umassxpol.longitude["data"] = umassxpol.longitude["data"][:1]
umassxpol.altitude["data"] = umassxpol.altitude["data"][:1]

We need a more robust check here within the gridding routine/grid object, but this is where the errors are popping up. It does not run into this issue with regular to_xarray() since these origin_ fields are not returned to the dataset object, so you are not writing those to disk.

mgrover1 avatar May 01 '24 20:05 mgrover1