
better handling of invalid files in open_mfdataset

Open vnoel opened this issue 3 years ago • 4 comments

Is your feature request related to a problem?

Suppose I'm trying to read a large number of netCDF files with open_mfdataset.

Now suppose that one of those files is for some reason incorrect -- for instance there was a problem during the creation of that particular file, and its file size is zero, or it is not valid netCDF. The file exists, but it is invalid.

Currently open_mfdataset raises an exception with the message `ValueError: did not find a match in any of xarray's currently installed IO backends`.

As far as I can tell, there is currently no way to identify which of the files being read is the source of the problem. If there are several hundred files, finding the problematic ones is a task in itself, even though xarray presumably knows which ones failed.

Describe the solution you'd like

It would be most useful to this particular user if the error message could somehow identify the file(s) responsible for the exception.

Apart from better reporting, I would find it very useful if I could pass to open_mfdataset some kind of argument that would make it ignore invalid files altogether (ignore_invalid=False comes to mind).
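Until something like that exists, a user-space approximation is possible (a minimal sketch; `ignore_invalid` is not an actual xarray argument, and the helper name here is made up):

```python
import warnings
import xarray as xr

def open_mfdataset_skip_invalid(paths, **kwargs):
    """Open only the files xarray can actually read, warning about the rest.

    A user-space stand-in for the proposed ignore_invalid behaviour,
    not an existing xarray API.
    """
    valid = []
    for path in paths:
        try:
            # Probing each file individually pinpoints the bad ones,
            # at the cost of opening every file twice.
            xr.open_dataset(path).close()
            valid.append(path)
        except (ValueError, OSError):
            warnings.warn(f"skipping invalid file: {path}")
    return xr.open_mfdataset(valid, **kwargs)

# Hypothetical usage:
# ds = open_mfdataset_skip_invalid(sorted(glob.glob("data/*.nc")), combine="by_coords")
```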

Describe alternatives you've considered

No response

Additional context

No response

vnoel avatar Jun 29 '22 08:06 vnoel

> It would be most useful to this particular user if the error message could somehow identify the file(s) responsible for the exception.

+1. You could make this change to open_dataset and it would then surface through open_mfdataset too. Attempting to read a bad netCDF file is a common source of trouble, so an error saying something like

Reading file XXX failed. The file is possibly corrupted, or the file path is wrong.

would be quite helpful!
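As a rough sketch of where such a check could live, a thin wrapper around open_dataset (the wrapper name and exact wording are illustrative only, not an actual patch):

```python
import xarray as xr

def open_dataset_with_context(path, **kwargs):
    """Re-raise open_dataset failures with the offending file named."""
    try:
        return xr.open_dataset(path, **kwargs)
    except (ValueError, OSError) as err:
        raise ValueError(
            f"Reading file {path} failed. The file is possibly corrupted, "
            "or the file path is wrong."
        ) from err
```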

> I would find it very useful if I could pass to open_mfdataset some kind of argument that would make it ignore invalid files altogether (ignore_invalid=False comes to mind).

This I'm not sure about because a user wouldn't know if they were missing some data in the middle...

dcherian avatar Jun 29 '22 15:06 dcherian

My vote is to have both: a warning, and an option to fill missing data with NaNs. My use case:

I have an archive of 15 years of monthly forecasts. For one month, one of the ensemble members is missing. I am converting the binary format to Zarr. The code is:

```python
ds = xr.open_mfdataset(
    paths,
    engine=BinaryBackend,   # custom backend, see the tutorial linked below
    dtype=np.float32,       # passed through to the custom backend
    combine="nested",
    concat_dim=(ensmem_ix, fcsttime_ix, reftime_ix),
    parallel=False,
).rename_vars(foo="sic")
```

Currently, my only option is to remove the data files for the remaining ensemble members before processing. Since I have to use a custom backend (based on https://github.com/aurghs/xarray-backend-tutorial/tree/main), I tried to add code that returns an array filled with NaNs when np.fromfile() fails. That, however, is not enough: the missing file is also accessed in _chunk_ds() in xarray/backends/api.py to create a token for dask. That could easily be handled by adding a try ... except block there.
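For illustration, the fallback idea might look like this inside the custom backend's raw reader (a minimal sketch; `read_raw_or_nan`, its `shape` argument, and the caught exceptions are hypothetical, not part of any existing backend):

```python
import numpy as np

def read_raw_or_nan(path, shape, dtype=np.float32):
    """Hypothetical helper for a custom binary backend.

    If the file is missing, empty, or truncated, return a NaN-filled
    array of the expected shape instead of raising.
    """
    try:
        data = np.fromfile(path, dtype=dtype)
        # A truncated or zero-size file yields too few values; reshape
        # then raises ValueError and we fall through to the NaN fill.
        return data.reshape(shape)
    except (OSError, ValueError):
        return np.full(shape, np.nan, dtype=dtype)
```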

yt87 avatar Jul 09 '23 23:07 yt87

I am also processing multiple files, and some of them are invalid. It is painful to either remove or work around these files by hand. I support an ignore_invalid=False argument. In case of missing data, just a warning message would be fine.

It would be great if this enhancement gets implemented.

pratiman-91 avatar Oct 18 '24 01:10 pratiman-91

Contributions welcome!

max-sixty avatar Oct 18 '24 01:10 max-sixty