better handling of invalid files in open_mfdataset
Is your feature request related to a problem?
Suppose I'm trying to read a large number of netCDF files with open_mfdataset.
Now suppose that one of those files is incorrect for some reason -- for instance, a problem during its creation left it with a file size of zero, or it is not valid netCDF. The file exists, but it is invalid.
Currently open_mfdataset will raise an exception with the message
ValueError: did not find a match in any of xarray's currently installed IO backends
As far as I can tell, there is currently no way to identify which one(s) of the files being read are the source of the problem. If there are several hundred files, finding the problematic ones is a task in itself, even though xarray probably knows which they are.
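The only workaround I can think of is a manual pre-scan along these lines (a minimal sketch; paths is the same list of files I would pass to open_mfdataset):

import xarray as xr

# Pre-scan: try to open each file individually and record the ones that fail.
bad_files = []
for path in paths:
    try:
        xr.open_dataset(path).close()
    except Exception as exc:
        bad_files.append((path, exc))
print(bad_files)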
Describe the solution you'd like
It would be most useful to this particular user if the error message could somehow identify the file(s) responsible for the exception.
Apart from better reporting, I would find it very useful if I could pass to open_mfdataset some kind of argument that would make it ignore invalid files altogether (ignore_invalid=False comes to mind).
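As a sketch, the proposed argument might be used like this (ignore_invalid is hypothetical and does not exist in xarray today):

# Hypothetical flag: skip unreadable files (ideally with a warning for each) instead of raising.
ds = xr.open_mfdataset(paths, ignore_invalid=True)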
Describe alternatives you've considered
No response
Additional context
No response
It would be most useful to this particular user if the error message could somehow identify the file(s) responsible for the exception.
+1. You could make this change in open_dataset and it would be raised from open_mfdataset too. Attempting to read a bad netCDF file is a common source of trouble. So an error saying something like
Reading file XXX failed. The file is possibly corrupted, or the file path is wrong.
would be quite helpful!
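A rough sketch of the kind of wrapping that could produce that message (the wrapper here is hypothetical; the real change would live inside open_dataset itself):

import xarray as xr

def open_dataset_verbose(path, **kwargs):
    # Hypothetical wrapper: attach the offending file name to the error.
    try:
        return xr.open_dataset(path, **kwargs)
    except ValueError as exc:
        raise ValueError(
            f"Reading file {path!r} failed. The file is possibly corrupted, "
            "or the file path is wrong."
        ) from exc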
I would find it very useful if I could pass to open_mfdataset some kind of argument that would make it ignore invalid files altogether (ignore_invalid=False comes to mind).
This I'm not sure about because a user wouldn't know if they were missing some data in the middle...
My vote is to have both: a warning, and an option to fill missing data with NaNs. My use case:
I have an archive of 15 years of monthly forecasts. For one month, one of the ensemble members is missing. I am converting the binary format to zarr. The code is:
import numpy as np
import xarray as xr

# paths, BinaryBackend, and the *_ix dimension names are defined elsewhere
ds = xr.open_mfdataset(
    paths,
    engine=BinaryBackend,
    dtype=np.float32,
    combine="nested",
    concat_dim=(ensmem_ix, fcsttime_ix, reftime_ix),
    parallel=False,
).rename_vars(foo="sic")
Currently, my only option is to remove the remaining ensemble member data files for that month before processing. Since I have to use a custom backend (based on https://github.com/aurghs/xarray-backend-tutorial/tree/main), I tried adding code to return an array filled with NaNs when np.fromfile() fails. That, however, is not enough: the missing file is also accessed in _chunk_ds() in xarray/backends/api.py to create a token for dask. That could easily be handled by adding a try ... except block.
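For reference, the NaN fallback in the backend looks roughly like this (a simplified sketch: the shape, dtype, and variable name are placeholders for my actual binary layout, and this does not fix the _chunk_ds() access mentioned above):

import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint

class BinaryBackend(BackendEntrypoint):
    # Simplified from https://github.com/aurghs/xarray-backend-tutorial;
    # shape and dtype are hard-coded placeholders here.
    def open_dataset(self, filename_or_obj, *, drop_variables=None,
                     dtype=np.float32, shape=(180, 360)):
        try:
            data = np.fromfile(filename_or_obj, dtype=dtype).reshape(shape)
        except (OSError, ValueError):
            # Missing or truncated file: fall back to an all-NaN array.
            data = np.full(shape, np.nan, dtype=dtype)
        return xr.Dataset({"foo": (("y", "x"), data)})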
I am also processing multiple files, and some of them are invalid. It is painful to have to remove or skip these files by hand. I am in favour of an ignore_invalid=False argument. In the case of missing data, just a warning message would be fine.
It would be great if this enhancement gets implemented.
Contributions welcome!