intake-esm

Proper way to handle failing `preprocess` output.

Open jbusecke opened this issue 3 years ago • 2 comments

I am encountering an issue with one dataset when loading many CMIP6 datasets using intake-esm (see #331).

I believe this is actually an issue with the raw data, but either way it got me curious if there is a way to handle the following scenario properly:

Let's say I have two datasets (ds_a, ds_b) in two different zarr stores and an appropriately set up intake-esm catalog, plus some preprocessing function func.

func modifies something on each dataset; it works fine on ds_a but fails on ds_b. Currently that leads to a complete failure when reading in the full catalog with .to_dataset_dict().

Is there a way to simply exclude the failing dataset and continue processing the ones that work? This would be very helpful to me.

EDIT: In further investigating this, it seems that in #331 the preprocessing is not even needed, but I guess this question can be phrased more generally: Is there a way to still output some datasets if errors are coming up for some of them?

jbusecke, Apr 09 '21 14:04

@jbusecke,

Yes, I think this is doable. We could make this an optional setting that the user opts into. To make sure that failures don't pass silently, we could raise a warning that lists which keys failed.

Do you have suggestions on what the API would look like? I am imagining something along these lines:

col.to_dataset_dict(...., errors='ignore')

or

col.to_dataset_dict(...., skip_erroneous_datasets=True)
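To make the intended behaviour concrete, here is a minimal sketch of the opt-in semantics discussed above. It is plain Python and not intake-esm's implementation: the `load_all` helper, the `loaders` mapping, and the `skip_errors` flag are all hypothetical names chosen for illustration. Each loader stands in for "open one store and run `preprocess` on it".

```python
import warnings


def load_all(loaders, skip_errors=False):
    """Call every loader and collect the results by key.

    `loaders` maps a dataset key to a zero-argument callable that returns
    the loaded (and preprocessed) dataset. All names here are hypothetical.
    With skip_errors=False the first failure propagates, as it does today.
    With skip_errors=True failing keys are dropped and reported in a warning.
    """
    datasets = {}
    failed = {}
    for key, loader in loaders.items():
        try:
            datasets[key] = loader()
        except Exception as exc:
            if not skip_errors:
                raise
            failed[key] = exc
    if failed:
        # Never skip silently: tell the user which keys were dropped.
        warnings.warn(
            f"Skipped {len(failed)} dataset(s) that failed to load: {sorted(failed)}"
        )
    return datasets


def broken_loader():
    # Stand-in for a dataset whose preprocessing step fails.
    raise ValueError("preprocess failed")


# Stand-ins for ds_a (loads fine) and ds_b (fails).
loaders = {"ds_a": lambda: {"tas": [1, 2, 3]}, "ds_b": broken_loader}
```

Calling `load_all(loaders, skip_errors=True)` returns only the `ds_a` entry and emits a warning naming `ds_b`, while the default call re-raises the original `ValueError`.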

andersy005, Apr 15 '21 14:04

How about skip_errors=True? It's a happy medium?

jbusecke, Apr 15 '21 15:04