xarray
Open mfdataset enhancement
- [x] Closes #6736
- [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
Added a new argument to open_mfdataset to better handle invalid files.
errors : {'ignore', 'raise', 'warn'}, default 'raise'
- If 'raise', an invalid dataset will raise an exception.
- If 'ignore', invalid datasets will be ignored.
- If 'warn', a warning will be issued for each invalid dataset.
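A minimal usage sketch of the new argument (assuming a directory data/ of netCDF files, some of which may be unreadable):

import xarray as xr

# With errors="warn", unreadable files are skipped with a warning
# instead of aborting the whole open_mfdataset call.
ds = xr.open_mfdataset(
    "data/*.nc",
    combine="by_coords",
    errors="warn",
)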
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. If you have questions, some answers may be found in our contributing guidelines.
I'm not the expert, but this looks reasonable! Any other thoughts?
Assuming no one thinks it's a bad idea, we would need tests.
@headtr1ck Thanks for the suggestions. I have added two tests (ignore and warn). Also, while testing, I found that the new argument broke combine="nested" due to invalid ids. I have now modified it to produce the correct ids, and it passes the tests. Please review the tests and the latest version.
Hi @headtr1ck, I have been thinking about the handling of ids. The current version looks like patchwork (I am not happy with it). I think we can create the ids after removing all the invalid datasets from paths1d within the combine="nested" block. Please let me know what you think.
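Roughly the idea, as a hypothetical sketch for a 2D nested input (the names nested, bad_paths, paths1d, and ids are illustrative, not the PR's actual code): drop the unreadable files from the nested list itself, then rebuild paths1d and ids together so they stay in sync.

nested = [["1.nc", "2.nc"], ["3.nc", "4.nc"]]
bad_paths = {"3.nc"}  # files that failed to open (assumed already known)

# filter the nested structure first, then derive the flat list and ids
nested = [[p for p in row if p not in bad_paths] for row in nested]
paths1d = [p for row in nested for p in row]
ids = [(i, j) for i, row in enumerate(nested) for j, _ in enumerate(row)]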
Thanks!
@max-sixty Can you please go through the PR. Thanks!
I'm admittedly much less familiar with this section of the code. Nothing seems wrong though!
I think we should bias towards merging, so if no one has concerns then I'd vote to merge
could we fix the errors in the docs?
It seems like one test failed: test_sparse_dask_dataset_repr (xarray.tests.test_sparse.TestSparseDataArrayAndDataset). It is not related to this PR.
@dcherian @max-sixty Can you please help me merge this request? I am not sure what more needs to be done.
Also, every day I am getting this message. Not sure how to resolve this.
can we fix the errors in the docs?
I just rebased the branch. Let us see if this resolves the docs failure. Ah, found the problem.
@max-sixty all tests passed.
@kmuehlbauer Thank you for the suggestions and for fixing the typos. I have made changes based on your suggestions. Please check.
For the tests, I did not write a test for "raise" since it is the default and already being tested by various other tests. If you think it would be beneficial to add another test, then I can write it.
Yes, you are right, this is and was the default. No need to add another test.
@pratiman-91 I've thought a bit about removing the problematic files for the nested case.
To my understanding we have two corresponding 1D lists (ids and paths1d).
Wouldn't it work to just remember the indices for the failing files and remove those from the existing lists afterwards? Along the lines of:
remove = []
for i, path in enumerate(paths1d):
    try:
        open(...)  # open the dataset for this path
    except Exception:
        if combine == "nested":
            remove.append(i)

# later, in the combine="nested" branch
if remove:
    for i in sorted(remove, reverse=True):
        del ids[i]
        del paths1d[i]
Update: we could even use remove for the checks, no need for another variable.
@kmuehlbauer I thought about this approach before. However, it creates a problem with the 2x2 nested case. Therefore, I settled on recreating the IDs and paths. You can also check here: https://github.com/pydata/xarray/pull/9955/commits/3bfaaee6e61b782b621d8d5de584c5545f55993c
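For illustration (a hypothetical sketch, not the PR's code), this is what goes wrong with index-based removal on a 2x2 input:

import itertools

# Flatten a 2x2 nested input the way combine="nested" bookkeeping does,
# with ids as hypercube coordinates.
nested = [["1.nc", "2.nc"], ["broken.nc", "4.nc"]]
paths1d = [p for row in nested for p in row]
ids = list(itertools.product(range(2), range(2)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]

# Deleting the failing entry keeps both lists in sync, but the surviving
# ids [(0, 0), (0, 1), (1, 1)] no longer fill a full 2x2 grid, so the
# later hypercube consistency check fails.
del paths1d[2]
del ids[2]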
Thanks for the pointer @pratiman-91. Yes, this has a twist.
I see this is a special case for the nested combine. All our good faith doesn't help for these cases. If the user provides a 2x2 or NxM nested list, or whatever, and one or more files are missing in one or more of the sublists, the whole thing explodes. So I'd rather go with "raise" in those cases, having the user see ValueError: The supplied objects do not form a hypercube because sub-lists do not have consistent lengths along dimension X.
Maybe others have some opinion here?
@kmuehlbauer @max-sixty Any update on this PR? Thanks!
-Pratiman
@headtr1ck Would you mind having one last look here? I'm not able to merge this without your interaction with respect to a requested change. Thanks!
I'll close and reopen, just to check how this affects the merge rules.
Sorry for the late reply, currently on holidays.
I am not sure about this approach.
For 1D lists it works fine (even though users might want the option to have NaNs in the positions of missing values, but due to xarray's indexing this should not be an issue).
But for ND nested lists, this PR basically only works if the user supplies an additional file in place of the broken file. This implies that the user already knows that one file is broken. So I wonder if this is useful at all?
I think the main use case is the following:
[["1.nc", "2.nc"], ["broken.nc", "4.nc"]]
Now the question is what should actually happen here... I think the only way out is to fill in the gap with NaNs.
But maybe this is supposed to be an additional PR?
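To picture the NaN-filling idea (purely a sketch of a workaround a user could do today, nothing implemented in this PR; it assumes the broken file raises OSError on open):

import xarray as xr

datasets = []
for path in ["1.nc", "2.nc", "broken.nc", "4.nc"]:
    try:
        datasets.append(xr.open_dataset(path))
    except OSError:
        continue  # skip the unreadable file

# An outer join leaves NaNs where the skipped file's data would have been.
combined = xr.combine_by_coords(datasets, join="outer")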
Thanks @headtr1ck, I've had similar concerns here: https://github.com/pydata/xarray/pull/9955#discussion_r2115997640.
@pratiman-91's argument was that if the provided nested list leads to a consistent output (after removal of any breaking file), it should return without erroring out. It's hard to imagine a real-life use case where and how this could happen. @pratiman-91, can you please elaborate on your use case here?
I think the main use case is the following:
[["1.nc", "2.nc"], ["broken.nc", "4.nc"]]
Now the question is what should actually happen here... I think the only way out is to fill in the gap with NaNs. But maybe this is supposed to be an additional PR?
I'd totally support this approach, although it does not seem an easy task. But it should go in a follow-up PR.
From my side, this PR is still a first step forward, giving the user a bit more freedom when opening/combining nested lists of files.
Thanks, @headtr1ck and @kmuehlbauer.
I think the main use case is the following:
[["1.nc", "2.nc"], ["broken.nc", "4.nc"]]
Now the question is what should actually happen here... I think the only way out is to fill in the gap with NaNs.
In this case, two things will happen:
- If removing broken.nc results in a logically valid dataset, then we proceed with the rest.
- If it still doesn't form a valid dataset, an error should be raised.
Filling with NaNs introduces its own set of challenges. For instance, if the first file itself is broken, there's no reliable source for metadata needed to correctly generate NaNs. This approach becomes significantly more complex than simply skipping the broken file and would require a separate PR to handle properly.
@headtr1ck
Some minor changes are still required.
I have made changes based on your suggestions.
Another question: what happens now if someone passes e.g. a 2x2 list of files where one is broken?
Because as far as I can tell, if errors="ignore" this file will be silently removed, but then later on the dataset cannot be constructed and quite likely will throw an error that will confuse the user.
I agree, that would be the case. An important assumption is that removing the files does not affect the overall validity of the datasets. I think it should be up to the user to use that option.
@headtr1ck Can you please review this PR? Thanks!
You need to merge main and resolve the conflicts
Another question: what happens now if someone passes e.g. a 2x2 list of files where one is broken? Because as far as I can tell, if errors="ignore" this file will be silently removed, but then later on the dataset cannot be constructed and quite likely will throw an error that will confuse the user.

I agree, that would be the case. An important assumption is that removing the files does not affect the overall validity of the datasets. I think it should be up to the user to use that option.
Thanks @pratiman-91 for the explanation. For cases where unrelated files sneak into the file list for some reason, the enhancements in this PR would really help the user to just get open_mfdataset to work, without having to examine the file list.
I'm inclined to merge this, but unsure about the CI failures. I'll restart CI, let's see if this was just intermittent.
Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! 
Thanks @pratiman-91 for sticking with us, and congrats on your first contribution!
@max-sixty @kmuehlbauer @headtr1ck Thank you very much for your help. It was a nice experience and I learned a lot.