xarray open_datatree performance improvement on NetCDF and Zarr files

open_datatree performance improvement on NetCDF files

[x] Closes #8994 (NetCDF + Zarr datatree)
[x] Tests added
[ ] User visible changes (including notable bug fixes) are documented in whats-new.rst

May 07 '24 19:05 aladinor

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. If you have questions, some answers may be found in our contributing guidelines.

May 07 '24 19:05 welcome[bot]

Yes very excited by this! Two final things:

This deserves a whats-new.rst entry!
Would you be willing to add an benchmark test? You can see here how we benchmark opening and loading a single netCDF file

https://github.com/pydata/xarray/blob/447e5a3d16764a880387d33d0b5938393e167817/asv_bench/benchmarks/dataset_io.py#L125

Alternatively we could leave adding that benchmark to a separate PR?

Jun 04 '24 22:06 TomNicholas

Sorry for being late to the party, but do we really want to have mode in open_datatree? The open_* functions are for read access only (imho).

Update: It looks like there are already several keyword arguments in the open_*-functions at least for netcdf4/h5netcdf backends which are only needed for write access (mode, format, invalid_netcdf, clobber, diskless, persist and maybe even more).

Jun 05 '24 06:06 kmuehlbauer

In the meeting yesterday we decided to not bother with deduplication for now (and anyways it is nice to have backends somewhat self-contained).

The API will be extended in #9033, the idea is to have one single PR that marks the entire DataTree API as public.

Edit: and I agree, since open_dataset in particular already has these parameters, it's fine if open_datatree also has them (before marking BackendEntrypoint.open_datatree as public and documenting it we should probably clean it up a bit more)

Jun 05 '24 09:06 keewis

Hi @TomNicholas,

Thanks for your review and my apologies for not joining today's meeting. My internet connection is quite limited, which is causing some difficulties in staying fully connected.

I will ensure to have some of this addressed by our upcoming meeting on 06/18. However, I have some comments to share.

This is very close! We should try to get it in.

This deserves a whats-new.rst entry!

We don't need to wait for anything here - one of the maintainers can just add this by pushing to this branch.

I can work on this for our next meeting. It won't take to much work.

Would you be willing to add an benchmark test? You can see here how we benchmark opening and loading a single netCDF file

I may be able to work on it, but I'm not sure if it will be ready for next week. Let me check it out

I'm fine to leave this to a later PR, we should just create an issue so we remember to add it later. The main purpose is as a performance regression test - we already know this PR is a big improvement!

We shouldn't be adding extra kwargs though - as far as I can tell open_datatree should have exactly the same signature as open_dataset. See Kai's comment. Again @owenlittlejohns or @flamingbear you can just push to this branch to push it over the finish line. Once that's done I'm happy to approve and merge.

Regarding this last point, I have been using the open_dataset parameters since the beginning. However, I completely agree with you and @kmuehlbauer about removing those unnecessary parameters. This will be ready as soon as possible.

Jun 12 '24 01:06 aladinor

This looks to be updated, and I would like to try to merge it before next week. The items remaining from @TomNicholas comment are addressed but for the benchmark test that can be done in a separate PR.

The keywords are not exact with the open_dataset, but have removed those related to writing. That seems fine to me, but looking for second opinions via thumbs up. Otherwise I will add a plan to merge label and wait.