xarray
xarray copied to clipboard
open_datatree performance improvement on NetCDF and Zarr files
open_datatree performance improvement on NetCDF files
- [x] Closes #8994 (NetCDF + Zarr datatree)
- [x] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in
whats-new.rst
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. If you have questions, some answers may be found in our contributing guidelines.
Yes very excited by this! Two final things:
- This deserves a
whats-new.rst
entry! - Would you be willing to add an benchmark test? You can see here how we benchmark opening and loading a single netCDF file
https://github.com/pydata/xarray/blob/447e5a3d16764a880387d33d0b5938393e167817/asv_bench/benchmarks/dataset_io.py#L125
Alternatively we could leave adding that benchmark to a separate PR?
Sorry for being late to the party, but do we really want to have mode
in open_datatree
? The open_*
functions are for read access only (imho).
Update: It looks like there are already several keyword arguments in the open_*
-functions at least for netcdf4/h5netcdf backends which are only needed for write access (mode
, format
, invalid_netcdf
, clobber
, diskless
, persist
and maybe even more).
In the meeting yesterday we decided to not bother with deduplication for now (and anyways it is nice to have backends somewhat self-contained).
The API will be extended in #9033, the idea is to have one single PR that marks the entire DataTree
API as public.
Edit: and I agree, since open_dataset
in particular already has these parameters, it's fine if open_datatree
also has them (before marking BackendEntrypoint.open_datatree
as public and documenting it we should probably clean it up a bit more)
Hi @TomNicholas,
Thanks for your review and my apologies for not joining today's meeting. My internet connection is quite limited, which is causing some difficulties in staying fully connected.
I will ensure to have some of this addressed by our upcoming meeting on 06/18. However, I have some comments to share.
This is very close! We should try to get it in.
This deserves a whats-new.rst entry!
We don't need to wait for anything here - one of the maintainers can just add this by pushing to this branch.
I can work on this for our next meeting. It won't take to much work.
Would you be willing to add an benchmark test? You can see here how we benchmark opening and loading a single netCDF file
I may be able to work on it, but I'm not sure if it will be ready for next week. Let me check it out
I'm fine to leave this to a later PR, we should just create an issue so we remember to add it later. The main purpose is as a performance regression test - we already know this PR is a big improvement!
We shouldn't be adding extra kwargs though - as far as I can tell
open_datatree
should have exactly the same signature asopen_dataset
. See Kai's comment. Again @owenlittlejohns or @flamingbear you can just push to this branch to push it over the finish line. Once that's done I'm happy to approve and merge.
Regarding this last point, I have been using the open_dataset
parameters since the beginning. However, I completely agree with you and @kmuehlbauer about removing those unnecessary parameters. This will be ready as soon as possible.
This looks to be updated, and I would like to try to merge it before next week. The items remaining from @TomNicholas comment are addressed but for the benchmark test that can be done in a separate PR.
The keywords are not exact with the open_dataset, but have removed those related to writing. That seems fine to me, but looking for second opinions via thumbs up. Otherwise I will add a plan to merge
label and wait.
Nice! Merging since this has been thoroughly looked at.
Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!
Wooooo great work all!!
Thank you @aladinor and all! This is a great contribution.
I was nice to work in this PR. See you all in the next datatree meeting.
Great work Alfonso!