xarray
Improving performance of open_datatree
What is your issue?
The implementation of `open_datatree` works, but is inefficient, because it calls `open_dataset` once for every group in the file. We should refactor this to improve the performance, which would fix issues like https://github.com/xarray-contrib/datatree/issues/330.
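To make the cost pattern concrete, here is a toy sketch in plain Python. It is not xarray code: `FakeFile`, `open_tree_per_group`, and `open_tree_shared_handle` are hypothetical names used only for illustration, and "opening" is reduced to incrementing a counter.

```python
from dataclasses import dataclass


@dataclass
class FakeFile:
    """Hypothetical stand-in for a netCDF file: group path -> variable names."""

    groups: dict
    open_count: int = 0

    def open(self):
        # Every call models one (expensive) file open.
        self.open_count += 1
        return self.groups


def open_tree_per_group(f):
    # One open to discover the group paths, then one more open per group.
    # This mirrors calling open_dataset once for every group.
    paths = list(f.open())
    return {path: f.open()[path] for path in paths}


def open_tree_shared_handle(f):
    # A single open whose handle is reused for every group.
    handle = f.open()
    return {path: handle[path] for path in handle}
```

With N groups, the first function performs N + 1 opens while the second performs exactly one, which is the difference the refactor is meant to capture.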
We discussed this in the datatree meeting, and my understanding is that concretely we need to:
- [ ] Create an asv benchmark for `open_datatree`, probably involving first writing then benchmarking the opening of a special netCDF file that has no data but lots of groups.
- [ ] Refactor the `NetCDF4DataStore` class to only create one `CachingFileManager` object per file, not one per group, see https://github.com/pydata/xarray/blob/748bb3a328a65416022ec44ced8d461f143081b5/xarray/backends/netCDF4_.py#L406.
- [ ] Refactor `NetCDF4BackendEntrypoint.open_datatree` to use an implementation that goes through `NetCDF4DataStore` without calling the top-level `xr.open_dataset` again.
- [ ] Check that the performance of calling `xr.open_datatree` on a netCDF file has actually improved.
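A minimal sketch of what the first item could look like, assuming the usual asv conventions (a class with `setup`, `time_*`, and `teardown` methods, where raising `NotImplementedError` in `setup` skips the benchmark). The class name, group count, and file name are placeholders, and the sketch assumes `open_datatree` is exposed at the top level of the installed xarray; if it is not, the benchmark is skipped.

```python
import os
import tempfile


class OpenDataTreeManyGroups:
    """Hypothetical asv benchmark: open a netCDF file with many empty groups."""

    n_groups = 100  # placeholder value, tune as needed

    def setup(self):
        # Import lazily so that asv skips the benchmark when a dependency
        # (or a top-level open_datatree) is unavailable.
        try:
            import netCDF4
            import xarray as xr
        except ImportError as err:
            raise NotImplementedError("netCDF4 and xarray are required") from err
        self.open_datatree = getattr(xr, "open_datatree", None)
        if self.open_datatree is None:
            raise NotImplementedError("xarray.open_datatree is not available")

        self.tmpdir = tempfile.TemporaryDirectory()
        self.path = os.path.join(self.tmpdir.name, "many_groups.nc")
        # Write a file that contains no data, only groups.
        with netCDF4.Dataset(self.path, "w") as root:
            for i in range(self.n_groups):
                root.createGroup(f"group_{i}")

    def time_open_datatree(self):
        # Only the opening of the tree is timed.
        self.open_datatree(self.path, engine="netcdf4")

    def teardown(self):
        self.tmpdir.cleanup()
```

Running this before and after the refactor would also cover the last item on the list.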
It would be great to get this done soon as part of the datatree integration project. @kmuehlbauer I know you were interested - are you willing / do you have time to take this task on?
cc also @mgrover1, @aladinor, @flamingbear, @owenlittlejohns, @eni-awowale in case I missed anything
Thanks @TomNicholas for adding more traction here. Unfortunately I'm unable to dedicate as much time as needed here in the upcoming 4 weeks. IIUC @aladinor is already working towards a prototype based on https://github.com/pydata/xarray/pull/7437. Please correct me if I'm wrong.
I've played with that branch a bit myself to get familiar with the code. I tried rebasing/refactoring it onto recent main and fixing some immediate issues to make it work, but did not get far. Too much has changed in that part of the codebase, which makes rebasing a bit of a pain. I'll see if I can at least get something to work over the weekend.
Thanks, @TomNicholas, for putting this together. Indeed, I've been working on the aforementioned steps, and I'd be happy to share some results with you at our next dtree meeting. BTW, when is the next meeting?
See #8747. In summary, we have a weekly meeting every Tuesday at 11:30 EST.
> Too much has changed in that part of the codebase, which makes rebasing a bit of a pain.
Indeed, it may be easier to start again from the current state. The plugin mechanism basically works, but a lot of the details (like the chunk handling) are still missing, and are currently done by the call to `open_dataset`.