
One xr.Dataset per ensemble response group / parameter

yngve-sk opened this issue on Mar 01 '24 • 0 comments

The datasets should be unified into one file per ensemble for quicker load times, especially when selecting summary vectors. There are several approaches to this:

(easiest; the suggestion is to use this first, then consider speeding it up later if that seems worthwhile) Combine netCDFs after generation: save each response & parameter to the realization folder as a file named {key}.nc, then use combine_nested after processing to merge them. Pro: easy to implement. Con: slowest. For a summary vector slightly bigger than what "real-world" assets use (Troll), it takes about 1 minute to build the unified summary per ensemble (across 400 realizations, 400 timesteps, 10000 keywords). If we can gradually append instead, we would save a bit less than 1 minute (per-realization writes get somewhat slower, but the separate combine step at the end is skipped).
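A minimal sketch of this approach, assuming one `{key}.nc` per realization folder; the paths, the `combine_realizations` helper, and the `realization` dimension name are illustrative, not ERT's actual layout:

```python
import xarray as xr

def combine_realizations(realization_paths, out_path):
    # Open each per-realization dataset, concatenate along a new
    # "realization" dimension, and write one unified file per group.
    datasets = [xr.open_dataset(p) for p in realization_paths]
    combined = xr.combine_nested(datasets, concat_dim="realization")
    combined.to_netcdf(out_path)
    for ds in datasets:
        ds.close()

# e.g. combine_realizations(
#     [f"real-{i}/summary.nc" for i in range(400)], "summary.nc")
```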

(fastest; introduces more complexity) Gradually append to a single dataset: a bit more error prone, and exposed to concurrency/multiprocessing issues, though there are ways this could be done (two sketches follow the list below).

  • Zarr is designed to support parallel writing from multiple processes without explicit synchronization, as long as regions can be specified (which assumes a uniform dataset size across realizations); see the first sketch after this list. It does, however, create a pile of small files, which may be an issue on NFS/Azure. Tests show appending is only slightly slower than creating individual files. Zarr also doesn't really support the concept of coordinates for labelling dimensions, and gives a possibly nonsensical error when appending along the realization dimension. Upgrading to a newer XArray/Zarr might resolve this, but we are stuck on Python 3.8 at the moment. It could be that we would have to "initialize" an empty dataset, i.e., declare the total size before writing to it. That introduces a limitation which is currently OK but could be bad for future use if we want varying response/dataset sizes.
  • NetCDF seems to need some synchronization to work (a file lock or process orchestration). However, the NetCDF4 documentation mentions parallel I/O, so checking whether that is usable is worth at least a try. NetCDF does support unlimited dimensions and appending data along them, but it is not totally clear whether the append happens "directly" on the saved file, which would be a lot quicker if supported. XArray does not seem to fully support this, but the netcdf4 package (https://unidata.github.io/netcdf4-python/) could give us access to these lower-level features; see the second sketch below. The biggest issue when appending via XArray is that it rewrites the dimensions, since the realization dimension grows once per append. Appending along an unlimited dimension also seems like a very normal use case and should be possible, but it is not totally trivial out of the box.
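For the Zarr route, a sketch of region-based parallel writes on a recent xarray (per the note above, the xarray/Zarr versions available under Python 3.8 may not handle this cleanly). The store layout, the `init_store`/`write_realization` helpers, and the uniform `(time, name)` shape per realization are assumptions:

```python
import numpy as np
import xarray as xr

def init_store(store, n_realizations, times, keywords):
    # Declare the full ensemble size up front; compute=False writes only
    # metadata, so no array data is flushed here. Chunking by realization
    # (requires dask) aligns region writes with chunk boundaries.
    template = xr.Dataset(
        {
            "values": (
                ("realization", "time", "name"),
                np.zeros((n_realizations, len(times), len(keywords)), "float32"),
            )
        },
        coords={
            "realization": np.arange(n_realizations),
            "time": times,
            "name": keywords,
        },
    ).chunk({"realization": 1})
    template.to_zarr(store, compute=False)

def write_realization(store, iens, ds):
    # Each process writes its own non-overlapping slice along
    # "realization", so no explicit synchronization is needed.
    # Coordinates lacking the region dimension must be dropped first.
    ds.expand_dims(realization=[iens]).drop_vars(["time", "name"]).to_zarr(
        store, region={"realization": slice(iens, iens + 1)}
    )
```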
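And for the NetCDF route, a sketch of appending along an unlimited dimension directly with the netcdf4 package, sidestepping XArray's rewrite-on-append behaviour. Names are illustrative, and a file lock or process orchestration is still needed so that only one process appends at a time:

```python
import netCDF4

def init_file(path, n_time, n_name):
    with netCDF4.Dataset(path, "w") as nc:
        nc.createDimension("realization", None)  # unlimited: grows on append
        nc.createDimension("time", n_time)
        nc.createDimension("name", n_name)
        nc.createVariable("values", "f4", ("realization", "time", "name"))

def append_realization(path, values_2d):
    # Write directly into the next free slot along the unlimited dimension;
    # only the new slice is written, existing data is left untouched.
    with netCDF4.Dataset(path, "a") as nc:
        var = nc["values"]
        var[var.shape[0], :, :] = values_2d
```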

Related links:

yngve-sk · Mar 01 '24 07:03