Allow handling larger-than-memory total data volume in `combine_echodata`
In one of our echopype-examples notebooks (the OOI one), the kernel dies if local memory is not large enough to hold the total volume of data from all files. In theory, everything can be lazy-loaded within the combined `EchoData` object so that this doesn't happen.
@leewujung and I just discussed a possible solution to this.
The first change we want to make is that `combine_echodata` has a required input, say `output_path`, that specifies where the final form of the combined echodata objects should be stored. Additionally, we should require that this be a zarr file.
With this assumption, we have three scenarios for our input:
- all delayed echodata objects (read from zarr or netcdf files)
  - In this scenario, we can simply do an xarray `combine_nested` (or some equivalent), write these combined objects to `output_path`, and either return the delayed object produced by `combine_nested` or read the combined objects from `output_path` (using `open_converted`) and return the delayed object (I prefer the former, if possible).
- all in-memory echodata objects (i.e. in RAM)
  - Here we will write these objects to `output_path` using a similar process to my work done in PR #774. Then, we will use `open_converted` on `output_path` and return the delayed object.
- a mix of delayed and in-memory echodata objects
  - In this case we will separate the delayed objects from those that are not delayed. We will use `combine_nested` on the delayed objects, write them to `output_path`, and then append the in-memory echodata objects to `output_path`.
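The first (all-delayed) scenario can be sketched with plain xarray. This is only an illustration: the variable and dimension names (`Sv`, `ping_time`) are made up, and in the real workflow the inputs would be dask-backed datasets produced by `open_converted` rather than small in-memory stand-ins.

```python
import numpy as np
import xarray as xr

# Two small stand-ins for per-file datasets; in the real workflow these
# would be lazily loaded (dask-backed) via open_converted.
ds_list = [
    xr.Dataset(
        {"Sv": (("ping_time",), np.arange(3.0) + 3 * i)},
        coords={"ping_time": np.arange(3) + 3 * i},
    )
    for i in range(2)
]

# Combine along ping_time; when the inputs are dask-backed, this stays
# lazy and no data is loaded into memory here.
combined = xr.combine_nested(ds_list, concat_dim="ping_time")

# Writing to the required output_path would then be:
# combined.to_zarr(output_path, mode="w")
```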
To clarify what I meant in #797 (copy-pasting the relevant section below to avoid having to find it later):
> What I think needs work is the case when the list of EchoData objects are all in memory. In this case ~~they need to be combined, chunked, and written to disk~~ I think each of the groups can be written in parallel into zarr files and then lazy-loaded back to be represented in the combined `EchoData` object using the same mechanism as the case with all lazy-loaded EchoData objects. Re-chunking can be delayed along with the writing-to-zarr and computed at once.

@leewujung I think this option is the easiest to implement (and I prefer it); however, I thought that you did not want to write all the in-memory objects to zarr, as this may increase cost on the cloud. If you now think this is a viable option, could you please state this in issue #766? I thought we had arrived at something different for this situation.
Here's what I am proposing (and what I have always thought!):
- write in-memory data into the final zarr location in parallel, since we know the coordinate indices for all dimensions -- this is why we ask the user to supply the final zarr location in the first place. If you delay `open_mfdataset` with chunking and write to zarr, the writing is distributed (I have tested this). The "problem" right now is that when data are in memory the computations are not delayed automatically, I don't think -- try delaying them explicitly and see if `to_zarr` invokes a distributed save automatically.
- lazy-load data from the final zarr object to be represented in the `EchoData` object (this step is the same as the all-read-from-disk scenario)
Seems like we are ready to close this? @b-reyes @lsetiawan
I believe the recent fixes to combine_echodata address this issue.
From discussion: The last thing to do is to check that this works on binder.
This works on binder even with its limited memory, closing now!