Allow handling larger-than-memory total data volume in `combine_echodata`
In one of our echopype-examples notebooks (the OOI one), the kernel dies if local memory is not large enough to hold the total volume of data from all files. In theory, everything can be lazy-loaded within the combined `EchoData` object so that this doesn't happen.
@leewujung and I just discussed a possible solution to this.
The first change we want to make is that `combine_echodata` has a required input, say `output_path`, that specifies where the final form of the combined echodata objects should be stored. Additionally, we should require that this be a zarr file.
With this assumption, we have three scenarios for our input:
- all delayed echodata objects (read from zarr or netcdf files)
  - In this scenario, we can simply do an xarray `combine_nested` (or some equivalent), write these combined objects to `output_path`, and either return the delayed object produced by `combine_nested` or read the combined objects from `output_path` (using `open_converted`) and return the delayed object (I prefer the former, if possible).
- all in-memory echodata objects (i.e. in RAM)
  - Here we will write these objects to `output_path` using a similar process to my work done in PR #774. Then, we will use `open_converted` on `output_path` and return the delayed object.
- a mix of delayed and in-memory echodata objects
  - In this case we will separate the delayed objects from those that are not delayed. We will use `combine_nested` on the delayed objects, write them to `output_path`, and then append the in-memory echodata objects to `output_path`.
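The first (all-delayed) scenario can be sketched with plain xarray. This is only an illustration: the variable and dimension names (`Sv`, `ping_time`) are made up, and in the real workflow the inputs would be dask-backed datasets produced by `open_converted` rather than small in-memory stand-ins.

```python
import numpy as np
import xarray as xr

# Two small stand-ins for per-file datasets; in the real workflow these
# would be lazily loaded (dask-backed) via open_converted.
ds_list = [
    xr.Dataset(
        {"Sv": (("ping_time",), np.arange(3.0) + 3 * i)},
        coords={"ping_time": np.arange(3) + 3 * i},
    )
    for i in range(2)
]

# Combine along ping_time; when the inputs are dask-backed, this stays
# lazy and no data is loaded into memory here.
combined = xr.combine_nested(ds_list, concat_dim="ping_time")

# Writing to the required output_path would then be:
# combined.to_zarr(output_path, mode="w")
```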
To clarify what I meant in #797 (copy-pasting the relevant section below to avoid having to find it later):
> What I think needs work is the case when the list of EchoData objects are all in memory. In this case ~~they need to be combined, chunked, and written to disk~~ I think each of the groups can be written in parallel into zarr files and then lazy-loaded back to be represented in the combined `EchoData` object using the same mechanism as the case with all lazy-loaded EchoData objects. Re-chunking can be delayed along with the writing-to-zarr and computed at once.

@leewujung I think this option is the easiest to implement (and I prefer it); however, I thought that you did not want to write all the in-memory objects to zarr, as this may increase cost on the cloud. If you now think this is a viable option, could you please state this in issue #766? I thought we had arrived at something different for this situation.
Here's what I am proposing (and what I have always thought!):
- write in-memory data into the final zarr location in parallel, since we know the coordinate indices for all dimensions -- this is why we ask the user to supply the final zarr location in the first place. If you delay `open_mfdataset` with chunking and write to zarr, the writing is distributed (I have tested this). The "problem" right now is that when data are in memory the computations are not delayed automatically, I don't think -- try delaying them explicitly and see if `to_zarr` invokes a distributed save automatically.
- lazy-load data from the final zarr object to be represented in the `EchoData` object (this step is the same as the all-read-from-disk scenario)
Seems like we are ready to close this? @b-reyes @lsetiawan
I believe the recent fixes to combine_echodata address this issue.
From discussion: The last thing to do is to check that this works on binder.
This works on binder even with its limited memory, closing now!