Investigate memory usage of `ZarrCombine`

Open · b-reyes opened this issue 3 years ago

When the ZarrCombine class was first created, it was found that combining a large number of files could lead to a substantial increase in memory usage. In PR #824 the class variable max_append_chunk_size was introduced to limit how large the chunks can get, and this limit curbs the memory growth. Currently we have set max_append_chunk_size=1000, however, no real study has gone into this upper bound.

It is important that we do one of the following:

  1. Conduct a study across various file types and obtain a heuristically driven value for this upper bound.
  2. Let max_append_chunk_size be an input to combine_echodata.
  3. Set an upper bound on the amount of memory per chunk (say 50 MiB) and automatically determine max_append_chunk_size from that limit for each variable/coordinate (a sketch of this option follows the list).
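
A minimal sketch of option 3, assuming a hypothetical helper name; the function, its arguments, and the 50 MiB default are illustrative, not part of echopype:

```python
import numpy as np

def chunk_len_for_memory_limit(dtype, elements_per_row, max_chunk_bytes=50 * 1024**2):
    """Rows per chunk along the append dimension so that one chunk of
    `elements_per_row` values of `dtype` per row stays under `max_chunk_bytes`."""
    bytes_per_row = np.dtype(dtype).itemsize * elements_per_row
    # Always keep at least one row per chunk, even if a single row exceeds the limit.
    return max(1, max_chunk_bytes // bytes_per_row)

# e.g. a float64 variable with 500 values along the non-append dimensions
max_append_chunk_size = chunk_len_for_memory_limit(np.float64, 500)  # -> 13107
```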

b-reyes · Oct 04 '22 20:10

Perfect that you created this issue! I am currently reviewing #824 and looked at the combined echodata object. Here's an example of one of the data arrays from the Beam group:

[Screenshot from 2022-10-04: chunk layout of a combined Beam-group data array]

The new combine_echodata method works great so far, however, the chunk sizes are way too small. You can see that there are 110 chunks for this particular array! That's going to be a lot of I/O. I think there should be a way to specify the chunk size as a byte string (MB/GB, etc.) rather than just an integer. At the end of the day I can see roughly a 100 MB chunk size for a 1 GB array, rather than the current 25 MB chunk size. Btw, dask.utils.parse_bytes is a great function for parsing such a string to a plain byte count.
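
For reference, dask.utils.parse_bytes converts a human-readable size string into an integer byte count, which would make a string-valued chunk-size argument straightforward to support (a quick illustration, not echopype code):

```python
from dask.utils import parse_bytes

parse_bytes("100MB")   # 100000000
parse_bytes("50 MiB")  # 52428800
parse_bytes("1.5GB")   # 1500000000
```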

lsetiawan · Oct 04 '22 20:10

This is now addressed in https://github.com/OSOceanAcoustics/echopype/pull/1042.

leewujung · Jun 05 '23 16:06