echopype
Improve chunking when saving to a file
When we use the routines ed.to_zarr() or ed.to_netcdf(), we eventually call the function _save_groups_to_file. Within this function we automatically chunk the xarray dataset with respect to coordinates such as range_sample or ping_time. However, this is not optimal, and the current default settings are not appropriate. For example, this automatic chunking of the entire dataset causes ed['Sonar/Beam_group1'].sample_interval to have chunk sizes of 58.59 kiB when three channels are present.
Instead of automatically chunking the whole dataset, I propose that we check the size of each variable in the dataset and chunk it only if it is above a certain amount (say 100 MiB). This chunking should be "intelligent": we need to take into account all dimensions of the variable and investigate whether chunking with respect to range_sample is necessary. Additionally, if the variable is already chunked, we should check that the chunking is appropriate and rechunk if necessary.
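A minimal sketch of the size check proposed above, in plain Python so that it does not depend on any particular xarray or dask call. The function name suggest_chunks, its parameters, and the dimension lengths in the usage lines are hypothetical and chosen only for illustration; the 100 MiB threshold is the value suggested above, and reusing it as the target chunk size is an assumption.

```python
from math import prod

MiB = 1024 ** 2


def suggest_chunks(dim_sizes, itemsize, threshold_bytes=100 * MiB, target_bytes=100 * MiB):
    """Suggest per-dimension chunk lengths for one variable (hypothetical helper).

    dim_sizes maps dimension name -> length, itemsize is bytes per element.
    Returns {} when the variable is small enough to be left unchunked.
    """
    total_bytes = prod(dim_sizes.values()) * itemsize
    if total_bytes <= threshold_bytes:
        return {}

    # Start from a single chunk spanning the whole variable.
    chunks = dict(dim_sizes)
    # Split the longest dimension first (usually ping_time); shorter
    # dimensions such as range_sample are split only if still needed.
    for dim in sorted(dim_sizes, key=dim_sizes.get, reverse=True):
        chunk_bytes = prod(chunks.values()) * itemsize
        if chunk_bytes <= target_bytes:
            break
        bytes_per_index = chunk_bytes // chunks[dim]
        chunks[dim] = max(1, target_bytes // bytes_per_index)
    return chunks


# Made-up shape for a small variable: 3 * 2500 * 8 bytes = 60,000 bytes.
small = suggest_chunks({"channel": 3, "ping_time": 2500}, itemsize=8)

# Made-up shape for a large variable: about 6 GB of 4-byte values.
large = suggest_chunks(
    {"channel": 3, "ping_time": 100_000, "range_sample": 5_000}, itemsize=4
)
```

With these made-up shapes, the small variable is left unchunked and the large one is split along ping_time only, so range_sample stays whole. If something like this were adopted, the returned mapping could presumably be passed on to the dataset's chunking step, but that wiring is not shown here.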
I agree that the default chunking is not always necessary and probably makes performance worse in some cases. I think a good juncture to look into this is when we combine multiple data files and need to rechunk anyway, since the final combined data will be much larger and chunking will be necessary. Knowing the approximate total size and dimensions (the largest dimension would usually be ping_time) will also help determine the optimal chunk size, especially when we know what the next step of operation is (calibration).
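As a rough illustration of how the combined size could inform the chunk size, the sketch below estimates the total size of a combined variable from per-file ping counts and derives a ping_time chunk length for a given target. The function name, the per-file ping counts, the other dimension lengths, and the 100 MiB target are all assumptions made up for this example.

```python
from math import ceil

MiB = 1024 ** 2


def combined_ping_time_chunk(pings_per_file, n_channel, n_range_sample,
                             itemsize, target_bytes=100 * MiB):
    """Estimate combined size and a ping_time chunk length (hypothetical helper).

    Assumes every file shares the same channel and range_sample lengths,
    so only ping_time grows when files are combined.
    """
    total_pings = sum(pings_per_file)
    bytes_per_ping = n_channel * n_range_sample * itemsize
    total_bytes = total_pings * bytes_per_ping
    # Largest ping_time chunk that fits the target, capped at the full length.
    chunk_len = max(1, min(total_pings, target_bytes // bytes_per_ping))
    n_chunks = ceil(total_pings / chunk_len)
    return total_bytes, chunk_len, n_chunks


# Three made-up files with 1000, 2000 and 3000 pings of 4-byte samples.
total_bytes, chunk_len, n_chunks = combined_ping_time_chunk(
    [1000, 2000, 3000], n_channel=3, n_range_sample=5000, itemsize=4
)
```

The target would likely need tuning for the next operation in the pipeline, since a step that reads whole pings benefits from different chunk shapes than one that reduces along ping_time.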
I would also suggest that we spend more time on chunking the Sv dataset (i.e., after the "consolidating" functions we discussed) rather than the raw converted data, since that dataset is where the majority of downstream computations will happen.
@lsetiawan: I think this issue has gone stale since the default chunking you added in #939. We will need to revisit the chunking later when we optimize the computations, probably along with the rechunk option. How about closing this, since we have new issues tracking the rechunking needs?
I'll close this now since it is outdated.