cosima-cookbook
Investigate using "zarr-like" access on existing data files
There is an interesting approach which exposes netCDF4 files as a zarr-like dataset, doing direct byte-range reads from the file and bypassing the netCDF/HDF5 libraries completely:
https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685
This yields extremely good speed-ups: TL;DR, a >60x speed-up for the example in the article above.
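The trick is to scan the HDF5 metadata once, record the byte offset and length of every chunk as a zarr-compatible JSON reference set, and then let zarr read those byte ranges directly through fsspec. A minimal sketch of that workflow using kerchunk (the current home of the fsspec-reference-maker code used in the article); the file path here is just a placeholder:

```python
import json

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Hypothetical path to one of our existing netCDF4 output files.
nc_path = "/g/data/example/ocean_daily.nc"

# One-off scan: walk the HDF5 metadata and record where every chunk
# lives (byte offset + length) as a zarr-compatible reference set.
with open(nc_path, "rb") as f:
    refs = SingleHdf5ToZarr(f, nc_path).translate()

# Persist the references so the scan never has to be repeated.
with open("ocean_daily.json", "w") as out:
    json.dump(refs, out)

# Open the original file as if it were a zarr store: reads now go
# straight to the chunk byte ranges, no netCDF/HDF5 library involved.
fs = fsspec.filesystem("reference", fo=refs, remote_protocol="file")
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                     backend_kwargs={"consolidated": False})
print(ds)
```

After the one-off scan, subsequent opens are cheap and all reads go directly to the chunk bytes.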
It isn't clear that this will work as well in a non-cloud deployment, but it is certainly worth testing.
The process itself is cumbersome, but it could fairly easily be incorporated into the COSIMA Cookbook back-end and made transparent to users, with no change to the API required.
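To make that concrete, here is a hypothetical sketch of what the back-end integration could look like: build (or reuse a cached) reference file per netCDF file, and hand back an xarray dataset exactly as the current API does. The function name and cache location are made up for illustration:

```python
import json
from pathlib import Path

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Hypothetical cache location for the reference JSONs.
CACHE_DIR = Path("~/.cosima_refs").expanduser()

def open_via_references(nc_path: str) -> xr.Dataset:
    """Open a netCDF4 file via zarr byte-range reads, caching the
    reference JSON so the HDF5 metadata scan only happens once."""
    CACHE_DIR.mkdir(exist_ok=True)
    ref_file = CACHE_DIR / (Path(nc_path).stem + ".json")
    if ref_file.exists():
        refs = json.loads(ref_file.read_text())
    else:
        with open(nc_path, "rb") as f:
            refs = SingleHdf5ToZarr(f, nc_path).translate()
        ref_file.write_text(json.dumps(refs))
    fs = fsspec.filesystem("reference", fo=refs, remote_protocol="file")
    return xr.open_dataset(fs.get_mapper(""), engine="zarr",
                           backend_kwargs={"consolidated": False})
```

The caller gets back a normal xarray dataset, so the cookbook's existing query functions wouldn't need to change.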
Still reckon this is a good idea, if most of the benefits of zarr storage are available without the hassle of re-encoding hundreds of terabytes of data. I believe the information that is required (chunk size) is already available, so it may be straightforward.
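On that last point: the chunk shape, and the byte offset and length of every stored chunk, can be read straight out of the HDF5 metadata with h5py, which is exactly what the reference-building step consumes. A quick check against a hypothetical file and variable:

```python
import h5py

# Hypothetical file and variable names.
with h5py.File("/g/data/example/ocean_daily.nc", "r") as f:
    dset = f["temp"]
    print("chunk shape:", dset.chunks)

    # Low-level HDF5 API: where each stored chunk sits in the file.
    dsid = dset.id
    for i in range(min(dsid.get_num_chunks(), 5)):
        info = dsid.get_chunk_info(i)
        print(f"chunk {info.chunk_offset}: "
              f"byte_offset={info.byte_offset}, size={info.size}")
```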