
Investigate using "zarr-like" access on existing data files

Open aidanheerdegen opened this issue 3 years ago • 1 comments

There is an interesting approach that exposes netCDF4 files as a zarr-like dataset and reads chunks directly from the file, bypassing the netCDF/HDF5 libraries completely:

https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685

The speed-ups can be very large. TL;DR: a >60x speed-up for the example in the article above.

It isn't clear that this will work as well in a non-cloud deployment, but it is certainly worth testing.

The process itself is cumbersome, but it could be incorporated relatively easily into the COSIMA Cookbook back-end and be transparent to users, with no change to the API required.
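To make the idea concrete, here is a toy sketch of the principle only, using nothing but the standard library. It is not the article's code and not a Cookbook implementation. It assumes a made-up binary file with a fixed header and two uncompressed chunks; the names `write_demo_file`, `build_index`, `read_chunk`, and the `temp/<n>` keys are all hypothetical. For a real netCDF4 file, the offsets would have to be extracted once from the HDF5 metadata rather than computed from a known layout as they are here.

```python
import os
import struct
import tempfile

CHUNK_VALUES = 4           # values per chunk (hypothetical chunk size)
HEADER = b"FAKEHDR!" * 4   # 32 bytes standing in for container metadata


def write_demo_file(path):
    """Write a header followed by two uncompressed little-endian float64 chunks."""
    chunks = [(0.0, 1.0, 2.0, 3.0), (4.0, 5.0, 6.0, 7.0)]
    with open(path, "wb") as f:
        f.write(HEADER)
        for values in chunks:
            f.write(struct.pack("<4d", *values))


def build_index(path):
    """Map zarr-style chunk keys to (path, byte offset, byte length).

    The layout is known here, so offsets are computed directly. This is
    the step that would require scanning file metadata for real data.
    """
    nbytes = CHUNK_VALUES * 8  # 8 bytes per float64
    return {
        f"temp/{i}": (path, len(HEADER) + i * nbytes, nbytes)
        for i in range(2)
    }


def read_chunk(index, key):
    """Read one chunk with a plain seek + read, with no format library involved."""
    path, offset, length = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(length)
    return struct.unpack(f"<{length // 8}d", raw)


def demo():
    """Build the demo file, index it, and read the second chunk directly."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.bin")
        write_demo_file(path)
        index = build_index(path)
        return read_chunk(index, "temp/1")
```

Calling `demo()` reads only the bytes of the second chunk. The index is built once and can be stored alongside the data, which is why the back-end could hide this step from users.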

aidanheerdegen, Oct 18 '21 00:10

I still reckon this is a good idea if most of the benefits of zarr storage are available without the hassle of re-encoding hundreds of terabytes of data. I believe the required information (chunk size) is already available, so it may be straightforward.

aidanheerdegen, Jun 20 '22 06:06