save dataset into netCDF with compression

Open JanisGailis opened this issue 8 years ago • 5 comments

Each SST file takes 16 MB on disk but 1 GB when uncompressed. Running the averaging now results in each monthly time slice being written to disk as an uncompressed netCDF dataset, as no compression is applied. It would be beneficial to apply compression upon saving, so that we have an 'uncompress-process-compress' pipeline.
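
For reference, a minimal sketch of the size mismatch, using xarray (the file and variable names here are illustrative, not the actual ones):

import xarray as xr

# xarray records the source compression settings in each variable's
# .encoding dict when reading netCDF-4 data.
ds = xr.open_dataset('sst_daily.nc')
print(ds['analysed_sst'].encoding.get('zlib'))  # True if deflate-compressed on disk

# Size of the fully decoded data in memory, to compare with the file size.
print(ds.nbytes / 1e9, 'GB uncompressed')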

JanisGailis avatar Feb 03 '17 10:02 JanisGailis

@kbernat No need to take immediate action on this. It may well get sorted out as I continue working on the daily->monthly averaging. I'll let you know if I need help!

JanisGailis avatar Feb 03 '17 10:02 JanisGailis

Some info about compressing netCDF variables: http://unidata.github.io/netcdf4-python/#section9

kbernat avatar Mar 02 '17 16:03 kbernat

For each variable you can set a specific compression, like:

variable.encoding.update({'zlib': True, 'complevel': 9})

or specify it as a parameter in dataset.to_netcdf()

dataset.to_netcdf(...,
                  encoding={'var_name': {'zlib': True, 'complevel': 9}})
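
Putting it together, a minimal sketch that applies the same deflate settings to every data variable on save (file names are illustrative):

import xarray as xr

ds = xr.open_dataset('sst_monthly.nc')

# One encoding entry per data variable; zlib=True enables deflate,
# complevel ranges from 1 (fastest) to 9 (smallest output).
encoding = {name: {'zlib': True, 'complevel': 9} for name in ds.data_vars}
ds.to_netcdf('sst_monthly_compressed.nc', encoding=encoding)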

kbernat avatar Mar 10 '17 15:03 kbernat

We could add compression control parameters to the write_netcdf() operation. Doing so usually also requires providing "reasonable" chunk sizes for large datasets.
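
A hypothetical sketch of what such an operation could look like (the signature and the parameter names zlib, complevel, and chunksizes are illustrative, not an actual cate API):

import xarray as xr

def write_netcdf(ds: xr.Dataset, file: str,
                 zlib: bool = True, complevel: int = 4,
                 chunksizes: tuple = None):
    # netCDF-4 deflate compresses chunk by chunk, so large variables
    # also need sensible chunk sizes for writing and later reads to
    # perform well.
    encoding = {}
    for name, var in ds.data_vars.items():
        enc = {'zlib': zlib, 'complevel': complevel}
        # Apply chunk sizes only when they match the variable's rank,
        # e.g. (1, 512, 512) for a (time, lat, lon) variable.
        if chunksizes is not None and len(chunksizes) == var.ndim:
            enc['chunksizes'] = chunksizes
        encoding[name] = enc
    ds.to_netcdf(file, encoding=encoding)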

forman avatar Sep 21 '17 07:09 forman

I was going to open an issue on this myself, but I see it's already here. I think this is an important thing to do. When I wrote out the results of the monthly aggregation of the SST dataset, the netCDF file was 240 GB(!), but with gzip it came down to 27 GB.

kjpearson avatar Jul 19 '18 07:07 kjpearson