Easy way to set the compression level for all dataarrays in a datatree?

Open jwbrooks0 opened this issue 3 years ago • 2 comments

Question: Is there an easy or convenient way to set the compression level for all DataArrays in a datatree?
I've been using the solution here when saving xarray.Datasets, but this solution doesn't seem to map conveniently to datatree. Maybe I'm missing something?

Background: I often deal with multiple datasets, each on the order of 10 GB, and this fills up hard drives fast. When I write xarray.Datasets to file, I typically use gzip compression at a level of around 5, which tends to reduce the file size by around 50%.
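For readers following along, here is a minimal sketch of the Dataset-level workflow described above. It assumes the netCDF4-style encoding keys `zlib` and `complevel`; the helper name `uniform_encoding` and the variable names are hypothetical, and the linked solution may differ in detail.

```python
def uniform_encoding(var_names, complevel=5):
    """Build a flat {variable_name: encoding} dict with the same gzip settings for every variable."""
    # Each variable gets its own dict so later per-variable tweaks do not leak across variables.
    return {name: {"zlib": True, "complevel": complevel} for name in var_names}


# Stand-in variable names; with a real xarray.Dataset `ds` one would pass `ds.data_vars`.
encoding = uniform_encoding(["temperature", "pressure"])
print(encoding)

# Hypothetical usage, assuming `ds` is an xarray.Dataset:
#   ds.to_netcdf("out.nc", encoding=uniform_encoding(ds.data_vars))
```

The helper only builds the dictionary, so it stays independent of any particular xarray version; the write itself happens in `to_netcdf`.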

Thanks!

Edit: Or is the recommended practice to update the compression info for each DataArray? (The second answer to the above link)

jwbrooks0, Jan 14 '22 17:01

Or is the recommended practice to update the compression info for each DataArray? (The second answer to the above link)

I'll just say that this is the approach I've been taking.

However, I do think there is room for improvement in the current API. The challenge, of course, is that the per-variable or per-dataset encoding dictionary needs to map onto the tree structure. In practice these nested-dict data structures are unruly to work with, and I've found that the DataArray.encoding approach covers most of what I need.
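To illustrate the nested-dict shape being discussed, here is a rough sketch that builds a `{group_path: {variable_name: encoding}}` mapping with one compression setting everywhere. The helper `tree_encoding` and the group and variable names are hypothetical, and the commented usage assumes the tree exposes `subtree`, `path`, and `ds` and accepts a nested `encoding` argument in `to_netcdf`; check this against the datatree version in use.

```python
def tree_encoding(vars_by_group, complevel=5):
    """Map every variable in every group to the same gzip encoding settings."""
    return {
        group: {name: {"zlib": True, "complevel": complevel} for name in names}
        for group, names in vars_by_group.items()
    }


# Stand-in for a tree: group path -> variable names stored in that group.
vars_by_group = {"/": ["elevation"], "/run1": ["temperature", "pressure"]}
nested = tree_encoding(vars_by_group)
print(nested)

# Hypothetical usage, assuming `dt` is a DataTree:
#   vars_by_group = {node.path: list(node.ds.data_vars) for node in dt.subtree}
#   dt.to_netcdf("out.nc", encoding=tree_encoding(vars_by_group))
```

Generating the dictionary from the tree itself avoids writing the nested structure by hand, which is where most of the unruliness comes from.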

@jwbrooks0 - I'm curious if you have thoughts on a possible API that meets your use case? Do you agree that a nested dict of encoding parameters is less than ideal from an end-user perspective?

jhamman, Jan 15 '22 05:01

Particularly with Datasets, I would definitely prefer to provide a single command or attribute that sets compression for the entire thing. I haven't thought through datatrees very carefully yet, but I think the same applies. My main goal with compression is to save space on my hard drive, and having a single setting seems easier to me.

I also don't really understand why I would ever want to have different compression levels for individual DataArrays in a Dataset or datatree.

For context, I mostly save data as float32/64 and occasionally int64/32.

jwbrooks0, Jan 15 '22 13:01