climate_indices
Add a --chunksize option with "inputs" as a choice
- Enabling this allows one to carry the input file's chunksizes through to the output file(s); a rough sketch of the idea follows below.
- Closes #476
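Here is a minimal sketch of what the option enables, assuming an xarray-based pipeline; the file names and the compute step are illustrative placeholders, not the PR's actual code:

```python
import xarray as xr

ds_in = xr.open_dataset("pet.nc")

# Snapshot the on-disk chunk sizes that the netCDF4 backend records in
# each variable's encoding.
chunksizes = {name: var.encoding.get("chunksizes")
              for name, var in ds_in.data_vars.items()}

ds_out = ds_in  # stand-in for the actual index computation

# Carry the input chunk sizes through to the output file; this assumes
# the output variables keep the input dimensions.
ds_out.to_netcdf("indices.nc", encoding={
    name: {"chunksizes": sizes}
    for name, sizes in chunksizes.items()
    if sizes is not None and name in ds_out.data_vars
})
```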
Don't know if we want to give the same treatment to __spi__.py or not. Also, I wonder if we might be painting ourselves into a corner with the command-line option name: what if, in the future, we want the ability to chunk the processing itself from start to finish? Chunksizes for I/O tend to be different from chunksizes for processing.
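To illustrate that distinction: dask chunks govern how a computation is split up in memory, while the chunksizes encoding governs the HDF5 layout on disk, and the two are set independently in xarray (the file and variable names here follow the example further down and are illustrative):

```python
import xarray as xr

# Processing chunks: how dask splits the arrays for computation.
ds = xr.open_dataset("pet.nc", chunks={"lat": 100, "lon": 100})

print(ds["pet"].data.chunksize)              # processing (dask) chunks
print(ds["pet"].encoding.get("chunksizes"))  # on-disk (I/O) chunks
```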
Furthermore, do we want to have a similar ability to carry forward other encoding information, like compression and such?
So we could have a --preserve-encoding flag which would trigger a snapshot of the data's encoding configuration on the way in; we could then mangle the data as we see fit while processing the indices, and finally assemble the output in a way that preserves the encoding settings. This almost seems like it should be the default behavior, especially since my plan is to eventually remove the NCO code that attempts to optimize the data before processing. That is the most problematic part of this code (well over 50% of the user help requests are a result of NCO issues, especially on Windows). We could instead take a less "batteries included" approach, expect users to know how they want their data organized, and do the re-orientation/optimization internally with numpy rather than on the NetCDF itself, though it might require a bit more bookkeeping.
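A minimal sketch of that snapshot/reassemble flow, again assuming xarray (the names are hypothetical):

```python
import xarray as xr

ds = xr.open_dataset("pet.nc")

# Snapshot each variable's encoding before processing touches the data.
snapshot = {name: dict(var.encoding) for name, var in ds.data_vars.items()}

result = ds  # stand-in: mangle/reorganize the data as needed in between

# Reassemble the output with the snapshotted settings; which entries are
# actually safe to reapply is the filtering question discussed below.
result.to_netcdf("out.nc", encoding={
    name: enc for name, enc in snapshot.items() if name in result.data_vars
})
```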
Indeed. Now that xarray and dask have significantly matured, a lot of the code here can eventually be removed, along with many of its restrictions. I will note that there is an advantage to having an explicit step that specifically reorganizes the data in a manner that is optimal for time-series analysis. The rechunker project (https://rechunker.readthedocs.io/en/latest/) talks about this a fair bit, but unfortunately it is geared towards the Zarr format rather than NetCDF.
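For reference, the canonical rechunker pattern (per the docs linked above) looks roughly like the following, adapted to the time-series case of going from map-like chunks to chunks spanning the full time axis. The shapes and store paths are illustrative, and note that source, temp, and target are all Zarr stores rather than NetCDF files:

```python
import zarr
from rechunker import rechunk

# A (time, lat, lon) array chunked for map-like (one-timestep) access.
source = zarr.ones((872, 721, 1440), chunks=(1, 721, 1440),
                   store="source.zarr", overwrite=True)

plan = rechunk(
    source,
    target_chunks=(872, 10, 10),  # full time axis, small spatial tiles
    max_mem="1GB",
    target_store="timeseries.zarr",
    temp_store="rechunk_tmp.zarr",
)
plan.execute()
```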
I'm thinking that --preserve-encoding would be the general, all-around way to preserve such information, while --chunksizes inputs would be the way to do just the chunksizes (or even, in the future, to change the chunking on the fly). Other arguments could introduce compression options, again overriding the compression-specific encodings from a --preserve-encoding flag.
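A hypothetical sketch of that CLI surface (the option names mirror this thread, but none of this is the project's actual argument parser):

```python
import argparse

parser = argparse.ArgumentParser(prog="process_climate_indices")
parser.add_argument(
    "--preserve-encoding", action="store_true",
    help="snapshot input encodings (compression, chunking, etc.) "
         "and reapply them on output",
)
parser.add_argument(
    "--chunksizes", choices=["inputs"], default=None,
    help="'inputs' carries the input files' chunk sizes through to the "
         "outputs, overriding any chunking from --preserve-encoding",
)
args = parser.parse_args()
```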
Here is what the encoding looks like for the PET variable in one of my files as a reference:
{'zlib': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (103, 240, 12), 'source': '.(redacted)/pet.nc', 'original_shape': (721, 1440, 872), 'dtype': dtype('float32'), '_FillValue': nan}
Some of this stuff shouldn't get "preserved", such as the 'source'. 'dtype' is a bit questionable.
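Filtering the example encoding above might look like this: drop the entries that record where the data came from rather than how to store it (whether 'dtype' belongs in the drop set is the open question):

```python
import numpy as np

DROP = {"source", "original_shape"}

raw = {"zlib": False, "shuffle": False, "complevel": 0, "fletcher32": False,
       "contiguous": False, "chunksizes": (103, 240, 12),
       "source": ".(redacted)/pet.nc", "original_shape": (721, 1440, 872),
       "dtype": np.dtype("float32"), "_FillValue": np.nan}

preserved = {k: v for k, v in raw.items() if k not in DROP}
```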
Other than two test cases that are now somehow off, I can approve this. The differences are small/minor from what I can see so far; if we compare the new results against the old we probably won't see anything obvious, but I've not gone that far with it yet. I had trouble getting HDF5/NetCDF etc. installed on the machine where I ran the tests, so I'm still not sure whether something like that is causing the minor discrepancies I've seen in the pytest output. Overall, though, the approach here for preserving the chunk sizes looks solid.
Yeah, you're right, I shouldn't let perfect be the enemy of good enough. Is there anyplace else you want this documented, like a "what's new" or somesuch?
I will work on a release to PyPI with a new version and a "what's new", so if you want to add some notes to that it'll be appreciated, @WeatherGod. Stay tuned for the PR.
Thanks again for your effort on this, much appreciated!
Sure, I can add some notes
BTW, I pushed a new version to PyPI already; I might have jumped the gun.