climate_indices icon indicating copy to clipboard operation
climate_indices copied to clipboard

Add a --chunksize option with "inputs" as a choice

Open WeatherGod opened this issue 1 year ago • 7 comments

  • Enabling this allows one to carry the input file's chunksizes through to the output file(s).
  • Closes #476

WeatherGod avatar Sep 26 '22 14:09 WeatherGod

Don't know if we want to give the same treatment to __spi__.py or not. Also, I wonder if we might be painting ourselves into a corner with the command-line option name. What if in the future, we would want the ability to chunk the processing itself from start to finish? Chunksizes for I/O tends to be different from chunksizes for processing.

WeatherGod avatar Sep 26 '22 14:09 WeatherGod

Furthermore, do we want to have a similar ability to carry forward other encoding information like compression and such?

WeatherGod avatar Sep 26 '22 14:09 WeatherGod

Furthermore, do we want to have a similar ability to carry forward other encoding information like compression and such?

So we could have a --preserve-encoding flag which would trigger a snapshot of the data's configuration on the way in, we can then mangle the data as we see fit while processing the indices, then assemble the output in a way that preserves the encoding settings? This almost seems like it should be the default behavior, especially since my plan is to eventually remove the NCO code which makes an attempt to optimize the data before processing. That is the most problematic part of this code (well over 50% of the user help requests are a result of NCO issues, especially on Windows), and we can instead take a less "batteries included" approach and expect users to know how they want their data organized, and do the re-orientation/optimization internally with numpy rather than on the NetCDF itself, though it might require a bit more bookkeeping.

monocongo avatar Sep 26 '22 15:09 monocongo

Indeed. Now that xarray and dask has significantly matured, a lot of the code here can eventually be removed, along with many of its restrictions. I will note that there is an advantage to having an explicit step that specifically reorganizes the data in a manner that is optimal for time-series analysis. The rechunker project (https://rechunker.readthedocs.io/en/latest/) talks about this a fair bit, but unfortunately, it is geared towards ZArr format rather than netcdf.

I'm thinking that --preserve-encoding would be the general, all-around way to preserve such information, while --chunksizes inputs would be the way to do just the chunksizes (or even, in the future, change the chunking on-the-fly). Other arguments could introduce compression options, again, overriding the compression-specific encodings from a --preserve-encoding flag.

WeatherGod avatar Sep 26 '22 16:09 WeatherGod

Here is what the encoding looks like for the PET variable in one of my files as a reference:

{'zlib': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (103, 240, 12), 'source': '.(redacted)/pet.nc', 'original_shape': (721, 1440, 872), 'dtype': dtype('float32'), '_FillValue': nan}

Some of this stuff shouldn't get "preserved" such as the 'source'. 'dtype' is a bit questionable.

WeatherGod avatar Sep 26 '22 17:09 WeatherGod

Other than two test cases that are now somehow off I can approve this. The differences are small/minor from what I can see so far, probably if we compare the new results against the old we won't see anything obvious, but I've not gone that far with it yet. I had trouble getting HDF5/NetCDF etc. installed on the machine where I ran the tests, so I'm still not sure if anything like that may be off causing the minor discrepancies I've seen so far in the pytest output, but the overall approach here for preserving the chunk sizes looks solid.

monocongo avatar Oct 10 '22 16:10 monocongo

Yeah, you're right, I shouldn't let perfect be the enemy of good enough. Is there anyplace else you want this documented? Like a "what's new" or somesuch?

WeatherGod avatar Oct 14 '22 19:10 WeatherGod