
Dask chunking control for NetCDF loading.

pp-mo opened this pull request 3 years ago · 2 comments

Aims to address #3333

N.B. there is a reference docs-build here that shows the API

Still a bit preliminary: the testing and documentation probably need more work. But here it is for discussion. I basically set out to do this because I couldn't stop thinking about it. @cpelley @TomekTrzeciak @hdyson

I will also add some comments in the nature of a discussion... some as pointers to things we might want changed, and some as a replacement for the missing documentation (?!)

pp-mo · Feb 08 '22 20:02

API discussion

The API (reference docs-build here) is actually rather different from what I had expected to end up with. Here are some of the reasons...

I reasoned that, when loading several pieces of data, it will often make sense to align the chunking according to the dimensions. That could be because there are several datacubes with the same dimensions, but it also applies to the cube components, e.g. multiple AuxCoords. The logic of using dimensions is that different AuxCoords (for instance) may not all have the same dimensions, but if arithmetic might be performed between different ones, aligning their chunks makes good sense. A full 'chunks' argument touching all the dimensions, by contrast, can only apply to variables with one specific set of dimensions. Also, by pinning only certain dimensions, I can let the automatic operation optimise the sizes in the other dims. E.g. if I had automatic chunks of (1, 1, 100, 1000, 2000) and I then pin the 3rd dim to 3, I will instead get (1, 33, 3, 1000, 2000), which is still "optimally" sized. (TBD: this code needs proper testing!!)
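To illustrate the pinning idea in isolation (this is not the Iris implementation, just the underlying dask mechanism that such an optimisation can build on; the shape and byte limit below are made up), dask's `normalize_chunks` helper shows how fixing some dimensions lets an "auto" dimension be re-sized to keep chunks near a byte budget:

```python
import numpy as np
from dask.array.core import normalize_chunks

# Hypothetical array shape -- the text above quotes only chunk shapes.
shape = (10, 100, 100, 1000, 2000)

# Pin dim 0 to 1 and dim 2 to 3; "auto" lets dask choose dim 1's size, and
# -1 keeps each of the last two dims as a single whole-dimension chunk.
chunks = normalize_chunks(
    (1, "auto", 3, -1, -1),
    shape=shape,
    limit="200MiB",  # target byte budget per chunk
    dtype=np.float32,
)
print(chunks)  # dim 1 is expanded so that chunks stay near the limit
```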

Assuming that a totally dimension-oriented approach is not always correct, though, you need a way to give different settings to different variables: hence the var-name-selective behaviour.

It then made sense to me to apply common dimension settings to individual result cubes, which results in the logic shown.
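As a sketch of the resulting usage (the import path and exact signature are provisional -- check the linked docs-build for the real names), a dimension setting can apply across the board, or be tied to a single variable:

```python
import iris
from iris.fileformats.netcdf import CHUNK_CONTROL  # provisional import path

# Purely dimension-oriented: applies to every variable with a "pressure" dim.
with CHUNK_CONTROL.set(pressure=1):
    cubes = iris.load("mydata.nc")  # hypothetical filename

# Var-name selective: applies the setting to one named variable only.
with CHUNK_CONTROL.set("air_temperature", pressure=1):
    cube = iris.load_cube("mydata.nc", "air_temperature")
```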

Roads not taken

(1) I had expected to add loader kwargs, involving https://github.com/SciTools/iris/pull/3720. However, I instead took ideas from the unstructured-data loading control and provided a context-manager approach. One big advantage of this is that you can simply invoke existing code, perhaps with multiple load calls, without modifying that code at all (see the first sketch after this list).

(2) It would be possible to make settings apply to a filename or file-path (pathlib.PurePath.match seems a good way; see the second sketch below). This would obviously enable more selective usage when loading from a set of files, which might be needed if different files have variables with the same names. I trialled this, but it only really saves you an automatic merge, while adding complexity and explanation that I'm not sure is worth it.
(Whereas, we can't reasonably drop the var-name selectivity, since we can't avoid scanning multiple variables from a single file.)
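To make the context-manager advantage in (1) concrete, here is a sketch (hypothetical filenames, same provisional CHUNK_CONTROL import as above) of wrapping pre-existing code without editing it:

```python
import iris
from iris.fileformats.netcdf import CHUNK_CONTROL  # provisional import path

def existing_analysis():
    # Pre-existing user code, possibly making several load calls.
    winds = iris.load("winds_*.nc")  # hypothetical filenames
    temps = iris.load("temps_*.nc")
    return winds, temps

# A single context manager re-chunks every netCDF load made inside it,
# with no change to existing_analysis() itself.
with CHUNK_CONTROL.set(pressure=1):
    winds, temps = existing_analysis()
```

And for reference on (2), the right-anchored, glob-style matching that pathlib.PurePath.match provides (file names invented):

```python
from pathlib import PurePath

# match() with a relative pattern matches against the right-hand end of
# the path, so settings could have been keyed on filename patterns:
print(PurePath("/data/run1/stage1_000.nc").match("stage1_*.nc"))  # True
print(PurePath("/data/run1/other_000.nc").match("stage1_*.nc"))   # False
```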

pp-mo · Feb 08 '22 21:02

Usage examples.

This should really go in the docs "somewhere". For now, I have put usage examples in some stopgap integration tests.

An example that is working for me: solving the problem referred to here.

In that specific example ...

  • data was loaded from multiple 1-timestep files
    • with dims (realization: 18, pressure: 33, latitude: 960, longitude: 1280)
  • by loading multiple files -- e.g. 10 -- with the auto-merge in load, we get a single cube
    • with dims (time: 10, realization: 18, pressure: 33, latitude: 960, longitude: 1280)
    • which is chunked as (1, 1, 17, 960, 1280)
      • (the '17' is determined by the dask default chunksize)
  • the user then wants only one pressure level, so takes cube[:, :, 0]
    • --> (time: 10, realization: 18, latitude: 960, longitude: 1280)
    • chunked as (1, 1, 960, 1280)
  • but attempting to fetch that ~200MB of data (in ~1MB chunks) consumes several GB of memory, and can crash the machine
  • alternatively, saving it to a file (dask streaming) works, but is extremely slow (~20x the time for similar data); see the sketch after this list
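A minimal dask-only reconstruction of that chunk structure (synthetic zeros standing in for the real file reads; float32 assumed) shows where the blow-up comes from:

```python
import numpy as np
import dask.array as da

# Stand-in for the merged cube's lazy data, chunked as described above.
data = da.zeros(
    (10, 18, 33, 960, 1280),
    dtype=np.float32,
    chunks=(1, 1, 17, 960, 1280),
)

one_level = data[:, :, 0]   # keep a single pressure level
print(one_level.chunksize)  # (1, 1, 960, 1280)

# In the real file-backed case, each small output chunk still depends on a
# whole 17-level input chunk, so realising the slice reads and holds ~17x
# the data actually wanted -- hence the memory blow-up.
```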

When applying CHUNK_CONTROL.set(pressure=1) to this, the chunking of the original merged cube changes from (1, 1, 17, 960, 1280) to (1, 17, 1, 960, 1280), so that a chunk now spans several realization points but only one pressure level. This then works much better.
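In code, that fix looks something like this (same provisional import as above; the filename pattern is invented):

```python
import iris
from iris.fileformats.netcdf import CHUNK_CONTROL  # provisional import path

with CHUNK_CONTROL.set(pressure=1):
    cube = iris.load_cube("stage1_*.nc")  # hypothetical file set

# Chunks now span realization rather than pressure ...
print(cube.lazy_data().chunksize)  # (1, 17, 1, 960, 1280)
# ... so fetching a single pressure level no longer over-reads:
subset = cube[:, :, 0].data
```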

pp-mo · Feb 08 '22 21:02

As noted on the parent issue #3333, xarray already provides a (slightly less sophisticated) chunking control, which could be of use via #4994.
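For comparison, the xarray control in question is the per-dimension chunks argument to open_dataset, e.g.:

```python
import xarray as xr

# One chunk per pressure level; dimensions not named in the dict are
# chunked according to xarray's defaults for the backend.
ds = xr.open_dataset("mydata.nc", chunks={"pressure": 1})  # hypothetical file
print(ds["air_temperature"].chunks)  # hypothetical variable name
```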

pp-mo · Oct 27 '22 13:10

We're going to make this happen! But since this PR has gone rather stale, and @pp-mo has a lot of assigned work, we're going to close it and aim to apply the code to the latest Iris. See #5398 for the task breakdown.

trexfeathers · Jul 28 '23 16:07