
Should querying.getvar() implement automatic chunking?

Open navidcy opened this issue 4 years ago • 6 comments

xarray v0.16.0 includes a feature .chunk(chunks='auto') (see xarray docs)

I'm wondering whether it would be useful to apply automatic chunking to getvar()'s output before it is returned to the user.

@angus-g, @aidanheerdegen

navidcy avatar Jul 14 '20 01:07 navidcy
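For illustration, the proposal could look roughly like the sketch below. It does not touch the real cookbook code: `fake_getvar` and `with_auto_chunks` are hypothetical names, and the array shape, dimension names and chunk sizes are made up to imitate a dask-backed result with small file-aligned chunks. The only library feature relied on is xarray's `.chunk(chunks="auto")`.

```python
import functools

import dask.array as da
import xarray as xr


def fake_getvar(name):
    """Hypothetical stand-in for a loader that returns a lazy DataArray.

    The shape and the "file-aligned" chunk sizes are invented for this sketch.
    """
    data = da.zeros((120, 300, 360), chunks=(1, 100, 120), dtype="float32")
    return xr.DataArray(data, dims=("time", "yt_ocean", "xt_ocean"), name=name)


def with_auto_chunks(loader):
    """Wrap a loader so its result is rechunked with chunks='auto' before return."""

    @functools.wraps(loader)
    def wrapped(*args, **kwargs):
        return loader(*args, **kwargs).chunk(chunks="auto")

    return wrapped


auto_getvar = with_auto_chunks(fake_getvar)

original = fake_getvar("temp")
rechunked = auto_getvar("temp")

# Nothing is computed here; only the chunk layout of the lazy array changes.
print("file-aligned chunks:", original.data.npartitions)
print("auto chunks:        ", rechunked.data.npartitions)
```

Whether a wrapper like this should be the default, an opt-in keyword, or left to the user is exactly the open question of this issue.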

This defers to dask's auto chunking, which tries to get 128MiB chunks by default, whereas getvar() returns chunks aligned with those on-disk to minimise data shuffling. This is something that would need profiling: is it best to work with large chunks, which may reduce the number of nodes in the task graph by some factor at the cost of possible inter-worker communication to coalesce chunks; or is it best to leave the on-disk chunks as the unit of computation?

angus-g avatar Jul 14 '20 01:07 angus-g
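A possible starting point for that profiling, as a sketch only: the array below is a synthetic stand-in with made-up shape and chunk sizes, not real model output, and `graph_size` is a hypothetical helper. It prints the target chunk size dask is configured with, then compares chunk counts and unoptimized task-graph sizes for a simple time mean.

```python
import dask
import dask.array as da

# Target chunk size that dask's "auto" chunking aims for (configurable).
print("array.chunk-size:", dask.config.get("array.chunk-size"))

# Synthetic stand-in for file-aligned chunks; the numbers are invented.
file_aligned = da.zeros((240, 900, 1200), chunks=(1, 300, 400), dtype="float32")
auto = file_aligned.rechunk("auto")


def graph_size(arr):
    """Number of tasks in the unoptimized graph for a time mean of `arr`."""
    return len(arr.mean(axis=0).__dask_graph__())


for label, arr in [("file-aligned", file_aligned), ("auto", auto)]:
    print(f"{label}: {arr.npartitions} chunks, {graph_size(arr)} tasks for a time mean")
```

Note that the rechunked array's graph still contains the original per-chunk tasks plus the rechunk step itself, so fewer chunks downstream does not by itself mean a smaller graph or a faster computation. Timing a real workload on a real cluster would still be needed to answer the question.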

Well, these are the sort of "academic" questions I have no idea or intuition about... ;)

navidcy avatar Jul 14 '20 01:07 navidcy

Yep ... but we still need our high-level dask tutorial, right?

AndyHoggANU avatar Jul 14 '20 01:07 AndyHoggANU

I know @AndyHoggANU; sorry :( Unfortunately I haven't yet reached the point at which I can teach people anything...

navidcy avatar Jul 14 '20 02:07 navidcy

I did automatic chunking on some 0.1° data today (that had been loaded via getvar and thus already had file-aligned chunks). The resulting chunk sizes were doubled along each dimension. For the computation I was doing, it didn't seem to make a big difference, but I don't think chunking was the prime concern there anyway.

angus-g avatar Jul 14 '20 09:07 angus-g
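To check how auto chunking changes the layout of an already-chunked array, something like the following could be used. This is a sketch on synthetic data with invented shape, dimension names and chunk sizes, so the factors it prints will not necessarily match the doubling reported above for the 0.1° output.

```python
import dask.array as da
import xarray as xr

# Synthetic stand-in for a variable with file-aligned chunks (numbers made up).
data = da.zeros((120, 900, 1200), chunks=(1, 300, 400), dtype="float32")
var = xr.DataArray(data, dims=("time", "yt_ocean", "xt_ocean"))

auto = var.chunk(chunks="auto")

# Compare the largest chunk along each dimension before and after.
for dim, before, after in zip(var.dims, var.chunks, auto.chunks):
    factor = max(after) / max(before)
    print(f"{dim}: {max(before)} -> {max(after)} (x{factor:g})")
```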

Should we close this?

navidcy avatar Nov 20 '22 20:11 navidcy