
NetCDF chunking using extra memory with Iris 2.4

Open alastair-gemmell opened this issue 3 years ago • 11 comments

On behalf of Scitools user(s):

I have an Iris/Dask/NetCDF question about chunking data and collapsing dimensions, in this case to compute ensemble statistics (ensemble mean/median/standard deviation fields).

The NetCDF files that I'm using have chunking that is optimized for this operation. Rather than using the NetCDF file's chunking, Iris/Dask is putting each ensemble member into its own chunk. I think this is resulting in my code trying to read all of the data into memory at once. Is there a simple way to make Iris use the NetCDF file's chunking when it loads NetCDF data into a Dask array?

Also, the memory requirement for this and other operations seems to have gone up noticeably when moving from Iris 2.2 to Iris 2.4; I've been busy doubling my memory allocations to allow for it. Running on Iris 2.2 this task took a bit over an hour to compute with 16 GB RAM allocated. With Iris 2.4 it no longer fits into that memory or within a 2-hour job.

To recreate issue:

First, I created two Conda installations, one with Iris v2.2 and one with Iris v2.4:

conda create --name iris-v2.2 -c conda-forge python=3.7 iris=2.2 memory_profiler
conda create --name iris-v2.4 -c conda-forge python=3.7 iris=2.4 memory_profiler

Then, after activating each environment, I ran (python code below):

python -m mprof run dask_issue.py
mprof plot

For Iris v2.2:

Loading the data:
    Has lazy data: True
    Chunksize: (1, 1, 36, 72)

Extracting a single year:
    Has lazy data: True
    Chunksize: (1, 1, 36, 72)

Realising the data of the single year:
    Has lazy data: False
    Chunksize: (9, 12, 36, 72)

For Iris v2.4:

Loading the data:
    Has lazy data: True
    Chunksize: (1, 2028, 36, 72)

Extracting a single year:
    Has lazy data: True
    Chunksize: (1, 12, 36, 72)

Realising the data of the single year:
    Has lazy data: False
    Chunksize: (9, 12, 36, 72)

The plots produced by memory_profiler show that the memory usage when realising the extracted data under Iris v2.4 is over 5 times that under Iris v2.2. The time taken to realise the data also increases. The cube prior to extraction contains data from 1850 to 2018. Is it possible that all the data (rather than just the year that was extracted) are being realised when using Iris v2.4?
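[Editor's note] The chunk shapes above already hint at the answer: with v2.4's load-time chunks spanning the whole time axis, a one-year slice still depends on full-length source chunks. A minimal Dask sketch of that dependency, using stand-in shapes and an in-memory array rather than the actual files:

import numpy as np
import dask.array as da

# Model of the v2.4 situation: one chunk spans the whole 2028-step time axis.
full = da.from_array(np.zeros((9, 2028, 36, 72), dtype="f4"),
                     chunks=(1, 2028, 36, 72))
year = full[:, 600:612]          # analogous to extracting a single year

print(year.chunks[1])            # (12,) - the result looks small...
# ...but each of its tasks still depends on a full (1, 2028, 36, 72) source
# chunk, so computing it touches the entire time series of every member.
print(len(full.__dask_graph__()), len(year.__dask_graph__()))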

dask_issue.py code referenced above:

#!/usr/bin/env python
import iris


def main():
    print('Loading the data:')
    cube = load_cube()
    print_cube_info(cube)

    print('Extracting a single year:')
    cube = extract_year(cube)
    print_cube_info(cube)

    print('Realising the data of the single year:')
    realise_data(cube)
    print_cube_info(cube)


@profile
def load_cube():
    filename = ('[path]/HadSST4/analysis/HadCRUT.5.0.0.0.SST.analysis.anomalies.?.nc')
    cube = iris.load_cube(filename)
    return cube


@profile
def extract_year(cube):
    year = 1999
    time_constraint = iris.Constraint(
        time=lambda cell: cell.point.year == year)
    cube = cube.extract(time_constraint)
    return cube


@profile
def realise_data(cube):
    cube.data


def print_cube_info(cube):
    tab = ' ' * 4
    print(f'{tab}Has lazy data: {cube.has_lazy_data()}')
    print(f'{tab}Chunksize: {cube.lazy_data().chunksize}')


if __name__ == '__main__':
    main()

alastair-gemmell avatar Apr 26 '21 08:04 alastair-gemmell

Looks like NetCDF chunk handling was changed in Iris between v2.2 and v2.4, here: https://github.com/SciTools/iris/pull/3361. That might be a good place to start looking; it's possible that that change had an uncaught consequence, and this is what's reported here.

For the sake of the original reporter...

Is there a simple way to make Iris use the NetCDF file's chunking when it loads NetCDF data into a Dask array?

Iris already does - this was introduced in #3131.

@alastair-gemmell have you checked what the behaviour is in Iris v3?

Something else that would be interesting to know is what NetCDF reports the dataset chunking as for these problem files. For example:

import netCDF4
ds = netCDF4.Dataset("/path/to/my-file.nc")
print(ds.variables["my-data-var-name"].chunking())
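And, for comparison, the chunks that Iris/Dask actually settled on can be read off the loaded cube (a sketch; the path is a placeholder, as above):

import iris

cube = iris.load_cube("/path/to/my-file.nc")
# Chunks chosen by Iris/Dask at load time, to compare against the file's
# own chunking reported by netCDF4 above.
print(cube.lazy_data().chunks)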

DPeterK avatar Apr 26 '21 10:04 DPeterK

Looks like NetCDF chunk handling was changed in Iris between v2.2 and v2.4, here: https://github.com/SciTools/iris/pull/3361. That might be a good place to start looking; it's possible that that change had an uncaught consequence, and this is what's reported here.

I did some cursory analysis on this problem and I can confirm it stems from the changes made in #3361 - if I substitute the v2.2.0 version of _lazy_data.py into the latest Iris the behaviour replicates the v2.2 behaviour described.

trexfeathers avatar Apr 26 '21 10:04 trexfeathers

I suspect we (IMPROVER) are also affected by this issue, see https://github.com/metoppv/improver/issues/1579. We are seeing an order of magnitude slowdown in some cases due to the unnecessary reading of data :(

cpelley avatar Oct 11 '21 15:10 cpelley

I suspect we (IMPROVER) are also affected by this issue...

Confirmed this issue is what is affecting us - substituting as_lazy_data as per https://github.com/SciTools/iris/issues/4107#issuecomment-826720339 results in considerable performance improvements.

cpelley avatar Oct 12 '21 09:10 cpelley

This is the most frustrating sort of software issue 😆

We made the change to fix known performance issues for some users. Many other users were silent before because performance was acceptable for them; now they are the ones experiencing performance issues.

For us to have been aware before making the change, we would have needed a very comprehensive benchmark suite, and unfortunately that wasn't in place at the time nor is there resource to develop it currently.

The potential solution I'm aware of that offers acceptable performance for everyone is much greater user configurability of chunking, but that doesn't appear to be a simple thing to deliver (#3333).
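[Editor's note] One coarse knob that already exists is Dask's own chunk-size setting, which the heuristic introduced in #3361 uses as its size limit. Whether this helps is version-dependent (it is not the per-variable configurability discussed above), so treat the following as an unverified sketch:

import dask
import iris

# Hypothetical, version-dependent mitigation: lower Dask's "array.chunk-size"
# limit so that Iris's load-time chunking heuristic aims for smaller chunks,
# closer to the file's own chunking. Verify against your Iris version.
with dask.config.set({"array.chunk-size": "32MiB"}):
    cube = iris.load_cube("/path/to/my-file.nc")

print(cube.lazy_data().chunks)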

trexfeathers avatar Oct 12 '21 09:10 trexfeathers

Thanks for getting back to me on this @trexfeathers, appreciated.

The potential solution I'm aware of that offers acceptable performance for everyone is much greater user configurability of chunking, but that doesn't appear to be a simple thing to deliver (#3333).

Yes. I think this has always been my issue with dask - necessarily having to expose such low-level parameters to end users :( It seems dask expects you to know what you will be giving it (what form it takes) and what you will be doing with it ahead of time.

Is there no possibility (development path in dask) where dask arrays can be made to genuinely re-chunk the original dask array? (they are currently immutable)

FYI @benfitzpatrick, @BelligerG

cpelley avatar Oct 12 '21 12:10 cpelley

Is there no possibility (development path in dask) where dask arrays can be made to genuinely re-chunk the original dask array? (they are currently immutable)

@cpelley I had thought the root of the problem was the shape of re-chunking. Without knowing the 'correct' chunking shape, wouldn't the quoted idea still manifest the same problem?

trexfeathers avatar Oct 12 '21 12:10 trexfeathers

Possibly some progress: #4572 aims to address #3333.

@cpelley Is there no possibility (development path in dask) where dask arrays can be made to genuinely re-chunk the original dask array? (they are currently immutable)

I really don't think this is likely. Chunking as a static, ahead-of-time decision is deeply wired into Dask. Because, I think, its concept of a graph requires that chunks behave like 'delayed' objects -- i.e. they are graph elements representing a black-box task with a single result. So, the chunks make the graph, and different chunks will mean a new graph.
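[Editor's note] A minimal illustration of that point, with made-up shapes:

import dask.array as da

# Rechunking creates a new collection with a new graph layered on top of the
# old one; it does not change how the original chunks are defined or read.
a = da.ones((9, 2028, 36, 72), chunks=(1, 2028, 36, 72))
b = a.rechunk((1, 12, 36, 72))

print(a.chunks[1])       # (2028,)
print(b.chunks[1][:3])   # (12, 12, 12)
# b's graph still contains the tasks that build a's full-size chunks.
print(len(a.__dask_graph__()), len(b.__dask_graph__()))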

pp-mo avatar Feb 09 '22 12:02 pp-mo

@ehogan & @cpmorice you were the ones that originally brought this to our attention, thanks very much for your vigilance!

We have an open internal support ticket for it, but this has since moved well beyond support - we diagnosed the precise cause, and as you can see here it's difficult to find a resolution that gives everyone acceptable performance.

I'm therefore closing the internal support ticket. I wanted to tag you both here so that you can monitor our latest thoughts on the issue, should you wish.

trexfeathers avatar Sep 14 '22 15:09 trexfeathers

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days time.

github-actions[bot] avatar Jan 28 '24 00:01 github-actions[bot]

Gonna keep this open until the release of 3.8 (#5363), since #5588 should allow users to avoid this problem. Would be good to know if this works in action.
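[Editor's note] For anyone wanting to try it, a sketch of that opt-out, assuming the chunk-control API from #5588 ships in Iris 3.8 roughly as proposed (names are taken from that PR and may differ in the final release):

import iris
from iris.fileformats.netcdf.loader import CHUNK_CONTROL

# Adopt the chunking stored in the NetCDF file itself, instead of letting
# Iris apply its own chunking heuristic.
with CHUNK_CONTROL.from_file():
    cube = iris.load_cube("/path/to/my-file.nc")

print(cube.lazy_data().chunks)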

trexfeathers avatar Jan 29 '24 09:01 trexfeathers