Poor performance of iris with latest dask
🐛 Bug Report
Reading netcdf and pp files in iris with the latest version of dask is slower and takes a lot more memory.
I'm not familiar enough with the mechanics of iris loading to understand whether this is a dask issue or an iris issue. I suspect it may be dask; if so, I'd appreciate any help in pinpointing the specific thing I need to raise a bug report with dask about.
The attached script creates a 4D cube full of random numbers and saves to pp and netcdf. It then loads the file and performs the following operations:
load&extract: constrain to a single time and latitude when loading, then take the longitude mean:
```python
cube = iris.load_cube(file, t_const & lat_const)
coll = cube.collapsed('longitude', iris.analysis.MEAN)
coll.data  # realise the data
```
load,extract: take the longitude mean, then constrain to a single time and latitude:
```python
cube = iris.load_cube(file)
coll = cube.collapsed('longitude', iris.analysis.MEAN)
ext = coll.extract(t_const & lat_const)
ext.data  # realise the data
```
Both should end up with identical results. Here is the time taken and peak memory usage for iris 3.2 on my machine:
netcdf file:
| operation | dask 2021.6 | dask 2022.2 |
|---|---|---|
| load&extract | 0.15 s, 1 MB | 0.36 s, 204 MB |
| load,extract | 0.3 s, 204 MB | 0.34 s, 204 MB |
- load&extract is more than 2x slower
- load&extract used to be lazy; now the data is realised (see the rough sizing sketch below for scale)
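For scale, the full test array is 20 × 19 × 325 × 432 float64 values, so the ~204 MB peak corresponds to roughly half of the data being held in memory at once. A rough back-of-envelope calculation, not a measurement from the runs above:

```python
import numpy as np

# Rough size of the full test array, for comparison with the peak-memory figures.
nbytes = 20 * 19 * 325 * 432 * np.dtype("float64").itemsize
print(f"{nbytes / 1e6:.0f} MB")  # ~427 MB
```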
pp file:
| operation | dask 2021.6 | dask 2022.2 |
|---|---|---|
| load&extract | 5.18 s, 13 MB | 5.76 s, 5 MB |
| load,extract | 1.81 s, 14 MB | 2.16 s, 7 MB |
- Both operations are slower
- Why is load,extract faster than load&extract, despite doing far more work? It takes the mean at every time and latitude point yet is still more than 2x faster. This is not specific to the latest dask version; it applies to all versions.
How To Reproduce
Steps to reproduce the behaviour:
- Download the file
- `mamba create -y -c conda-forge -n iris-3.2-dask-2021.6 iris=3.2 dask=2021.6.0`, then run the script in that environment.
- `mamba create -y -c conda-forge -n iris-3.2-dask-2022.2 iris=3.2 dask=2022.2.0`, then run the script in that environment.
Expected behaviour
Performance should be better in later versions of iris and dask.
Environment
- OS & Version: linux, conda environment
- Iris Version: 3.2
Analysis script
Expand for full script to cut-paste
```python
from contextlib import contextmanager
import sys
import tracemalloc
import time

import cf_units
import iris
import iris.analysis
import iris.coords
import iris.cube
import numpy as np


def log(msg):
    print(msg, file=sys.stderr)


class PerfMon:
    """A simple time and memory profiler."""

    def __init__(self):
        self.runs = []

    @contextmanager
    def __call__(self, tag: str):
        # Record wall-clock time and the tracemalloc peak for the enclosed block.
        tracemalloc.start()
        start = time.perf_counter()
        yield
        stop = time.perf_counter()
        _, peak_mem = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        data = (stop - start, peak_mem / 1024**2)
        self.runs.append((tag, data))

    def results(self):
        print("{:<20s} {:>12s} {:>12s}".format("tag", "time (s)", "mem (mb)"))
        for tag, data in self.runs:
            print("{:20s} {:12.2f} {:12.0f}".format(tag, *data))


nt = 20
nz = 19
ny = 325
nx = 432

log('. generating data')
c = iris.cube.Cube(
    np.random.normal(size=(nt, nz, ny, nx)),
    dim_coords_and_dims=[
        (iris.coords.DimCoord(np.arange(nt), 'time', units=cf_units.Unit('days since 1970-01-01 00:00:00', calendar='gregorian')), 0),
        (iris.coords.DimCoord(np.arange(nz), 'height'), 1),
        (iris.coords.DimCoord(np.linspace(-90, 90, ny, endpoint=True), 'latitude'), 2),
        (iris.coords.DimCoord(np.linspace(0., 360, nx, endpoint=False), 'longitude'), 3),
    ],
    standard_name='atmosphere_relative_vorticity',
)
t_point = c.coord('time').cell(4)
t_const = iris.Constraint(time=lambda x: x == t_point)
lat_point = c.coord('latitude').cell(45)
lat_const = iris.Constraint(latitude=lambda x: x == lat_point)

log('. saving test files')
iris.save(c, 'test.pp')
iris.save(c, 'test.nc')

# do some very simple performance tests:
# 1. load the file
# 2. calculate the zonal mean
# 3. realise the data
#
# compare:
# - pp vs netcdf performance
# - extracting (t, lat) at load or after collapse
perf = PerfMon()

log('. perf test 1')
with perf('pp load&extract'):
    cube = iris.load_cube('test.pp', t_const & lat_const)
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll
    res.data  # realise the data

log('. perf test 2')
with perf('pp load,extract'):
    cube = iris.load_cube('test.pp')
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll.extract(t_const & lat_const)
    res.data

log('. perf test 3')
with perf('nc load&extract'):
    cube = iris.load_cube('test.nc', t_const & lat_const)
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll
    res.data

log('. perf test 4')
with perf('nc load,extract'):
    cube = iris.load_cube('test.nc')
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll.extract(t_const & lat_const)
    res.data

perf.results()
```
> Why is load,extract faster than load&extract, despite doing far more work?
@bjlittle explained to me last week that constraints during FF/PP load are applied once to each field, resulting in many very small chunks of memory - suboptimal. By contrast, using extract allows Dask to load optimally sized memory chunks - many fields at once - and apply the constraint once per chunk.
As for Dask: for me this is just more support for #4572 and similar. Dask's default chunk handling changes frequently, AFAICT based on the assumption that users can flexibly adapt the chunking in their scripts. As we know, Iris currently buries chunking control inside the loading automation, so each Dask improvement risks becoming a 'breaking performance change'.
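A quick way to see what each environment is actually doing is to inspect the lazy array behind the loaded cube. A minimal sketch, assuming the test.nc and test.pp files produced by the analysis script above:

```python
import iris

# Sketch: inspect how dask has chunked the lazy data for each file type.
# Assumes the test.nc and test.pp files produced by the analysis script above.
for path in ("test.nc", "test.pp"):
    cube = iris.load_cube(path)
    lazy = cube.lazy_data()  # the underlying dask array; the cube stays lazy
    print(path, "shape:", lazy.shape,
          "chunksize:", lazy.chunksize,
          "n_chunks:", lazy.npartitions)
```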
PP load&extract gets a lot quicker if you simplify the latitude constraint:
```python
lat_point = c.coord('latitude').cell(45)
lat_const = iris.Constraint(latitude=lat_point.point)
```
Possibly because of https://github.com/SciTools/iris/blob/caae45075795c1e13fbb0f04488c3b99f15b60e8/lib/iris/_constraints.py#L308-L315
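As a rough check of that, one could time the constrained PP load with each form of the constraint. A sketch only, reusing test.pp, t_const and lat_point from the analysis script above:

```python
import time

import iris
import iris.analysis

# Sketch: compare a lambda-based latitude constraint with a plain point-value
# constraint on the PP load&extract path. Reuses test.pp, t_const and
# lat_point from the analysis script above.
constraints = {
    "lambda": iris.Constraint(latitude=lambda x: x == lat_point),
    "point": iris.Constraint(latitude=lat_point.point),
}
for name, lat_const in constraints.items():
    start = time.perf_counter()
    cube = iris.load_cube('test.pp', t_const & lat_const)
    cube.collapsed('longitude', iris.analysis.MEAN).data
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```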
> load&extract used to be lazy; now the data is realised
@jamesp where are you seeing this? When I run your code and then ask if the cube is lazy, it claims to be.
I'm seeing it in the memory footprint recorded by tracemalloc, shown in the table above. The numbers suggest that a lot more of the cube is being realised at some point, or else my crude method of measuring memory usage is wrong.
From our recent dash on MeshCoords (#4749) I understand a bit more about how the real/lazy mechanics of iris work: it seems very possible that I could realise some/all of the data in a cube and yet it would still claim to be lazy.
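One way to see both signals together is to check has_lazy_data() at each stage while tracemalloc is running. A sketch, reusing test.nc and the constraints from the analysis script above:

```python
import tracemalloc

import iris
import iris.analysis

# Sketch: report whether the cube claims to be lazy at each stage, alongside
# the tracemalloc peak. Reuses test.nc, t_const and lat_const from the
# analysis script above.
tracemalloc.start()
cube = iris.load_cube('test.nc', t_const & lat_const)
print("after load:", cube.has_lazy_data())
coll = cube.collapsed('longitude', iris.analysis.MEAN)
print("after collapse:", coll.has_lazy_data())
coll.data  # realise the result
print("after .data:", coll.has_lazy_data())
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak memory: {peak / 1024**2:.0f} MB")
```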
PP load with various environments, using the user-provided file that prompted @jamesp's concern:
| environment | pp load&extract (time, peak mem) | pp load,extract (time, peak mem) |
|---|---|---|
| iris-3.2-dask-2021.6 | 44.44 s, 15 MB | 16.90 s, 56 MB |
| iris-3.2-dask-2022.2 | 46.81 s, 8 MB | 17.45 s, 61 MB |
| default-current | 26.55 s, 8 MB | 16.84 s, 71 MB |
| production-os45-1 | 30.28 s, 14 MB | 18.32 s, 143 MB |
We also worked out earlier that the netCDF file isn't realising everything; it's just pulling the entire chunk into memory (i.e. with a 2 GB file it still pulls in about 200 MB, not the full 2 GB).
> the netCDF file isn't realising everything; it's just pulling the entire chunk into memory

Oh that's fine then, it's only the worst of both worlds 😆

> Oh that's fine then, it's only the worst of both worlds 😆

I don't know, I thought it would have to pull things in at least at the chunk level to read them, so it seems kind of alright? (at least it doesn't pull 2 GB into memory - I reckon that's a win! :p)
Oh you're right, I misinterpreted this - I'm too ready to hear "poor chunking means the whole file is loaded".
> I thought it would have to pull things in at least at the chunk level to read them, so it seems kind of alright?
Well yes, of course it must load "some few" chunks at some time.
The chunk-size default is now configurable via Dask, where it defaults to (I believe) about 100 MB:
code here gets the target size from dask
default:
```python
>>> import dask.config
>>> import dask.array
>>> dask.config.get("array.chunk-size")
'128MiB'
>>>
```
In practice, I've seen that you can expect about 3x this.
But I suspect it may not be deterministic how soon the memory is released afterwards: it may behave somewhat like the Python garbage collector.
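For what it's worth, that default can be changed through dask's own configuration, which is one way to experiment with its effect on the load&extract memory footprint. A sketch using dask's documented array.chunk-size key (not an iris API), and assuming iris picks the value up when it builds the lazy array at load time:

```python
import dask.config
import iris

# Sketch: temporarily lower dask's target chunk size before loading, then
# check the resulting chunking. Uses dask's documented "array.chunk-size"
# config key; assumes iris reads it when building the lazy array at load time.
with dask.config.set({"array.chunk-size": "32MiB"}):
    cube = iris.load_cube('test.nc', t_const & lat_const)
    print(cube.lazy_data().chunksize)
```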
In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.
If this issue is still important to you, then please comment on this issue and the stale label will be removed.
Otherwise this issue will be automatically closed in 28 days time.