Poor performance of iris with latest dask
🐛 Bug Report
Reading netcdf and pp files in iris with the latest version of dask is slower and takes a lot more memory.
I'm not familiar enough with the mechanics of iris loading to understand whether this is a dask issue or an iris issue. I suspect it may be dask; if so, I'd appreciate any help in pinpointing the specific thing I need to raise a bug report with dask about.
The attached script creates a 4D cube full of random numbers and saves to pp and netcdf. It then loads the file and performs the following operations:
load&extract: constrain to a single time and latitude when loading, then take the longitude mean:
```python
cube = iris.load_cube(file, t_const & lat_const)
coll = cube.collapsed('longitude', iris.analysis.MEAN)
coll.data  # realise the data
```
load,extract: take the longitude mean, then constrain to a single time and latitude:
```python
cube = iris.load_cube(file)
coll = cube.collapsed('longitude', iris.analysis.MEAN)
ext = coll.extract(t_const & lat_const)
ext.data  # realise the data
```
Both should end up with identical results. Here is the time taken and peak memory usage for iris 3.2 on my machine:
netcdf file:
| operation | dask 2021.6 | dask 2022.2 |
|---|---|---|
| load&extract | 0.15 s, 1 MB | 0.36 s, 204 MB |
| load,extract | 0.3 s, 204 MB | 0.34 s, 204 MB |
- load&extract is more than 2x slower
- load&extract used to be lazy; now the data is realised (see the rough sizing sketch below for scale)
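For scale, the full test array is 20 × 19 × 325 × 432 float64 values, so the ~204 MB peak corresponds to roughly half of the data being held in memory at once. A rough back-of-envelope calculation, not a measurement from the runs above:

```python
import numpy as np

# Rough size of the full test array, for comparison with the peak-memory figures.
nbytes = 20 * 19 * 325 * 432 * np.dtype("float64").itemsize
print(f"{nbytes / 1e6:.0f} MB")  # ~427 MB
```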
pp file:
| operation | dask 2021.6 | dask 2022.2 |
|---|---|---|
| load&extract | 5.18 s, 13 MB | 5.76 s, 5 MB |
| load,extract | 1.81 s, 14 MB | 2.16 s, 7 MB |
- Both operations are slower
- Why is load,extract faster than load&extract, despite doing far more work? It takes the mean at every time and latitude point yet is still more than 2x faster. This is not specific to the latest dask version; it applies to all versions.
How To Reproduce
Steps to reproduce the behaviour:
- Download the file
- `mamba create -y -c conda-forge -n iris-3.2-dask-2021.6 iris=3.2 dask=2021.6.0`, then run the script in that environment.
- `mamba create -y -c conda-forge -n iris-3.2-dask-2022.2 iris=3.2 dask=2022.2.0`, then run the script in that environment.
Expected behaviour
Performance should be better in later versions of iris and dask.
Environment
- OS & Version: linux, conda environment
- Iris Version: 3.2
Analysis script
Expand for full script to cut-paste
```python
from contextlib import contextmanager
import sys
import tracemalloc
import time

import cf_units
import iris
import iris.analysis
import iris.coords
import iris.cube
import numpy as np


def log(msg):
    print(msg, file=sys.stderr)


class PerfMon:
    """A simple time and memory profiler."""

    def __init__(self):
        self.runs = []

    @contextmanager
    def __call__(self, tag: str):
        # Record wall-clock time and the tracemalloc peak for the enclosed block.
        tracemalloc.start()
        start = time.perf_counter()
        yield
        stop = time.perf_counter()
        _, peak_mem = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        data = (stop - start, peak_mem / 1024**2)
        self.runs.append((tag, data))

    def results(self):
        print("{:<20s} {:>12s} {:>12s}".format("tag", "time (s)", "mem (mb)"))
        for tag, data in self.runs:
            print("{:20s} {:12.2f} {:12.0f}".format(tag, *data))


nt = 20
nz = 19
ny = 325
nx = 432

log('. generating data')
c = iris.cube.Cube(
    np.random.normal(size=(nt, nz, ny, nx)),
    dim_coords_and_dims=[
        (iris.coords.DimCoord(np.arange(nt), 'time', units=cf_units.Unit('days since 1970-01-01 00:00:00', calendar='gregorian')), 0),
        (iris.coords.DimCoord(np.arange(nz), 'height'), 1),
        (iris.coords.DimCoord(np.linspace(-90, 90, ny, endpoint=True), 'latitude'), 2),
        (iris.coords.DimCoord(np.linspace(0., 360, nx, endpoint=False), 'longitude'), 3),
    ],
    standard_name='atmosphere_relative_vorticity',
)
t_point = c.coord('time').cell(4)
t_const = iris.Constraint(time=lambda x: x == t_point)
lat_point = c.coord('latitude').cell(45)
lat_const = iris.Constraint(latitude=lambda x: x == lat_point)

log('. saving test files')
iris.save(c, 'test.pp')
iris.save(c, 'test.nc')

# do some very simple performance tests:
# 1. load the file
# 2. calculate the zonal mean
# 3. realise the data
#
# compare:
# - pp vs netcdf performance
# - extracting (t, lat) at load or after collapse
perf = PerfMon()

log('. perf test 1')
with perf('pp load&extract'):
    cube = iris.load_cube('test.pp', t_const & lat_const)
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll
    res.data  # realise the data

log('. perf test 2')
with perf('pp load,extract'):
    cube = iris.load_cube('test.pp')
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll.extract(t_const & lat_const)
    res.data

log('. perf test 3')
with perf('nc load&extract'):
    cube = iris.load_cube('test.nc', t_const & lat_const)
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll
    res.data

log('. perf test 4')
with perf('nc load,extract'):
    cube = iris.load_cube('test.nc')
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll.extract(t_const & lat_const)
    res.data

perf.results()
```
> Why is load,extract faster than load&extract, despite doing far more work?
@bjlittle explained to me last week that constraints during FF/PP load are applied once to each field, resulting in many very small chunks of memory - suboptimal. By contrast, using extract allows Dask to load optimally sized memory chunks - many fields at once - and apply the constraint once per chunk.
As for Dask: for me this is just more support for #4572 and similar. Dask's default chunk handling changes frequently, AFAICT based on the assumption that users can flexibly adapt the chunking in their scripts. As we know, Iris currently buries chunking control inside the loading automation, so each Dask improvement risks becoming a 'breaking performance change'.
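A quick way to see what each environment is actually doing is to inspect the lazy array behind the loaded cube. A minimal sketch, assuming the test.nc and test.pp files produced by the analysis script above:

```python
import iris

# Sketch: inspect how dask has chunked the lazy data for each file type.
# Assumes the test.nc and test.pp files produced by the analysis script above.
for path in ("test.nc", "test.pp"):
    cube = iris.load_cube(path)
    lazy = cube.lazy_data()  # the underlying dask array; the cube stays lazy
    print(path, "shape:", lazy.shape,
          "chunksize:", lazy.chunksize,
          "n_chunks:", lazy.npartitions)
```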
PP load&extract gets a lot quicker if you simplify the latitude constraint:
```python
lat_point = c.coord('latitude').cell(45)
lat_const = iris.Constraint(latitude=lat_point.point)
```
Possibly because of https://github.com/SciTools/iris/blob/caae45075795c1e13fbb0f04488c3b99f15b60e8/lib/iris/_constraints.py#L308-L315
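As a rough check of that, one could time the constrained PP load with each form of the constraint. A sketch only, reusing test.pp, t_const and lat_point from the analysis script above:

```python
import time

import iris
import iris.analysis

# Sketch: compare a lambda-based latitude constraint with a plain point-value
# constraint on the PP load&extract path. Reuses test.pp, t_const and
# lat_point from the analysis script above.
constraints = {
    "lambda": iris.Constraint(latitude=lambda x: x == lat_point),
    "point": iris.Constraint(latitude=lat_point.point),
}
for name, lat_const in constraints.items():
    start = time.perf_counter()
    cube = iris.load_cube('test.pp', t_const & lat_const)
    cube.collapsed('longitude', iris.analysis.MEAN).data
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```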
> load&extract used to be lazy; now the data is realised
@jamesp where are you seeing this? When I run your code and then ask if the cube is lazy, it claims to be.
I'm seeing it in the memory footprint recorded by tracemalloc, shown in the table above. The numbers suggest that a lot more of the cube is being realised at some point, or else my crude method of measuring memory usage is wrong.
From our recent dash on MeshCoords (#4749) I understand a bit more about how the real/lazy mechanics of iris work: it seems very possible that I could realise some/all of the data in a cube and yet it would still claim to be lazy.
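One way to see both signals together is to check has_lazy_data() at each stage while tracemalloc is running. A sketch, reusing test.nc and the constraints from the analysis script above:

```python
import tracemalloc

import iris
import iris.analysis

# Sketch: report whether the cube claims to be lazy at each stage, alongside
# the tracemalloc peak. Reuses test.nc, t_const and lat_const from the
# analysis script above.
tracemalloc.start()
cube = iris.load_cube('test.nc', t_const & lat_const)
print("after load:", cube.has_lazy_data())
coll = cube.collapsed('longitude', iris.analysis.MEAN)
print("after collapse:", coll.has_lazy_data())
coll.data  # realise the result
print("after .data:", coll.has_lazy_data())
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak memory: {peak / 1024**2:.0f} MB")
```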
PP load with various environments, using the user-provided file that prompted @jamesp's concern:
| environment | pp load&extract (time, peak mem) | pp load,extract (time, peak mem) |
|---|---|---|
| iris-3.2-dask-2021.6 | 44.44 s, 15 MB | 16.90 s, 56 MB |
| iris-3.2-dask-2022.2 | 46.81 s, 8 MB | 17.45 s, 61 MB |
| default-current | 26.55 s, 8 MB | 16.84 s, 71 MB |
| production-os45-1 | 30.28 s, 14 MB | 18.32 s, 143 MB |
We also worked out earlier that the netCDF file isn't realising everything; it's just pulling the entire chunk into memory (i.e. with a 2 GB file it still pulls in about 200 MB, not the full 2 GB).
> the netCDF file isn't realising everything; it's just pulling the entire chunk into memory

Oh that's fine then, it's only the worst of both worlds 😆

> Oh that's fine then, it's only the worst of both worlds 😆

I don't know, I thought it would have to pull things in at least at the chunk level to read them, so it seems kind of alright? (at least it doesn't pull 2 GB into memory - I reckon that's a win! :p)
Oh you're right, I misinterpreted this - I'm too ready to hear "poor chunking means the whole file is loaded".
> I thought it would have to pull things in at least at the chunk level to read them, so it seems kind of alright?
Well yes, of course it must load "some few" chunks at some time.
The chunk-size default is now configurable via Dask, where it defaults to (I believe) about 100 MB:
code here gets the target size from dask
default:
```python
>>> import dask.config
>>> import dask.array
>>> dask.config.get("array.chunk-size")
'128MiB'
>>>
```
In practice, I've seen that you can expect about 3x this.
But I suspect it may not be deterministic how soon the memory is released afterwards: it may behave somewhat like the Python garbage collector.
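For what it's worth, that default can be changed through dask's own configuration, which is one way to experiment with its effect on the load&extract memory footprint. A sketch using dask's documented array.chunk-size key (not an iris API), and assuming iris picks the value up when it builds the lazy array at load time:

```python
import dask.config
import iris

# Sketch: temporarily lower dask's target chunk size before loading, then
# check the resulting chunking. Uses dask's documented "array.chunk-size"
# config key; assumes iris reads it when building the lazy array at load time.
with dask.config.set({"array.chunk-size": "32MiB"}):
    cube = iris.load_cube('test.nc', t_const & lat_const)
    print(cube.lazy_data().chunksize)
```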
In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.
If this issue is still important to you, then please comment on this issue and the stale label will be removed.
Otherwise this issue will be automatically closed in 28 days time.