iris icon indicating copy to clipboard operation
iris copied to clipboard

Which statistics in `iris.analysis` are lazy?

Open stefsmeets opened this issue 3 years ago • 4 comments

📰 Custom Issue

Hi everyone, we are currently working on a feature to make our multimodel calculations lazy in ESMValTool by depending on iris.analysis to perform the calculations (https://github.com/ESMValGroup/ESMValCore/pull/968). The documentation states that MEAN, STD_DEV and VARIANCE already have lazy implementations via dask.

Looking through the code, I have noticed that iris.analysis.MIN and iris.analysis.MAX also have lazy functions associated with them, but these are not mentioned in the documentation as being lazy. I'm wondering if I'm missing something or if this information is not yet available in the documentation.

We would be very interested to also make some of the other statistics lazy on our side, i.e. MEDIAN and PERCENTILE, which also have implementations available via dask.

stefsmeets avatar Feb 26 '21 15:02 stefsmeets

Hi @stefsmeets, thanks for this. I agree we ought to have something in the docstrings that tells us which aggregators are lazy. We have an open issue about doing this for functions generally (#3292), but nobody has got to it yet.

For percentiles, I have an open PR at #3901. I’d welcome any feedback on that as I’m pretty new to dask.

rcomer avatar Feb 27 '21 09:02 rcomer

It looks like dask.array.median is using numpy.median under the hood, so doesn't respect masks:

import numpy.ma as ma
import dask.array as da

arr = ma.array(range(4), mask=[0,0,0,1])
print(ma.median(arr))

larr = da.from_array(arr)
print(da.median(larr, axis=0).compute())

Output:

1.0
1.5

So I think we would need something extra to make lazy median consistent with our existing median aggregator.

rcomer avatar Feb 27 '21 13:02 rcomer

Hi @rcomer , I just noticed that a nanmedian function exists in dask. Would this be a way to make the median operation lazy in iris?

stefsmeets avatar Mar 11 '21 15:03 stefsmeets

Hi @stefsmeets, yes it looks like that should work in principle. Something like

import numpy as np
import numpy.ma as ma
import dask.array as da

def lazy_median(array, axis):
    array = array.astype(np.float_)
    nan_array = da.ma.filled(array, np.nan)
    median = da.nanmedian(nan_array, axis)
    return da.ma.fix_invalid(median)

arr = ma.array(range(4), mask=[0,0,0,1])
print(ma.median(arr))

larr = da.from_array(arr)
print(lazy_median(larr, axis=0).compute())

Output:

1.0
1.0

Though I am very much not an expert. Maybe @pp-mo has thoughts on this.

rcomer avatar Mar 11 '21 17:03 rcomer

Related: #3292

trexfeathers avatar Oct 17 '22 10:10 trexfeathers

@fnattino: This issue could be interesting for you

bouweandela avatar Nov 10 '22 08:11 bouweandela

#5066 updated all the aggregator docstrings to indicate which are lazy. PERCENTILE has been lazy since v3.3. Is there still appetite to work on MEDIAN? Obviously you can just use the 50th percentile but perhaps there is an advantage to the separate median function.

rcomer avatar Nov 23 '22 08:11 rcomer

@SciTools/peloton let's wait till 2023-01 if anyone really wants MEDIAN to be lazy. IF not, add a doc note to say "it's not lazy, but percentile is"

pp-mo avatar Nov 23 '22 10:11 pp-mo