iris
iris copied to clipboard
Which statistics in `iris.analysis` are lazy?
📰 Custom Issue
Hi everyone, we are currently working on a feature to make our multimodel calculations lazy in ESMValTool by depending on iris.analysis
to perform the calculations (https://github.com/ESMValGroup/ESMValCore/pull/968). The documentation states that MEAN
, STD_DEV
and VARIANCE
already have lazy implementations via dask
.
Looking through the code, I have noticed that iris.analysis.MIN
and iris.analysis.MAX
also have lazy functions associated with them, but these are not mentioned in the documentation as being lazy. I'm wondering if I'm missing something or if this information is not yet available in the documentation.
We would be very interested to also make some of the other statistics lazy on our side, i.e. MEDIAN
and PERCENTILE
, which also have implementations available via dask.
Hi @stefsmeets, thanks for this. I agree we ought to have something in the docstrings that tells us which aggregators are lazy. We have an open issue about doing this for functions generally (#3292), but nobody has got to it yet.
For percentiles, I have an open PR at #3901. I’d welcome any feedback on that as I’m pretty new to dask.
It looks like dask.array.median
is using numpy.median
under the hood, so doesn't respect masks:
import numpy.ma as ma
import dask.array as da
arr = ma.array(range(4), mask=[0,0,0,1])
print(ma.median(arr))
larr = da.from_array(arr)
print(da.median(larr, axis=0).compute())
Output:
1.0
1.5
So I think we would need something extra to make lazy median consistent with our existing median aggregator.
Hi @rcomer , I just noticed that a nanmedian
function exists in dask. Would this be a way to make the median operation lazy in iris?
Hi @stefsmeets, yes it looks like that should work in principle. Something like
import numpy as np
import numpy.ma as ma
import dask.array as da
def lazy_median(array, axis):
array = array.astype(np.float_)
nan_array = da.ma.filled(array, np.nan)
median = da.nanmedian(nan_array, axis)
return da.ma.fix_invalid(median)
arr = ma.array(range(4), mask=[0,0,0,1])
print(ma.median(arr))
larr = da.from_array(arr)
print(lazy_median(larr, axis=0).compute())
Output:
1.0
1.0
Though I am very much not an expert. Maybe @pp-mo has thoughts on this.
Related: #3292
@fnattino: This issue could be interesting for you
#5066 updated all the aggregator docstrings to indicate which are lazy. PERCENTILE has been lazy since v3.3. Is there still appetite to work on MEDIAN? Obviously you can just use the 50th percentile but perhaps there is an advantage to the separate median function.
@SciTools/peloton let's wait till 2023-01 if anyone really wants MEDIAN to be lazy. IF not, add a doc note to say "it's not lazy, but percentile is"