custom statistic to return a tuple rather than a scalar

Open berndbecker opened this issue 5 years ago • 13 comments

✨ Feature Request

Make a custom statistic return a tuple rather than a scalar. The aim: store a vector of threshold exceedances of increasing duration at each grid point in lieu of the time domain (the vector is much shorter).

In the example https://scitools-iris.readthedocs.io/en/stable/generated/gallery/general/plot_custom_aggregation.html a single number is returned at each grid point. I am after functionality that returns more than one value for each grid point.

Motivation

Not sure if this is an issue, but I have colleagues who calculated threshold exceedance durations at great pains. Feedback on my request from an AVD surgery also pointed to heightened frustration at how complicated "this" is. By "this" I mean doing something on a time series stored at each grid point (3-D cube) and retaining a set of numbers rather than collapsing the time dimension to just one number (max, min, mean).

I'm always frustrated when something is almost doable but does not quite work and you have to go all the way back and do it with a sledge hammer.

Additional context

I need a push to understand custom statistics better.

In the attached example (run with module load scitools/experimental-current, python /net/home/h02/frtm/prog/wcssp/wcssp5/scripts/ts_exceedance.py)

I am compiling a threshold exceedance duration, or survival function, for rainfall time series: how many rainy periods were longer than 1, 2, ... 5 and so on days? This works for a demonstrator on a single time series.
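
For readers without access to the attached script, the single-series calculation described above might look roughly like the following. This is a minimal NumPy sketch, not the code from ts_exceedance.py: the function names `spell_lengths` and `survivors`, the sample rainfall values, and the choice to count spells lasting "at least d" time steps (rather than strictly longer than d) are all assumptions made for illustration.

```python
import numpy as np

def spell_lengths(series, threshold):
    """Return the lengths of consecutive runs where series > threshold."""
    wet = np.asarray(series) > threshold
    # Pad with False on both sides so every run has a start and an end edge.
    padded = np.concatenate(([False], wet, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)   # dry -> wet transitions
    ends = np.flatnonzero(edges == -1)    # wet -> dry transitions
    return ends - starts

def survivors(series, threshold, durations):
    """Count the runs lasting at least d time steps, for each d in durations."""
    lengths = spell_lengths(series, threshold)
    return np.array([(lengths >= d).sum() for d in durations])

# Hypothetical daily rainfall series with wet spells of length 2, 1 and 3.
rain = [0.0, 2.1, 3.0, 0.0, 5.2, 0.0, 1.5, 4.0, 2.2, 0.0]
counts = survivors(rain, threshold=1.0, durations=[1, 2, 3, 4, 5])
print(counts)
```

The result always has one entry per requested duration, whatever the length of the input series, which is what makes it a candidate for replacing the time dimension.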

Next I would like to run the same custom statistic at each grid point as in https://scitools.org.uk/iris/docs/latest/examples/General/custom_aggregation.html#general-custom-aggregation

But I struggle to understand the shape of the data being passed to the aggregator: what should axis be?
And I have no idea how to store the survivors vector over the time series dimension.

But I am convinced it is not really that difficult.

berndbecker avatar Oct 07 '20 14:10 berndbecker

Possibly related: #3810 #3331

rcomer avatar Oct 07 '20 14:10 rcomer

So, if I've understood, you start with a cube that is (time, latitude, longitude), and you want to end up with a cube that is (durations, latitude, longitude), having done your calculation over time at each grid point. The problem is that the standard iris Aggregator class is designed to reduce the dimensionality down to just (latitude, longitude) when used with collapsed.
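
The shape change described here, (time, latitude, longitude) to (durations, latitude, longitude), can be sketched on a plain NumPy array. This is only an illustration under stated assumptions: it does not use the Iris Aggregator machinery, `count_spells` and `survivors_along_axis` are hypothetical names, and `axis` is simply the position of the time dimension in the array. `np.apply_along_axis` replaces the chosen axis with the shape of whatever the 1-D function returns.

```python
import numpy as np

def count_spells(series, durations, threshold):
    """For one 1-D series, count runs above threshold lasting at least d steps."""
    wet = series > threshold
    padded = np.concatenate(([False], wet, [False]))
    edges = np.diff(padded.astype(int))
    lengths = np.flatnonzero(edges == -1) - np.flatnonzero(edges == 1)
    return np.array([(lengths >= d).sum() for d in durations])

def survivors_along_axis(data, axis, durations, threshold):
    """Apply count_spells along `axis`; that axis becomes a 'durations' axis."""
    return np.apply_along_axis(count_spells, axis, data, durations, threshold)

# Hypothetical (time=6, latitude=2, longitude=2) array, dry everywhere except
# at two grid points.
data = np.zeros((6, 2, 2))
data[:, 0, 0] = [2, 2, 0, 2, 2, 2]   # spells of length 2 and 3
data[:, 1, 1] = [2, 0, 2, 0, 2, 0]   # three spells of length 1

result = survivors_along_axis(data, axis=0, durations=[1, 2, 3], threshold=1.0)
print(result.shape)  # (3, 2, 2): durations, latitude, longitude
```

A Python-level loop over grid points like this is slow on large grids; it shows the intended shapes rather than an efficient implementation.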

We do have the PercentileAggregator class, which has the capacity to add a "percent" dimension if you want to calculate more than one percentile. So we know that it is possible to add dimensions. That class is hard-coded to calculate percentiles though so, if you wanted to make use of it to calculate some other dimension-adding statistic, I think you'd need to subclass it. It also isn't even listed in the docs.

So possibly what we need here is to generalise PercentileAggregator into a class that could create dimension-adding aggregators based on user-defined functions.

rcomer avatar Oct 08 '20 17:10 rcomer

Having said that, this particular statistic presumably needs information from the time coordinate. I think all the existing aggregation calculations only use the cube data. 🤔

rcomer avatar Oct 08 '20 18:10 rcomer

The threshold exceedance duration can live without information from the time coordinate for the time being. The PercentileAggregator would deliver what I expected, for starters, to be an easy operation. For a later generalization, a more complex combination of metadata is a possibility, but that can wait.

berndbecker avatar Oct 09 '20 09:10 berndbecker

Perhaps it is easier if the shape of the tuple to be returned is set at the beginning. I.e. it could be the list of linear regression coefficients, the first 4 moments of a normal distribution, the list of percentiles as in the PercentileAggregator, or a list of durations in time units.
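
As one illustration of a statistic whose output length is known in advance, the sketch below fits a straight line through the time series at every grid point and returns two values per point (slope and intercept). It is plain NumPy, not an Iris aggregator; the function name `linear_fit_per_point` and the synthetic data are assumptions. It relies on `np.polyfit` accepting a 2-D `y` and fitting each column independently.

```python
import numpy as np

def linear_fit_per_point(data, times):
    """Fit y = slope * t + intercept along axis 0 of `data` at every grid point.

    Returns an array of shape (2, ...) holding the slopes first, then the
    intercepts, so the output length is fixed regardless of the series length.
    """
    n_time = data.shape[0]
    flat = data.reshape(n_time, -1)            # (time, n_points)
    coeffs = np.polyfit(times, flat, deg=1)    # (2, n_points)
    return coeffs.reshape((2,) + data.shape[1:])

# Synthetic (time=5, 2, 2) data with a known slope per point and intercept 3.
times = np.arange(5.0)
true_slopes = np.array([[1.0, 2.0], [0.0, -1.0]])
data = true_slopes[None] * times[:, None, None] + 3.0

fit = linear_fit_per_point(data, times)
print(fit.shape)  # (2, 2, 2): coefficient, then the two grid dimensions
```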

berndbecker avatar Oct 21 '20 10:10 berndbecker

@rcomer Fancy taking this on?

bjlittle avatar Oct 28 '20 10:10 bjlittle

Hey @bjlittle, sorry I think I'd struggle to justify time on this one. My PRs generally fall into two categories:

  • it directly affects my (or someone in my group's) work
  • it's small enough to do "in the margins", so don't need to justify the time

While this one doesn't look huge, it looks like more than a 5 min job.

rcomer avatar Oct 28 '20 11:10 rcomer

"So possibly what we need here is to generalise PercentileAggregator into a class that could create dimension-adding aggregators based on user-defined functions."

While digging to find something else, I noticed that PercentileAggregator was in fact originally written as AdditiveAggregator but was changed "after review discussion" as part of #1569. So there were reasons to make it specific, but I can't see from that PR what the reasons were.

Here be dragons.

rcomer avatar Aug 23 '21 12:08 rcomer

Note that #3901 also makes changes to the percentile aggregator, so it may be better to wait until that is resolved before starting work on this. Otherwise we could create some nasty code conflicts.

rcomer avatar Sep 22 '21 10:09 rcomer

Hi @berndbecker, sorry for the delay on this - it's both difficult and slightly niche! Is it still something you'd be interested in seeing in Iris?

If you think others would also be interested, we encourage you and them to try out the new voting feature.

trexfeathers avatar Apr 06 '22 09:04 trexfeathers

Hi Martin,

Nice to hear from you! This feature request fits with others working on threshold exceedance, percentiles, etc. So much functionality is nearly there that it could be very rewarding, with some effort, to make this happen.

For now, though, I am working on clustering on single point time series. Dismantling a cube to a single time series, running the clustering and reassembling a cube from the single point results is painful and fraught with error. Having the facility described in #3904 would come in handy here as well.

People are shouting out for something similar here as well: https://web.yammer.com/main/threads/eyJfdHlwZSI6IlRocmVhZCIsImlkIjoiMTYyODUwMzA0OTkwNDEyOCJ9?search=aggregator&groupScope=eyJfdHlwZSI6Ikdyb3VwIiwiaWQiOiIxMDU5MjUyMCJ9

All the best, Bernd.

berndbecker avatar Apr 06 '22 10:04 berndbecker

@wjbenfold has #4676 to implement an aggregator for number of days of data matching certain criteria (e.g. above a threshold), which I think addresses that Yammer thread. However, I think it would only handle a single threshold value at a time.

rcomer avatar Apr 06 '22 12:04 rcomer

I'm currently intending that it can handle being between two thresholds (or any other criterion you can write as a lambda) but only one condition at a time, yes
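
The idea of counting time steps that satisfy a criterion written as a lambda can be illustrated on a plain array as follows. This is a generic NumPy sketch and is not the implementation in #4676; `count_matching` and the sample values are hypothetical.

```python
import numpy as np

def count_matching(data, criterion, axis=0):
    """Count, along `axis`, the elements for which criterion(data) is True."""
    return np.count_nonzero(criterion(data), axis=axis)

# Hypothetical (time=4, 1, 2) temperatures; count the steps between 10 and 20.
temps = np.array([[[5.0, 12.0]],
                  [[15.0, 18.0]],
                  [[25.0, 11.0]],
                  [[19.0, 30.0]]])
n_between = count_matching(temps, lambda v: (v > 10.0) & (v < 20.0), axis=0)
print(n_between)
```

One criterion gives one count per grid point, so covering several thresholds this way means calling the function once per threshold.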

wjbenfold avatar Apr 06 '22 15:04 wjbenfold

I just changed the title to something a bit more general. I actually think there are two different possibilities here for extending the capabilities:

  • firstly, a calculation statistic that returns multiple statistical components
    • in these cases, the cube method (collapsed/aggregated_by/rolling_window) would naturally return multiple cubes instead of one
    • a classic example would be a linear regression operator, which computes "slope" + "intercept" values together
  • secondly, a statistical operation repeated over multiple thresholds, categories, etc
    • in these cases, the result would have an extra dimension -- e.g. threshold, category, histogram-bin
    • as an example, we already have the PERCENTILE operator.
      But we don't have an easy way of creating a custom statistic of this sort.
    • a relevant example that came up lately : calculating frequency of occurrence (over a time period) from category values (over time + locations)
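
The second kind of operation in the list above, where the result gains an extra dimension, can be sketched with the frequency-of-occurrence example. This is a minimal NumPy illustration, not an Iris API: `category_frequency` and the sample category codes are assumptions. The collapsed axis is replaced by a new leading axis with one entry per category.

```python
import numpy as np

def category_frequency(data, categories, axis=0):
    """Fraction of elements along `axis` equal to each category value.

    The result has a new leading 'category' axis in place of `axis`.
    """
    axis = axis % data.ndim  # normalise negative axis indices
    # One boolean mask per category, stacked on a new leading axis.
    matches = np.stack([data == c for c in categories])
    # The collapsed axis moved one place along because of the new leading axis.
    return matches.mean(axis=axis + 1)

# Hypothetical (time=4, 1, 2) array of category codes 0, 1, 2.
codes = np.array([[[0, 1]],
                  [[0, 2]],
                  [[1, 2]],
                  [[0, 2]]])
freq = category_frequency(codes, categories=[0, 1, 2], axis=0)
print(freq.shape)  # (3, 1, 2): category, then the remaining grid dimensions
```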

From an efficiency point of view, it is always possible to make multiple statistical cubes, and use the CubeList.realise_data method to efficiently calculate multiple statistics over the same data. Also, the 'extra dimension' cases can be constructed by creating multiple statistical result cubes, adding a defining scalar coord to each, and merging them into one. But obviously, from a simplicity + convenience PoV this can be improved !!

pp-mo avatar Nov 16 '22 12:11 pp-mo