StatsBase.jl Fused computation of statistics

I have been considering a uniform interface for computing multiple statistics all at once, while allowing they share part of the computation.

Consider the following example. We want to compute sum, mean, var, and std from x:

s, m, v, sd = sum(x), mean(x), var(x), std(x)

This clearly would waste a lot of computation (e.g it actually computes sum four times, mean three times, and variance twice).

A more efficient way would be

s = sum(x)
m = s / length(x)
v = varm(x, m)
sd = sqrt(v)

This is more efficient, but not as concise and convenient.

I am considering the following way:

s, m, v, sd = stats(x, (sum_, mean_, var_, std_))

Internally, it should find an efficient routine that computes them altogether. Here, sum_ and mean_ are typed indicators defined as

type Sum_ end
type Mean_ end
type Var_ end
type Std_ end

const sum_ = Sum_()
const mean_ = Mean_()
const var_ = Var_()
const std_ = Std_()

Different combinations of statistics are different tuple types, and therefore we can leverage Julia's multiple dispatch mechanism to choose the optimal computation paths.

This is not urgent, but would be really nice to have. I am not going to implement this in near future. Just open this thread to collect ideas, suggestions, and opinions.

Jun 23 '14 15:06 lindahua

Another way is to use macros:

s, m, v, sd = @stats(x, (sum, mean, var, std))

Instead of using multiple dispatch (which would grow exponentially as the tuple size increase), the macro can generate the optimal codes at compile-time.

And we don't have to define a bunch of things like sum_, mean_, etc.

Jun 23 '14 15:06 lindahua

I agree we need something like this, it would be even better if we could tie it into the sufficient statistics stuff in Distributions.

Jun 23 '14 18:06 simonbyrne

Good idea (better than mean_and_std :-). The macro solution looks the best to me.

Jun 23 '14 19:06 nalimilan

This would be great to have. Similar issues come in HypothesisTests.jl, where many statistics require the calculation of other simpler statistics.

Jun 24 '14 15:06 johnmyleswhite

How about this idea: http://en.wikipedia.org/wiki/Incremental_computing

So we could have an increment computing model that is initialized on the data set. Statistics on this model would then cache any reusable intermediate values, so that subsequent requests on those values or other statistics that depend on them just get the cached values from the model, speeding up computations. This way we are not limited just on the few standard functions.

The code may look something like (very hand-wavy)

model = IC_single_factor_model()
setdata!( model, x )
mean = get( model, :mean ) # this depends on sum, so
sum = get( model, :sum)  # this becomes constant time
std = get( model, :std ) # this requires another O(n) sum of squares, but O(1) cached lookup of sum.
var = get( model, :var ) # but at this point var becomes constant time

With all these flexibilities, these are the prices we'd pay (cons):

the caching mechanism presumably forces us to store intermediate values in a dictionary at runtime, not variables that may be optimized into registers at compile time, when the range of stats needed are known to be small.
the incremental computing model needs to parse what we want done with statistics with those data, in order to properly generate the dependency graph. This feels like duplicated coding efforts for the sake of a constant factor improvement (although this constant factor can be significant, as data size grows).
aesthetically, the lookup may seem a bit unnatural, but it can be hidden via macros.

Dec 20 '14 10:12 tonyhffong

Status of this?

Jun 19 '19 22:06 Nosferican

Note that we can now use s, m, v, sd = stats(x, (sum, mean, var, std)) since functions have their own types.

Jun 20 '19 07:06 nalimilan

This isn't implemented, my last comment just meant that we can now use (sum, mean) instead of (:sum, :mean).

Oct 28 '23 17:10 nalimilan

Thanks, sorry about that.

Oct 28 '23 18:10 ParadaCarleton

I'll note that this can be done automatically using the lazy computation provided by something like Transducers.jl, or graph reduction algorithms in general.

Oct 29 '23 16:10 ParadaCarleton