StatsBase.jl icon indicating copy to clipboard operation
StatsBase.jl copied to clipboard

Array Weights

Open ParadaCarleton opened this issue 4 years ago • 5 comments

Currently, weights can only be vectors, but there's no reason to rule out array-variate weights. I came across this problem while trying to generalize MCMCChains.jl to allow for weighted samples, but having weighted matrices isn't that uncommon when working with panel data, where the probability of follow-up can differ over time. What would we need to get weighted matrices working?

ParadaCarleton avatar Dec 10 '21 20:12 ParadaCarleton

How would the stats functions typically work for weight matrices? Would it be comparable to broadcasting the weighted stats function across both weights and data (either colwise or rowwise)?

rofinn avatar Dec 22 '21 19:12 rofinn

How would the stats functions typically work for weight matrices? Would it be comparable to broadcasting the weighted stats function across both weights and data (either colwise or rowwise)?

Yes; the mean across, say, the first row would just be the same as taking the weighted mean of that row, using the weights in the first row of the weight matrix.

ParadaCarleton avatar Dec 23 '21 00:12 ParadaCarleton

Could you show a concrete example where you would use weight matrices?

In general I think it could make sense to allow any weight arrays, but that would require carefully checking that all functions taking AbstractWeights are correctly written to handle them. Notably, they should check that data and weights have identical shapes to avoid incorrect or surprising behaviors (e.g. with data=[1, 2, 3, 4] and weights=[1 2; 3 4], which should probably not be allowed).

Also note that currently weight vectors are automatically broadcasted when computing e.g. the mean over dimensions:

julia> mean(rand(3, 4), weights(rand(4)), dims=2)
3×1 Matrix{Float64}:
 0.6384569311824672
 0.3961733566525447
 0.6022251466078608

If we allow weight matrices, we will have to make sure they don't interfere with this behavior.

nalimilan avatar Jan 13 '22 14:01 nalimilan

Could you show a concrete example where you would use weight matrices?

The current use case is MCMC samples where I have a set of weights that can vary across multiple dimensions (namely I want weight[2,4] to represent the weight of the second sample of the 4th chain in a collection).

A more general use case is any study design where you have unequal probabilities of sampling the same person across time for followup -- for instance, we might oversample participants of one group by a factor of 2 at the start, but follow up with them increasingly often as time goes on (if we expect this group to have higher dropout rates, requiring us to sample them more frequently to get enough data).

ParadaCarleton avatar Jan 13 '22 17:01 ParadaCarleton

My 2 cents: One fairly common case where this would be useful is for spatial data, e.g., variables of a climate model.

For example, if I want to take the zonal mean of some variable (e.g., surface temperature, a 2D array), weighted by the area of each grid cell (thus with weights as a 2D array), I'd like to do

julia> mean(temperature, weights(area), dims=1)

but this currently throws with temperature = rand(2,2) and area = rand(2,2):

ERROR: DimensionMismatch: Inconsistent array dimension.
Stacktrace:
 [1] #wsum!#11
   @ ~/.julia/packages/StatsBase/XgjIN/src/weights.jl:589 [inlined]
 [2] #sum!#12
   @ ~/.julia/packages/StatsBase/XgjIN/src/weights.jl:615 [inlined]
 [3] sum!
   @ ~/.julia/packages/StatsBase/XgjIN/src/weights.jl:615 [inlined]
 [4] _mean!(R::Matrix{Float64}, A::Matrix{Float64}, w::Weights{Float64, Float64, Vector{Float64}}, dims::Int64)
   @ StatsBase ~/.julia/packages/StatsBase/XgjIN/src/weights.jl:644
 [5] _mean
   @ ~/.julia/packages/StatsBase/XgjIN/src/weights.jl:669 [inlined]
 [6] #mean#15
   @ ~/.julia/packages/StatsBase/XgjIN/src/weights.jl:665 [inlined]
 [7] kwcall(::NamedTuple{(:dims,), Tuple{Int64}}, ::typeof(mean), A::Matrix{Float64}, w::Weights{Float64, Float64, Vector{Float64}})
   @ StatsBase ~/.julia/packages/StatsBase/XgjIN/src/weights.jl:665
 [8] top-level scope
   @ REPL[400]:1 

briochemc avatar Jan 30 '23 01:01 briochemc