StatsBase.jl

Weighted Arrays?

Open ParadaCarleton opened this issue 3 years ago • 2 comments

I've been considering this for a while. Would it make sense to define a new struct, a weighted_array, which contains both an array and a set of weights? The primary advantages are as follows:

  1. The weighted array can be stored contiguously in memory as an array of (element, weight) tuples. Weights and array elements are almost always accessed together, so this allows for faster access.
  2. Allows weighted_arrays to be passed as a single argument in place of an array.
  3. The user can conveniently manipulate weights together with observations. For example, dropping missing values would also automatically drop the weights associated with them. (The old interface can also be kept.)
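
To make the proposal concrete, here is a minimal sketch of what such a type could look like. Nothing here exists in StatsBase: the names `WeightedArray`, `drop_missing`, and `wmean` are hypothetical and chosen only for illustration, and the constructor copies its inputs into a vector of (element, weight) tuples as described in point 1.

```julia
# Hypothetical sketch: `WeightedArray` is NOT a StatsBase type.
# It stores (element, weight) pairs contiguously, as suggested in point 1.
struct WeightedArray{T,W<:Real} <: AbstractVector{Tuple{T,W}}
    data::Vector{Tuple{T,W}}
end

# Build from separate values and weights (this copies both inputs).
function WeightedArray(x::AbstractVector{T}, w::AbstractVector{W}) where {T,W<:Real}
    length(x) == length(w) ||
        throw(DimensionMismatch("values and weights differ in length"))
    return WeightedArray{T,W}(Tuple{T,W}[(xi, wi) for (xi, wi) in zip(x, w)])
end

# Minimal AbstractVector interface.
Base.size(a::WeightedArray) = size(a.data)
Base.getindex(a::WeightedArray, i::Int) = a.data[i]
Base.IndexStyle(::Type{<:WeightedArray}) = IndexLinear()

# Dropping missing observations also drops their weights (point 3).
function drop_missing(a::WeightedArray{T,W}) where {T,W}
    S = nonmissingtype(T)
    kept = Tuple{S,W}[(v, w) for (v, w) in a.data if !ismissing(v)]
    return WeightedArray{S,W}(kept)
end

# Weighted mean that takes the weighted array as a single argument (point 2).
function wmean(a::WeightedArray)
    num = sum(v * w for (v, w) in a.data)
    den = sum(w for (_, w) in a.data)
    return num / den
end

x = [1.0, missing, 3.0]
w = [0.5, 2.0, 1.5]
wa = drop_missing(WeightedArray(x, w))
wmean(wa)  # (1.0 * 0.5 + 3.0 * 1.5) / (0.5 + 1.5) = 2.5
```

The missing observation and its weight of 2.0 are removed together, so the caller never has to keep the two vectors in sync by hand.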

ParadaCarleton, Mar 23 '22 15:03

There's been some discussion about something similar at https://github.com/JuliaLang/julia/pull/33310#issuecomment-546978274 (and following comments) and https://github.com/JuliaLang/Statistics.jl/issues/88. It could be an interesting alternative to passing weights as a separate argument. But I find the syntax a bit weird with functions that take several arguments, like cor(weighted(w, x), weighted(w, y)) or (more compact but weirder) cor(weighted(w, x, y)), and ideally we want a consistent syntax for single- and multiple-argument functions. Of course we could support two different syntaxes, but for now I'd rather focus on getting a single syntax to work correctly in all cases (notably when skipping missing values).

Regarding performance, it would probably not be faster:

  • It's not clear that it would be faster to store (element, weight) tuples. When processing arrays in loops, AFAIK it's easier to get the compiler to use SIMD instructions when working on two separate arrays. And if you combine e.g. an Int8 value with a Float64 weight, each tuple has to be padded so that all elements stay aligned, which takes more memory than two separate arrays:

    julia> Base.summarysize(fill((Int8(1), 1.0), 10_000))
    160040

    julia> Base.summarysize(fill(Int8(1), 10_000)) + Base.summarysize(fill(1.0, 10_000))
    90080

  • Even if it were faster, having weighted_array copy the values and weights to allocate a vector of (element, weight) tuples would be prohibitively slow if you need to compute weighted stats on different variables. That said, we could implement such a wrapper as a view of the inputs (like AbstractWeights currently does).
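
A view-style wrapper of the kind mentioned in the last point could be sketched roughly as follows. This is only an illustration under stated assumptions: `WeightedView`, `weighted`, and `wmean` are hypothetical names, not StatsBase API. The wrapper keeps references to the two original arrays rather than copying them into tuples, so the loop still runs over two separate arrays, which is the layout the first point suggests may be easier for the compiler to vectorize.

```julia
# Hypothetical sketch: `WeightedView` and `weighted` are illustrative names,
# NOT part of StatsBase.
struct WeightedView{T,W<:Real,X<:AbstractVector{T},V<:AbstractVector{W}}
    vals::X  # reference to the caller's data, no copy
    wts::V   # reference to the caller's weights, no copy
    function WeightedView(x::X, w::V) where {T,W<:Real,X<:AbstractVector{T},V<:AbstractVector{W}}
        axes(x) == axes(w) ||
            throw(DimensionMismatch("values and weights must have the same axes"))
        return new{T,W,X,V}(x, w)
    end
end

# Argument order follows the `weighted(w, x)` spelling used in the discussion.
weighted(w, x) = WeightedView(x, w)

# Weighted mean that loops over the two underlying arrays separately.
function wmean(a::WeightedView)
    x, w = a.vals, a.wts
    num = zero(eltype(x)) * zero(eltype(w))  # accumulator in the product type
    den = zero(eltype(w))
    @inbounds @simd for i in eachindex(x, w)
        num += x[i] * w[i]
        den += w[i]
    end
    return num / den
end

x = Int8[1, 2, 3, 4]
w = [1.0, 1.0, 2.0, 4.0]
wx = weighted(w, x)
wx.vals === x  # true: the wrapper shares the original array
wmean(wx)      # (1 + 2 + 6 + 16) / 8 = 3.125
```

Because construction only stores two references, wrapping the same weights around many different variables costs no extra allocation for the data itself.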

nalimilan, Mar 26 '22 13:03