Row-wise weighted mean gives incorrect results
Version of Awkward Array
2.6.9
Description and code to reproduce
I am doing a row-wise weighted mean in an awkward array and getting wrong results.
Here's an MRE with outputs in the comments:
import awkward as ak
data = ak.Array(
[
[1, 2, 3],
[4, 5],
]
)
weight = ak.Array(
[
[1, 1, 2],
[1, 10],
]
)
# manual row-by-row - expected results
print(ak.mean(data[0], weight=weight[0])) # -> 2.25
print(ak.mean(data[1], weight=weight[1])) # -> 4.909090909090909
# manual vectorized - expected results
weights_norm = weight / ak.sum(weight, axis=1)
print(ak.sum(weights_norm * data, axis=1)) # -> [2.25, 4.91]
# the most natural call I expected to work - incorrect result in the 2nd row
print(ak.mean(data, weight=weight, axis=1)) # -> [2.25, 13.5]
Also, it might be interesting that on a large dataset the manual vectorized operation (normalize weights by their row-wise sum, multiply data by them and sum row-wise) is much faster compared with ak.mean(data, weight, axis=1). For my dataset the latter is ~10 sec and the former is <0.1 sec.
Thanks for submitting the issue! It's definitely a bug. I can reproduce it.
I assume the expected behavior is as from Numpy's average. Here is a Numpy reproducer using numpy.average:
np_data = np.array([[1, 2, 3], [4, 5, 6]])
np_weight = np.array([[1, 1, 2], [1, 10, 0]])
np.average(np_data, axis=1, weights=np_weight) # array([2.25 , 4.90909091])
The difference between ak.mean and np.average define weights differently:
Numpy weights have to be the same shape:
weights : array_like, optional
An array of weights associated with the values in `a`. Each value in
`a` contributes to the average according to its associated weight.
The array of weights must be the same shape as `a` if no axis is
specified, otherwise the weights must have dimensions and shape
consistent with `a` along the specified axis.
in Awkward:
weight – Data that can be broadcasted to x to give each value a weight.
Weighting values equally is the same as no weights; weighting some values
higher increases the significance of those values. Weights can be zero or negative.
Both functions agree on Awkward definition:
Passing all arguments to the reducers, the mean is calculated as
ak.sum(x*weight) / ak.sum(weight)
Hey, thanks for the confirmation!
I tinkered with the implementation a bit and it seems like the issue is around keepdims argument in this call. If I change it to True, my example works fine (other uses might break though!).
I could try to make a complete PR if you (@ianna) or other maintainers greenlight this as a first-time issue.
Thanks, @nj-vs-vh ! Indeed, if keepdims is set to True the result looks correct:
>>> ak.mean(arr, weight=weight, axis=1, keepdims=True)
<Array [[2.25], [4.91]] type='2 * 1 * float64'>
If keepdims set to False the result is as follows:
>>> ak.mean(arr, weight=weight, axis=1, keepdims=False)
<Array [2.25, 13.5] type='2 * float64'>
while, I think, what is expected is:
>>> ak.mean(arr, weight=weight, axis=1, keepdims=True)
<Array [[2.25], [4.91]] type='2 * 1 * float64'>
>>> ak.flatten(ak.mean(arr, weight=weight, axis=1, keepdims=True))
<Array [2.25, 4.91] type='2 * float64'>
The ak.mean function without weight produces correct result for either of keepdims (False is a default):
>>> ak.mean(arr, axis=1, keepdims=True)
<Array [[2], [4.5]] type='2 * 1 * float64'>
>>> ak.mean(arr, axis=1)
<Array [2, 4.5] type='2 * float64'>
>>> ak.mean(arr, axis=1, keepdims=False)
<Array [2, 4.5] type='2 * float64'>
Thanks, @nj-vs-vh ! Indeed, if
keepdimsis set toTruethe result looks correct:>>> ak.mean(arr, weight=weight, axis=1, keepdims=True) <Array [[2.25], [4.91]] type='2 * 1 * float64'>If
keepdimsset toFalsethe result is as follows:>>> ak.mean(arr, weight=weight, axis=1, keepdims=False) <Array [2.25, 13.5] type='2 * float64'>while, I think, what is expected is:
>>> ak.mean(arr, weight=weight, axis=1, keepdims=True) <Array [[2.25], [4.91]] type='2 * 1 * float64'> >>> ak.flatten(ak.mean(arr, weight=weight, axis=1, keepdims=True)) <Array [2.25, 4.91] type='2 * float64'>The
ak.meanfunction withoutweightproduces correct result for either ofkeepdims(Falseis a default):>>> ak.mean(arr, axis=1, keepdims=True) <Array [[2], [4.5]] type='2 * 1 * float64'> >>> ak.mean(arr, axis=1) <Array [2, 4.5] type='2 * float64'> >>> ak.mean(arr, axis=1, keepdims=False) <Array [2, 4.5] type='2 * float64'>
It might be sufficient to set keepdims=True here: https://github.com/scikit-hep/awkward/blob/main/src/awkward/operations/ak_mean.py#L228?
One has to be careful here with broadcasting that happens inside of the ak.mean implementation. The same goes for ak.corr and ak.covar that probably need to be updated as well.
My understanding is: the correct implementation policy would be keepdims=True (in order to have correct broadcasting) and then adjust the dimension for the output array (see: https://github.com/scikit-hep/awkward/blob/main/src/awkward/operations/ak_mean.py#L256-L264) - this is something that ak.corr and ak.covar should do at some point aswell.
Thanks, @nj-vs-vh ! Indeed, if
keepdimsis set toTruethe result looks correct:>>> ak.mean(arr, weight=weight, axis=1, keepdims=True) <Array [[2.25], [4.91]] type='2 * 1 * float64'>If
keepdimsset toFalsethe result is as follows:>>> ak.mean(arr, weight=weight, axis=1, keepdims=False) <Array [2.25, 13.5] type='2 * float64'>while, I think, what is expected is:
>>> ak.mean(arr, weight=weight, axis=1, keepdims=True) <Array [[2.25], [4.91]] type='2 * 1 * float64'> >>> ak.flatten(ak.mean(arr, weight=weight, axis=1, keepdims=True)) <Array [2.25, 4.91] type='2 * float64'>The
ak.meanfunction withoutweightproduces correct result for either ofkeepdims(Falseis a default):>>> ak.mean(arr, axis=1, keepdims=True) <Array [[2], [4.5]] type='2 * 1 * float64'> >>> ak.mean(arr, axis=1) <Array [2, 4.5] type='2 * float64'> >>> ak.mean(arr, axis=1, keepdims=False) <Array [2, 4.5] type='2 * float64'>It might be sufficient to set
keepdims=Truehere: https://github.com/scikit-hep/awkward/blob/main/src/awkward/operations/ak_mean.py#L228?One has to be careful here with broadcasting that happens inside of the
ak.meanimplementation. The same goes forak.corrandak.covarthat probably need to be updated as well.My understanding is: the correct implementation policy would be
keepdims=True(in order to have correct broadcasting) and then adjust the dimension for the output array (see: https://github.com/scikit-hep/awkward/blob/main/src/awkward/operations/ak_mean.py#L256-L264) - this is something thatak.corrandak.covarshould do at some point aswell.
what about mask_identity?