awkward Row-wise weighted mean gives incorrect results

Version of Awkward Array

2.6.9

Description and code to reproduce

I am doing a row-wise weighted mean in an awkward array and getting wrong results.

Here's an MRE with outputs in the comments:

import awkward as ak

data = ak.Array(
    [
        [1, 2, 3],
        [4, 5],
    ]
)

weight = ak.Array(
    [
        [1, 1, 2],
        [1, 10],
    ]
)

# manual row-by-row - expected results
print(ak.mean(data[0], weight=weight[0]))  # -> 2.25
print(ak.mean(data[1], weight=weight[1]))  # -> 4.909090909090909

# manual vectorized - expected results
weights_norm = weight / ak.sum(weight, axis=1)
print(ak.sum(weights_norm * data, axis=1))  # -> [2.25, 4.91]

# the most natural call I expected to work - incorrect result in the 2nd row
print(ak.mean(data, weight=weight, axis=1))  # -> [2.25, 13.5]

Oct 17 '24 17:10 nj-vs-vh

Also, it might be interesting that on a large dataset the manual vectorized operation (normalize weights by their row-wise sum, multiply data by them and sum row-wise) is much faster compared with ak.mean(data, weight, axis=1). For my dataset the latter is ~10 sec and the former is <0.1 sec.

Oct 17 '24 17:10 nj-vs-vh

Thanks for submitting the issue! It's definitely a bug. I can reproduce it.

I assume the expected behavior is as from Numpy's average. Here is a Numpy reproducer using numpy.average:

np_data = np.array([[1, 2, 3], [4, 5, 6]])
np_weight = np.array([[1, 1, 2], [1, 10, 0]])
np.average(np_data, axis=1, weights=np_weight) # array([2.25      , 4.90909091])

The difference between ak.mean and np.average define weights differently:

Numpy weights have to be the same shape:

weights : array_like, optional
        An array of weights associated with the values in `a`. Each value in
        `a` contributes to the average according to its associated weight.
        The array of weights must be the same shape as `a` if no axis is
        specified, otherwise the weights must have dimensions and shape
        consistent with `a` along the specified axis.

in Awkward:

weight – Data that can be broadcasted to x to give each value a weight. 
        Weighting values equally is the same as no weights; weighting some values 
        higher increases the significance of those values. Weights can be zero or negative.

Both functions agree on Awkward definition:

Passing all arguments to the reducers, the mean is calculated as

        ak.sum(x*weight) / ak.sum(weight)

Nov 01 '24 15:11 ianna

Hey, thanks for the confirmation!

I tinkered with the implementation a bit and it seems like the issue is around keepdims argument in this call. If I change it to True, my example works fine (other uses might break though!).

I could try to make a complete PR if you (@ianna) or other maintainers greenlight this as a first-time issue.

Nov 04 '24 10:11 nj-vs-vh

Thanks, @nj-vs-vh ! Indeed, if keepdims is set to True the result looks correct:

>>> ak.mean(arr, weight=weight, axis=1, keepdims=True)
<Array [[2.25], [4.91]] type='2 * 1 * float64'>

If keepdims set to False the result is as follows:

>>> ak.mean(arr, weight=weight, axis=1, keepdims=False)
<Array [2.25, 13.5] type='2 * float64'>

while, I think, what is expected is:

>>> ak.mean(arr, weight=weight, axis=1, keepdims=True)
<Array [[2.25], [4.91]] type='2 * 1 * float64'>
>>> ak.flatten(ak.mean(arr, weight=weight, axis=1, keepdims=True))
<Array [2.25, 4.91] type='2 * float64'>

The ak.mean function without weight produces correct result for either of keepdims (False is a default):

>>> ak.mean(arr, axis=1, keepdims=True)
<Array [[2], [4.5]] type='2 * 1 * float64'>
>>> ak.mean(arr, axis=1)
<Array [2, 4.5] type='2 * float64'>
>>> ak.mean(arr, axis=1, keepdims=False)
<Array [2, 4.5] type='2 * float64'>

Nov 04 '24 13:11 ianna

Thanks, @nj-vs-vh ! Indeed, if keepdims is set to True the result looks correct:

>>> ak.mean(arr, weight=weight, axis=1, keepdims=True)
<Array [[2.25], [4.91]] type='2 * 1 * float64'>

If keepdims set to False the result is as follows:

>>> ak.mean(arr, weight=weight, axis=1, keepdims=False)
<Array [2.25, 13.5] type='2 * float64'>

while, I think, what is expected is:

>>> ak.mean(arr, weight=weight, axis=1, keepdims=True)
<Array [[2.25], [4.91]] type='2 * 1 * float64'>
>>> ak.flatten(ak.mean(arr, weight=weight, axis=1, keepdims=True))
<Array [2.25, 4.91] type='2 * float64'>

The ak.mean function without weight produces correct result for either of keepdims (False is a default):

>>> ak.mean(arr, axis=1, keepdims=True)
<Array [[2], [4.5]] type='2 * 1 * float64'>
>>> ak.mean(arr, axis=1)
<Array [2, 4.5] type='2 * float64'>
>>> ak.mean(arr, axis=1, keepdims=False)
<Array [2, 4.5] type='2 * float64'>

It might be sufficient to set keepdims=True here: https://github.com/scikit-hep/awkward/blob/main/src/awkward/operations/ak_mean.py#L228?

One has to be careful here with broadcasting that happens inside of the ak.mean implementation. The same goes for ak.corr and ak.covar that probably need to be updated as well.

My understanding is: the correct implementation policy would be keepdims=True (in order to have correct broadcasting) and then adjust the dimension for the output array (see: https://github.com/scikit-hep/awkward/blob/main/src/awkward/operations/ak_mean.py#L256-L264) - this is something that ak.corr and ak.covar should do at some point aswell.

Nov 04 '24 14:11 pfackeldey

Thanks, @nj-vs-vh ! Indeed, if keepdims is set to True the result looks correct:
>>> ak.mean(arr, weight=weight, axis=1, keepdims=True)
<Array [[2.25], [4.91]] type='2 * 1 * float64'>
If keepdims set to False the result is as follows:
>>> ak.mean(arr, weight=weight, axis=1, keepdims=False)
<Array [2.25, 13.5] type='2 * float64'>
while, I think, what is expected is:
>>> ak.mean(arr, weight=weight, axis=1, keepdims=True)
<Array [[2.25], [4.91]] type='2 * 1 * float64'>
>>> ak.flatten(ak.mean(arr, weight=weight, axis=1, keepdims=True))
<Array [2.25, 4.91] type='2 * float64'>
The ak.mean function without weight produces correct result for either of keepdims (False is a default):
>>> ak.mean(arr, axis=1, keepdims=True)
<Array [[2], [4.5]] type='2 * 1 * float64'>
>>> ak.mean(arr, axis=1)
<Array [2, 4.5] type='2 * float64'>
>>> ak.mean(arr, axis=1, keepdims=False)
<Array [2, 4.5] type='2 * float64'>
It might be sufficient to set keepdims=True here: https://github.com/scikit-hep/awkward/blob/main/src/awkward/operations/ak_mean.py#L228?

One has to be careful here with broadcasting that happens inside of the ak.mean implementation. The same goes for ak.corr and ak.covar that probably need to be updated as well.

My understanding is: the correct implementation policy would be keepdims=True (in order to have correct broadcasting) and then adjust the dimension for the output array (see: https://github.com/scikit-hep/awkward/blob/main/src/awkward/operations/ak_mean.py#L256-L264) - this is something that ak.corr and ak.covar should do at some point aswell.

what about mask_identity?

Nov 04 '24 14:11 ianna