awkward icon indicating copy to clipboard operation
awkward copied to clipboard

Median

Open mariogeiger opened this issue 4 years ago • 2 comments

Hi :-)

Aim

I would like to compute the median

What I tried

I guess I should be able to do it using a combination of ak.sort and ak.num but I don't manage to do the indexing

mariogeiger avatar Feb 15 '21 13:02 mariogeiger

Median is different from mean, variance, covariances, and even linear fits in that it can't be computed from a small (non-growing) set of numbers. That's what everything in ak.operations.reducers.* have in common. However, this property is only really necessary for parallelizing a calculation, and that's not what you get when you call ak.mean, ak.var, etc., so maybe median does belong in that set.

If you're computing a median at axis=0, i.e. you have a one-dimensional array and you want to get a scalar result for the median, just pass it into the NumPy function, np.median, possibly wrapping the Awkward Array with np.asarray so that NumPy recognizes it as an array. (We're going to find a way to fix that: #630.)

For computing a median at axis=1, I've worked out how you can do it below. After going through it, I can totally sympathize that you had trouble with it! This case is tricky for a number of reasons, and so "median" should probably become a library function.

First, start with a small, synthetic example of what we want to solve. Here's an array with variable-length lists at axis=1, some of which are empty (so that the solution has to deal with that).

array = ak.Array([[2.2, 1.1, 0.0], [], [3.3, 4.4], [5.5], [8.8, 6.6, 7.7, 9.9]])

You're right that the first thing we want to do is sort within each list (i.e. in axis=1).

>>> sorted_array = ak.sort(array, axis=1)
>>> sorted_array.tolist()
[[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]]

We'll see in a moment that the empty list is going to be a problem. We'll want to find the two indexes that pick out the element or two elements that are closest to the center of each list, but the empty list has no indexes that can pick out an element. So we'll keep on hand a version of the sorted_array that doesn't have any empty lists, replacing them with None as a placeholder so that it doesn't further complicate the indexing (that is, every list stays where it is, unaffected by the removal of the empties). The ak.mask function does this, but it has a shorthand syntax through .mask[·].

>>> sorted_without_empties = sorted_array.mask[ak.num(sorted_array) != 0]
>>> sorted_without_empties.tolist()
[[0.0, 1.1, 2.2], None, [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]]

Now: on with the indexing. To compute a median, we want to average the element at floor((N - 1) / 2) and ceil((N - 1) / 2), where N is the length of each list. If N is even, these are the two elements that are equally close to the middle; if N is odd, these are both the same element, exactly in the middle. (Averaging a single number with itself returns itself, so we don't need to introduce an "if" statement, or ak.where to handle both even and odd.)

We can do that, starting with ak.num:

>>> ak.num(sorted_array, axis=1)
<Array [3, 0, 2, 1, 4] type='5 * int64'>
>>> np.floor((ak.num(sorted_array, axis=1) - 1) / 2)
<Array [1, -1, 0, 0, 1] type='5 * float64'>
>>> np.ceil((ak.num(sorted_array, axis=1) - 1) / 2)
<Array [1, -0, 1, 0, 2] type='5 * float64'>

(and you can already see how the indexes for the empty list, -1 and -0, are nonsensical). Although these values are, by construction, integers, they have floating point type and have to be converted to true integers. There's ak.values_astype for that.

>>> low_index = ak.values_astype(np.floor((ak.num(sorted_array, axis=1) - 1) / 2), np.int64)
>>> high_index = ak.values_astype(np.ceil((ak.num(sorted_array, axis=1) - 1) / 2), np.int64)
>>> low_index, high_index
(<Array [1, -1, 0, 0, 1] type='5 * int64'>,
 <Array [1, 0, 1, 0, 2] type='5 * int64'>)

Now we're still not there because these indexes should apply to the second (jagged) dimension, but here they are in the first dimension, as flat arrays. If you try to just slice with this, it will be trying to find the list at index 1, then the list at index -1, etc.

>>> sorted_without_empties[low_index].tolist()
[None, [6.6, 7.7, 8.8, 9.9], [0.0, 1.1, 2.2], [0.0, 1.1, 2.2], None]

There's no error there, but it's not at all what we want. It seems like you ought to be able to do a slice like [:, low_index], but that means requesting indexes 1, -1, 0, 0, 1 from every list.

>>> sorted_without_empties[:, low_index]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.8/site-packages/awkward/highlevel.py", line 1007, in __getitem__
    return ak._util.wrap(self._layout[where], self._behavior)
ValueError: in ListArray64 attempting to get 1, index out of range

That's also not what we want to do (even if it were possible): we want to get 1 from the first list, skip -1 (because that's None), 0 from the third list, etc. For that, we need the low_index and high_index to be jagged arrays of integers with the same lengths at all dimensions but the innermost one. It has to be shaped like [[1], [-1], [0], [0], [1]]. NumPy's np.newaxis does that:

>>> low_index[:, np.newaxis]
<Array [[1], [-1], [0], [0], [1]] type='5 * 1 * int64'>

though it's going to be relevant that this inner dimension has fixed-length 1, rather than a variable length that happens to be 1. Arrays with fixed-length dimensions have to reproduce what NumPy does, which often isn't what we want, so some behaviors hinge on whether a dimension is fixed-length or variable length. For instance, NumPy says that fixed-length dimensions should be broadcasted to the length of the other dimension, which is useful in some cases:

>>> sorted_without_empties[low_index[:, np.newaxis]].tolist()
[[None], [[6.6, 7.7, 8.8, 9.9]], [[0.0, 1.1, 2.2]], [[0.0, 1.1, 2.2]], [None]]

but not ours. Since the behaviors depend on fixed-length versus variable-length dimensions, there are functions for converting between them: ak.to_regular and ak.from_regular.

>>> ak.from_regular(low_index[:, np.newaxis])
<Array [[1], [-1], [0], [0], [1]] type='5 * var * int64'>

This one does exactly what we want.

>>> sorted_without_empties[ak.from_regular(low_index[:, np.newaxis])].tolist()
[[1.1], None, [3.3], [5.5], [7.7]]

Now we can get the low and high values in each list.

>>> low = sorted_without_empties[ak.from_regular(low_index[:, np.newaxis])]
>>> high = sorted_without_empties[ak.from_regular(high_index[:, np.newaxis])]
>>> low, high
(<Array [[1.1], None, [3.3], [5.5], [7.7]] type='5 * option[var * float64]'>,
 <Array [[1.1], None, [4.4], [5.5], [8.8]] type='5 * option[var * float64]'>)

The median is the average of these two.

>>> (low + high) / 2
<Array [[1.1], None, [3.85], [5.5], [8.25]] type='5 * option[var * float64]'>

Or, actually, it's the first element of each of these.

>>> ((low + high) / 2)[:, 0]
<Array [1.1, None, 3.85, 5.5, 8.25] type='5 * ?float64'>

Whew! Given how much work this was, it should be wrapped up as a library function, though it would need to be generalized to apply to any axis (by making slices with slice(None) * ndim + (np.newaxis,), rather than :, np.newaxis, etc.). All of the other functions in ak.operations.reducers.* can handle weights. I'm not really sure what to do with weights in a median...

jpivarski avatar Feb 15 '21 16:02 jpivarski

Thanks for the complete explanation of how to do the indexing properly!

mariogeiger avatar Feb 16 '21 08:02 mariogeiger