
Prevent reducers like ak.sum on records (v1 and v2)

Open jpivarski opened this issue 2 years ago • 7 comments

Version of Awkward Array

HEAD

Description and code to reproduce

@chadfreer just brought this up, but I seem to remember @lgray and/or @nsmith- asking about it a long time ago.

The issue is this: if we have records whose fields represent non-Cartesian data, they should not be naively added (or otherwise "reduced").

>>> ak.sum(ak.Array([[{"rho": 1.1, "phi": -0.1}, {"rho": 2.2, "phi": 0.1}]]), axis=1)
<Array [{rho: 3.3, phi: 0}] type='1 * {"rho": float64, "phi": float64}'>

Knowing that this is a generic array, we can shrug and say, "I guess it would do that. It has no way of knowing what rho and phi mean." But if it were a VectorAwkward2D, we'd really expect it to take the meaning of rho and phi into consideration when it adds them, because that's what + does:

>>> import vector
>>> vector.register_awkward()
>>> (ak.Array([[{"rho": 1.1, "phi": -0.1}]], with_name="Vector2D") +
...  ak.Array([[{"rho": 2.2, "phi": 0.1}]], with_name="Vector2D"))
<VectorArray2D [[{rho: 3.29, phi: 0.0334}]] type='1 * var * Vector2D["rho": floa...'>

When either @lgray or @nsmith- brought this up, I said that we couldn't let third-party libraries override reducers like ak.sum because they are implemented (in Awkward 1.x) in C++, where it would be very difficult to apply overrides implemented in Python. But in Awkward v2, the relevant code is Python and it would be possible.

But there are two issues here:

  1. The fact that Coffea and Vector override + and not ak.sum is dangerously misleading: attempts to ak.sum records should be prevented for exactly the same reason as + is prevented between non-overridden records. That is, apply the rationale behind #457 to reducers, not just ufuncs.
  2. Awkward v2 should provide hooks to override chosen reducers. The syntax could be:
ak._v2.behavior[ak._v2.sum, "VectorArray2D"] = special_function

and the special_function would be invoked somewhere in here, for ListOffsetArray for example. (Although I thought it could be while talking with @chadfreer, this can't be implemented through #1126, since this is a behavior that might be discovered deep inside an array, not at the level where a Python object exists.)
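As a hedged sketch (not an existing Awkward API: the registration line is the syntax proposed above, and the override's signature is an assumption), such an override for rho-phi records might sum Cartesian components and re-zip the result:

import numpy as np
import awkward as ak

def special_function(array, axis=-1):
    # hypothetical override body: sum rho-phi records through their
    # Cartesian components, then rebuild rho-phi records from the sums
    x = ak._v2.sum(array["rho"] * np.cos(array["phi"]), axis=axis)
    y = ak._v2.sum(array["rho"] * np.sin(array["phi"]), axis=axis)
    return ak._v2.zip(
        {"rho": np.sqrt(x**2 + y**2), "phi": np.arctan2(y, x)},
        with_name="Vector2D",
    )

# the registration syntax proposed above (not yet implemented)
ak._v2.behavior[ak._v2.sum, "VectorArray2D"] = special_function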

There would have to be some rules on when special_function is applied. Overrides for axis < the depth of the matched name (e.g. "VectorArray2D") wouldn't make a lot of sense, since they cross list/record object boundaries:

>>> ak.sum(ak.Array([[1, 2, 3], [100, 200]]), axis=0)
<Array [101, 202, 3] type='3 * int64'>

While thinking about expanding capabilities in v2 is fun, item (1) in the above list is the higher-priority item: we need to avoid misleading users into thinking that ak.sum(geometric_vectors) does what they might think it does. An error message is better than a wrong answer.
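Until such an error exists, a minimal user-side guard is easy to write (a hypothetical helper, not part of Awkward):

import awkward as ak

def safe_sum(array, axis=-1):
    # refuse to reduce an array of records, where a naive field-by-field
    # sum would ignore the geometric meaning of the fields
    if ak.fields(array):
        raise TypeError(
            "refusing to ak.sum an array of records; "
            "sum each Cartesian component explicitly instead"
        )
    return ak.sum(array, axis=axis)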

(This issue is a rare bug-feature-policy!)

jpivarski commented on Mar 17 '22

Note that until this gets fixed, users will need some way to add up arbitrary numbers of vectors per list in an Awkward Array. This example code does that with Numba:

import time

import numpy as np
import awkward as ak
import numba as nb
import vector

vector.register_awkward()

fake_o = {"pt": 1.1, "phi": 2.2, "eta": 3.3, "mass": 10}

array = ak.Array(
    [
        [fake_o, fake_o, fake_o],
        [],
        [fake_o, fake_o],
        [fake_o, fake_o, fake_o, fake_o],
    ],
    with_name="Momentum4D",
)

@nb.jit
def add_lists_of_vectors(array):
    output_px = np.empty(len(array))
    output_py = np.empty(len(array))
    output_pz = np.empty(len(array))
    output_E = np.empty(len(array))

    for index, single_list in enumerate(array):
        # start from the zero vector and accumulate this sublist's sum
        sum_of_vectors = vector.obj(px=0.0, py=0.0, pz=0.0, E=0.0)

        for input_vector in single_list:
            sum_of_vectors = sum_of_vectors + input_vector

        output_px[index] = sum_of_vectors.px
        output_py[index] = sum_of_vectors.py
        output_pz[index] = sum_of_vectors.pz
        output_E[index] = sum_of_vectors.E

    return output_px, output_py, output_pz, output_E

starttime = time.time()
output_px, output_py, output_pz, output_E = add_lists_of_vectors(array)
print("compilation + runtime:", time.time() - starttime)

output = ak.zip(
    {"px": output_px, "py": output_py, "pz": output_pz, "E": output_E},
    with_name="Momentum4D",
)

print(output)

The use of Numba (or some compilation somewhere) is necessary. In some tests, I found that Numba compilation takes about 1.5 seconds and the runtime begins to dominate once the fake_o example above is scaled up by a factor of ten million, at which point the runtime is 0.6 microseconds per fake-o group.

But the pure Python runtime per fake-o group (i.e. drop the @nb.jit) is 0.1 seconds, which is 187000× slower. That's staggering, but there are a lot of culprits for the difference: indirection in Awkward Array's __getitem__, coordinate conversions in Vector, and so on. Python itself (if you were to streamline it by replacing the Awkward Arrays and Vectors with builtin lists and class instances with __slots__) shouldn't be much more than 100× or 1000× slower.

The reason the output is given as NumPy arrays for each component (output_px, etc.) is because of https://github.com/scikit-hep/vector/issues/43. If you want the output as an array of non-Cartesian vectors, substitute sum_of_vectors.pt, etc. for sum_of_vectors.px, etc.

jpivarski commented on Apr 05 '22

We implement this vector sum as a mixin method in coffea's vector classes: array.sum(axis=1) simply sums each Cartesian component. For example, this 2-vector implementation is copied from here:


    def sum(self, axis=-1):
        """Sum an array of vectors elementwise using `x` and `y` components"""
        out = awkward.zip(
            {
                "x": awkward.sum(self.x, axis=axis),
                "y": awkward.sum(self.y, axis=axis),
            },
            with_name="TwoVector",
            highlevel=False,
        )
        return awkward.Array(out, behavior=self.behavior)
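
For example, with an array of TwoVector records that carries this behavior, summing within each sublist is just (a usage sketch):

    summed = array.sum(axis=1)  # one TwoVector per outer list entry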

Unless I am missing something, I don't see why it would need to be done in numba.

nsmith- commented on Apr 06 '22

You're not missing anything; I just didn't think of that yesterday when I was talking with @kpedro88. What you have here works just as well.

This issue is to prevent ak.sum from doing misleading computations with non-Cartesian vectors, and v2 would eventually make it possible for Vector to override ak._v2.sum to do the right thing. The only difference between that and your solution is that it would be in ak._v2.sum, rather than a .sum method. (That distinction could be called "cosmetic" if it were not for the fact that the ak._v2.sum function still exists and we don't want them to do different things, one correct and the other incorrect.)

In your uses, is axis != -1 ever meaningful? Actually, I guess it may be:

array = [
    [vector_A, vector_B, vector_C],
    [vector_D,     None, vector_E],
    [vector_F],
]

I guess there are cases when you'd want to get

[vector_A + vector_D + vector_F, vector_B, vector_C + vector_E]

I had been thinking of having the reducer-overload exclude axis != -1, but I guess we shouldn't, just as you don't.
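A concrete sketch of that axis=0 case with the field-wise approach (illustrative x-y records stand in for vector_A ... vector_F; the None marks the missing entry):

import awkward as ak

vectors = ak.Array(
    [
        [{"x": 1.0, "y": 0.0}, {"x": 0.0, "y": 1.0}, {"x": 1.0, "y": 1.0}],
        [{"x": 2.0, "y": 0.0}, None, {"x": 0.0, "y": 2.0}],
        [{"x": 3.0, "y": 3.0}],
    ]
)

# axis=0 sums column-wise across sublists; the None contributes nothing
columnwise = ak.zip(
    {"x": ak.sum(vectors.x, axis=0), "y": ak.sum(vectors.y, axis=0)}
)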

jpivarski commented on Apr 06 '22

The other difference is that the Numba solution adds the vectors in their native representation, while the Coffea solution requires converting to Cartesian, correct?

kpedro88 commented on Apr 06 '22

That's true, though I'm pretty sure that all of Vector's + operations convert to Cartesian. The Numba solution leaves open the possibility of Vector adding more optimized + in the future.

Nope! I'm wrong! If two vectors both have rho-phi azimuthal coordinates, they will be added in that system and the result will be rho-phi.

https://github.com/scikit-hep/vector/blob/6553752691fd41120ab095e2a5c875e894e475cf/src/vector/_compute/planar/add.py#L45-L53

So, for instance, these vectors stay in rho-phi:

>>> vector.obj(rho=1.1, phi=0.3) + vector.obj(rho=2.2, phi=-0.4)
vector.obj(rho=3.122793010504687, phi=-0.17108096774473536)

But anything else will go to x-y:

>>> vector.obj(rho=1.1, phi=0.3) + vector.obj(x=2.2, y=4.4)
vector.obj(x=3.2508701380381666, y=4.725072227327474)

Longitudinal and temporal coordinates always convert to Cartesian before adding. (Currently. Changing that shouldn't break anything downstream if you always get coordinates from properties, rather than field values.)
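For reference, the rho-phi-native result can be reproduced with standard polar-vector math (a sketch; the details may differ from the linked Vector implementation):

import numpy as np

def add_rho_phi(rho1, phi1, rho2, phi2):
    # polar-vector addition that keeps the result in rho-phi
    rho = np.sqrt(rho1**2 + rho2**2 + 2.0 * rho1 * rho2 * np.cos(phi2 - phi1))
    phi = np.arctan2(
        rho1 * np.sin(phi1) + rho2 * np.sin(phi2),
        rho1 * np.cos(phi1) + rho2 * np.cos(phi2),
    )
    return rho, phi

print(add_rho_phi(1.1, 0.3, 2.2, -0.4))  # approximately (3.1228, -0.1711), as in the example above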

jpivarski commented on Apr 06 '22

If you're summing a bunch, then doing the Cartesian conversion once is more efficient anyway.

nsmith- commented on Apr 06 '22

The two issues described here have been split: the part about adding the ability to override reducers in v2 is now issue #1423.

jpivarski commented on Apr 15 '22