Global reduction

Open. kaushikcfd opened this issue 4 years ago • 1 comment

The following script:

import loopy as lp
import numpy as np

ngroups = 2
group_size = 4

knl = lp.make_kernel(
    "{[i]: 0<=i<400}",
    """
    out = sum(i, x[i] ** 2)
    """,
    [lp.GlobalArg("x", dtype=np.float64, shape=lp.auto),
     ...])


knl = lp.split_iname(knl, "i", ngroups * group_size)
knl = lp.split_iname(knl, "i_inner", group_size,
                     inner_tag="l.0", outer_tag="g.0")
knl = lp.split_reduction_inward(knl, "i_outer")
print(lp.generate_code_v2(knl).device_code())

fails with the error message:

loopy.diagnostic.LoopyError: the only form of parallelism supported by reductions is 'local'--found iname(s) 'i_inner_outer' respectively tagged 'frozenset({GroupInameTag(axis=0)})'

That is, reduction across groups isn't supported. I think rewriting the final-stage reduction as an atomic-sum instruction should be fine. Is there a reason this was avoided?

kaushikcfd, Aug 01 '21 01:08

In either case, you'll need a global barrier. After that, you might as well run a (short!) sequential reduction loop, which is going to be faster (and matches best practices for GPU reduction).

inducer, Aug 03 '21 20:08
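For readers following the thread, the two-stage scheme suggested in the reply can be sketched in plain Python, without loopy. This is only an illustration of the control flow, not loopy's implementation or generated code: the helper `group_partial`, the list `partials`, and the input values in `x` are hypothetical, and the index decomposition merely mirrors the two `split_iname` calls from the script. Each group first reduces its own share of `x` into one partial sum. After the point where a global barrier would sit, a short sequential loop over the `ngroups` partials produces the result.

```python
ngroups = 2
group_size = 4
n = 400

# Hypothetical input data; the original script leaves x unspecified.
x = [float(i) for i in range(n)]


def group_partial(g):
    """Stage 1: sum of squares over the entries owned by group g.

    Mirrors i = i_outer * (ngroups * group_size) + g * group_size + lane,
    i.e. the index layout produced by the two split_iname calls.
    """
    chunk = ngroups * group_size
    total = 0.0
    for i_outer in range(n // chunk):
        for lane in range(group_size):  # would be work-items within a group
            i = i_outer * chunk + g * group_size + lane
            total += x[i] ** 2
    return total


# Stage 1: one partial result per group (independent, so parallelizable).
partials = [group_partial(g) for g in range(ngroups)]

# A global barrier would be needed here: every partial must be written
# before any of them is read.

# Stage 2: short sequential reduction over only ngroups values.
out = 0.0
for p in partials:
    out += p

print(out)
```

The second stage touches only `ngroups` values, which is why a sequential loop is cheap there compared with having every group contend on a single atomic accumulator.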