Global reduction
The following script:
import loopy as lp
import numpy as np

ngroups = 2
group_size = 4

knl = lp.make_kernel(
    "{[i]: 0<=i<400}",
    """
    out = sum(i, x[i] ** 2)
    """,
    [lp.GlobalArg("x", dtype=np.float64, shape=lp.auto),
     ...])
knl = lp.split_iname(knl, "i", ngroups * group_size)
knl = lp.split_iname(knl, "i_inner", group_size,
                     inner_tag="l.0", outer_tag="g.0")
knl = lp.split_reduction_inward(knl, "i_outer")
print(lp.generate_code_v2(knl).device_code())
fails with the error message:
loopy.diagnostic.LoopyError: the only form of parallelism supported by reductions is 'local'--found iname(s) 'i_inner_outer' respectively tagged 'frozenset({GroupInameTag(axis=0)})'
That is, reduction across groups isn't supported. I think rewriting the final-stage reduction as an atomic-sum instruction should be fine. Is there a reason this was avoided?
In either case, you'll need a global barrier. After that, you might as well run a (short!) sequential reduction loop, which is going to be faster (and matches best practices for GPU reduction).
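To make the two-stage idea concrete, here is a minimal plain-Python model of it. This is only a sketch and not loopy API: the function names, the `partials` list standing in for a global temporary, and the index mapping (my reading of the two `split_iname` calls in the script) are all assumptions for illustration. It also assumes `len(x) == 400`, as in the script.

```python
# Plain-Python model of a two-stage reduction; none of this is loopy API.
NGROUPS = 2
GROUP_SIZE = 4
N = 400
CHUNK = NGROUPS * GROUP_SIZE  # elements covered by one i_outer step


def stage1_partials(x):
    """Stage 1: each group reduces its share of x to one partial sum."""
    partials = [0.0] * NGROUPS  # stands in for a global temporary array
    for g in range(NGROUPS):             # models i_inner_outer (g.0)
        for l in range(GROUP_SIZE):      # models i_inner_inner (l.0)
            acc = 0.0
            for i_outer in range(N // CHUNK):  # sequential per work item
                i = i_outer * CHUNK + g * GROUP_SIZE + l
                acc += x[i] ** 2
            partials[g] += acc           # models the within-group reduction
    return partials


def stage2_total(partials):
    """Stage 2: one short sequential loop over the per-group partials."""
    total = 0.0
    for p in partials:
        total += p
    return total


def two_stage_sum_of_squares(x):
    partials = stage1_partials(x)
    # A global barrier (or a second kernel launch) would sit here, so that
    # every group's partial is visible before the final loop reads it.
    return stage2_total(partials)
```

The final loop only runs over `NGROUPS` entries, which is why it can stay sequential; an atomic sum into `out` would replace `stage2_total` but would still need `out` to be initialized before any group adds to it.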