loopy
loopy copied to clipboard
What to do about separate "loop entry roots" sharing a domain
Consider:
knl = lp.make_kernel(
"{[i,j, l]: 0<=i<20 and 0<=j<=i and 0<=l<10}",
"""
z[l] = l
y[i,j] = i+j
""")
knl = lp.tag_inames(knl, "i:g.0, l:g.0")
print(lp.generate_code_v2(knl).device_code())
generates the code:
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ y, __global int *__restrict__ z)
{
if (9 + -1 * gid(0) >= 0)
z[gid(0)] = gid(0);
for (int j = 0; j <= gid(0); ++j)
y[20 * gid(0) + j] = gid(0) + j;
}
From #127 it was established that an instance of an instruction would be entered iff the domain it's inames are part of are all non-empty. Which is why I would have expected the generated code to be:
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ y, __global int *__restrict__ z)
{
if (9 + -1 * gid(0) >= 0)
z[gid(0)] = gid(0);
if (9 + -1 * gid(0) >= 0)
for (int j = 0; j <= gid(0); ++j)
y[20 * gid(0) + j] = gid(0) + j;
}
/cc @inducer
FWIW, I don't dislike the justification in #127. I'm concerned that it isn't that as strictly upheld i.e. there's some sort of difference between hardware axes inames and "other" kernel parameters, which doesn't seem appropriate.
Also, if there will be a proposition to make changes to the spec., it must be noted that it noticeably breaks back-compatibility. Such domains are frequently seen in the test suite pertaining prefetching tests (like optimizing gemms) where multiple non-equivalent hardware axes are part of the same domain basic set.
TBH, I'm a bit torn about this situation. (Btw hope you're OK with my retitling of this issue.)
The additional conditional you are asking for is the only reasonable thing to emit given this domain and kernel, IMO. So I am entirely OK with going that route and just eating the relating semantics change; the prior behavior (to me) makes no sense and should be considered a bug.
At the same time, I find it hard to argue that this whole situation makes a great deal of sense, so I'm also tempted to ~~require~~ prohibit (with a deprecation period, sure) separate "loop entry roots" (i and l, in your kernel above) sharing a domain. (I.e. inames in a domain are only allowed to cover loops that nest.) How do you feel about this direction?
I'm also tempted to require prohibit (with a deprecation period, sure) separate "loop entry roots" (i and l, in your kernel above) sharing a domain.
For a kernel with a kernel-wide outer loop (say an FEM action kernel), all the inames would get condensed into a single basic set which would lead to pretty expensive ISL computations.
I think just fixing the current bug in the code-generation should be fine. We just need to take care that the inames introduced by add_prefetch end up in different basic sets.
I had started my implementation in #372 by considering #127 as a tenet, so I'm planning to upstream those fixes with that PR.
For a kernel with a kernel-wide outer loop (say an FEM action kernel), all the inames would get condensed into a single basic set which would lead to pretty expensive ISL computations.
I'm actually pushing for the opposite: Each loopy entry root must have its own domain, so this would lead to finer-grain (rather than coarser-grain) domains.
IIUC the following loop nest:
for iel
for idof1
end idof1
for idof2
end idof2
.
.
.
for idof1000
end idof1000
end iel
Wouldn't this always lead to a kernel with single domain as: {[iel, idof1, idof2, ... idof1000]: ...}
(edited to clarify the downside of the approach)
The rule I'm proposing would prohibit separate loops in the tree from sharing a domain, in a way a sharpening of the "no branching" rule that exists. What it would mean is that, at any given nesting level of the tree, loop entries would not be allowed to share domains.
In an extension of your iel/idof* example:
for iel
for idof1
end idof1
for idof2
end idof
end
for iel2
for idof3
end
end
idof1andidof2would not be allowed to share a domain.ielandiel2would not be allowed to share a domain.iel2andidof3would be allowed to share a domain.- To be discussed: Would
ielandidof1be allowed to share a domain?
I'm liking it, it's precisely defined and I am unable to break it.
Would iel and idof1 be allowed to share a domain?
I would allow it as I don't see a reason not to.
One thing to note is that much of this is not supported in current loopy as it would require branching in the domain tree. For ex.:
knl = lp.make_kernel(
["{[i]: 0<=i<10}",
"[i] -> {[j]: 0<=j<=i }",
"[i] -> {[k]: 0<=k<=i }",
],
"""
x[i, j, k] = i+j+k
""")
raises:
loopy.diagnostic.CannotBranchDomainTree: iname set 'k, j, i' requires branch in domain tree (when adding 'j')