loopy icon indicating copy to clipboard operation
loopy copied to clipboard

What to do about separate "loop entry roots" sharing a domain

Open kaushikcfd opened this issue 4 years ago • 9 comments

Consider:

knl = lp.make_kernel(
        "{[i,j, l]: 0<=i<20 and 0<=j<=i and 0<=l<10}",
        """
        z[l] = l
        y[i,j] = i+j
        """)
knl = lp.tag_inames(knl, "i:g.0, l:g.0")

print(lp.generate_code_v2(knl).device_code())

generates the code:

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ y, __global int *__restrict__ z)
{
  if (9 + -1 * gid(0) >= 0)
    z[gid(0)] = gid(0);
  for (int j = 0; j <= gid(0); ++j)
    y[20 * gid(0) + j] = gid(0) + j;
}

From #127 it was established that an instance of an instruction would be entered iff the domain it's inames are part of are all non-empty. Which is why I would have expected the generated code to be:

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ y, __global int *__restrict__ z)
{
  if (9 + -1 * gid(0) >= 0)
    z[gid(0)] = gid(0);
  if (9 + -1 * gid(0) >= 0)
    for (int j = 0; j <= gid(0); ++j)
      y[20 * gid(0) + j] = gid(0) + j;
}

kaushikcfd avatar May 31 '21 16:05 kaushikcfd

/cc @inducer

kaushikcfd avatar May 31 '21 16:05 kaushikcfd

FWIW, I don't dislike the justification in #127. I'm concerned that it isn't that as strictly upheld i.e. there's some sort of difference between hardware axes inames and "other" kernel parameters, which doesn't seem appropriate.

Also, if there will be a proposition to make changes to the spec., it must be noted that it noticeably breaks back-compatibility. Such domains are frequently seen in the test suite pertaining prefetching tests (like optimizing gemms) where multiple non-equivalent hardware axes are part of the same domain basic set.

kaushikcfd avatar May 31 '21 16:05 kaushikcfd

TBH, I'm a bit torn about this situation. (Btw hope you're OK with my retitling of this issue.)

The additional conditional you are asking for is the only reasonable thing to emit given this domain and kernel, IMO. So I am entirely OK with going that route and just eating the relating semantics change; the prior behavior (to me) makes no sense and should be considered a bug.

At the same time, I find it hard to argue that this whole situation makes a great deal of sense, so I'm also tempted to ~~require~~ prohibit (with a deprecation period, sure) separate "loop entry roots" (i and l, in your kernel above) sharing a domain. (I.e. inames in a domain are only allowed to cover loops that nest.) How do you feel about this direction?

inducer avatar May 31 '21 17:05 inducer

I'm also tempted to require prohibit (with a deprecation period, sure) separate "loop entry roots" (i and l, in your kernel above) sharing a domain.

For a kernel with a kernel-wide outer loop (say an FEM action kernel), all the inames would get condensed into a single basic set which would lead to pretty expensive ISL computations.

I think just fixing the current bug in the code-generation should be fine. We just need to take care that the inames introduced by add_prefetch end up in different basic sets.

I had started my implementation in #372 by considering #127 as a tenet, so I'm planning to upstream those fixes with that PR.

kaushikcfd avatar May 31 '21 17:05 kaushikcfd

For a kernel with a kernel-wide outer loop (say an FEM action kernel), all the inames would get condensed into a single basic set which would lead to pretty expensive ISL computations.

I'm actually pushing for the opposite: Each loopy entry root must have its own domain, so this would lead to finer-grain (rather than coarser-grain) domains.

inducer avatar May 31 '21 17:05 inducer

IIUC the following loop nest:

for iel
  for idof1
  end idof1

  for idof2
  end idof2
 .
 .
 .

  for idof1000
  end idof1000

end iel

Wouldn't this always lead to a kernel with single domain as: {[iel, idof1, idof2, ... idof1000]: ...}

(edited to clarify the downside of the approach)

kaushikcfd avatar May 31 '21 17:05 kaushikcfd

The rule I'm proposing would prohibit separate loops in the tree from sharing a domain, in a way a sharpening of the "no branching" rule that exists. What it would mean is that, at any given nesting level of the tree, loop entries would not be allowed to share domains.

In an extension of your iel/idof* example:

for iel
  for idof1
  end idof1

  for idof2
  end idof
end
for iel2
  for idof3
  end
end
  • idof1 and idof2 would not be allowed to share a domain.
  • iel and iel2 would not be allowed to share a domain.
  • iel2 and idof3 would be allowed to share a domain.
  • To be discussed: Would iel and idof1 be allowed to share a domain?

inducer avatar May 31 '21 17:05 inducer

I'm liking it, it's precisely defined and I am unable to break it.

Would iel and idof1 be allowed to share a domain?

I would allow it as I don't see a reason not to.

kaushikcfd avatar May 31 '21 19:05 kaushikcfd

One thing to note is that much of this is not supported in current loopy as it would require branching in the domain tree. For ex.:

knl = lp.make_kernel(
    ["{[i]: 0<=i<10}",
     "[i] -> {[j]: 0<=j<=i }",
     "[i] -> {[k]: 0<=k<=i }",
     ],
    """
    x[i, j, k] = i+j+k
    """)

raises:

loopy.diagnostic.CannotBranchDomainTree: iname set 'k, j, i' requires branch in domain tree (when adding 'j')

kaushikcfd avatar Jun 01 '21 05:06 kaushikcfd