loopy icon indicating copy to clipboard operation
loopy copied to clipboard

Negative loop increment?

Open thisiscam opened this issue 4 years ago • 9 comments
trafficstars

Hi,

Does loopy support scheduling to negative loop increment? I tried the following:

import numpy as np
import pyopencl as cl
import pyopencl.array
import pyopencl.clrandom

import loopy as lp

lp.set_caching_enabled(False)
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2

from warnings import filterwarnings, catch_warnings

filterwarnings('error', category=lp.LoopyWarning)

ctx = cl.create_some_context(interactive=True)
queue = cl.CommandQueue(ctx)

n = 1000
a_bk = cl.array.arange(queue, n, dtype=np.float32)
a = a_bk.copy()

knl = lp.make_kernel(
    "{ [i]: 0<=i<n }",
    "a[i] = 1 + a[i+1]",
    assumptions="n>=0",
)

knl = lp.set_options(knl, "write_code")
evt, (out,) = knl(queue, a=a)

and it gives me the following, seemingly incorrect code.

#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global float *__restrict__ a, int const n)
{
  for (int i = 0; i <= -1 + n; ++i)
    a[i] = 1.0f + a[1 + i];
}

I expected the loop to be reversed, calculating lower indices from higher indices. Is this forbidden in loopy? Where can I find out about the scheduling algorithm used by loopy?

thisiscam avatar Oct 12 '21 05:10 thisiscam

No it doesn't support negative loop increments, but the index inames could be manipulated to mimic the accesses as negative increments.

knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    """
    a[i] = 1 + a[i+1]
    """)

knl = lp.map_domain(knl,
                    isl.BasicMap("[n] -> {[i] -> [rev_i]: rev_i = n - 1- i}"))

print(lp.generate_code_v2(knl).device_code())

generates

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ a, int const n)
{
  for (int rev_i = 0; rev_i <= -1 + n; ++rev_i)
    a[-1 + -1 * rev_i + n] = 1 + a[-1 * rev_i + n];
}

kaushikcfd avatar Oct 12 '21 05:10 kaushikcfd

Ok I see.

I'm confused with how does loopy iterate through the (original) domain. My guess: it always 1) go from left to right of each iname in the space 2) iterate each iname from lower indices to higher indices? So for example, the domain {[i, j] : 0 <= i < N & 0 <= j < i} would produce the loop:

for i in range(0, N):
    for j in range(0, i):
       ....

?

thisiscam avatar Oct 12 '21 05:10 thisiscam

Where can I find out about the scheduling algorithm used by loopy?

AFAIK, there isn't a concrete document that outlines this. (cc @inducer?)

The only constraint that loopy's scheduler currently takes into account is the dependencies imposed by the statement DAG and the nesting constraints imposed by other inames of the statements. So that's why it generated the OpenCL kernel you saw earlier.

kaushikcfd avatar Oct 12 '21 05:10 kaushikcfd

The only constraint that loopy's scheduler currently takes into account is the dependencies imposed by the statement DAG and the nesting constraints imposed by other inames of the statements. So that's why it generated the OpenCL kernel you saw earlier.

Yeah, that was my guess. I was wondering what's the rationale behind this design decision?

My use case is that I just need a code generator. I can provide an affine schedule myself, if necessary. It would be really good if there's a low level API in loopy that I can just plug in that schedule and get back code.

thisiscam avatar Oct 12 '21 05:10 thisiscam

So for example, the domain {[i, j] : 0 <= i < N & 0 <= j < i} would produce the loop:

Unless the user forces more constraints via prioritize_loops both the following are possible:

for i in range(0, N):
    for j in range(0, i):
       ....

OR

for j in range(0, N-1):
    for i in range(j+1, N):
       ....

Also note that If there was a statement in the above kernel exclusively in the iname 'i' or exclusively in the iname 'j' then that constraint would also be enough to determine the outer-loop of the two.

kaushikcfd avatar Oct 12 '21 05:10 kaushikcfd

@kaushikcfd Oh I see. So there is some (seemingly undocumented) heuristic algorithm that is doing the scheduling? I guess it confused me when it recognizes inter-statement dependencies but not inner-statement dependencies.

Is there a reason loopy went with a customized scheduling algorithm instead of something like Pluto?

thisiscam avatar Oct 12 '21 05:10 thisiscam

Is there a reason loopy went with a customized scheduling algorithm instead of something like Pluto?

Loopy intends to solve slightly different problems. It is a transformation engine that borrows concepts from the polyhedral representation to provide access to loop/data transformations, but does not restrict to only polyhedral programs, non-affine accesses are just fine you just lose the ability to perform certain loop transformations on them; codegen etc. will work fine. Since, it is a transformation engine, reordering the schedule based on an internal heuristic goes against it being a transparent transformation engine.

kaushikcfd avatar Oct 12 '21 14:10 kaushikcfd

reordering the schedule based on an internal heuristic

But there is still an implicitly assumed schedule by loopy?

The loopy documentation says that

loopy’s programming model is completely unordered by default.

There are two approaches I can think of under this model:

One approach is that the use can specify dependencies so that loopy will do scheduling (don't have to be Pluto, but it will be some scheduling algorithm). Loopy certainly does this to some extent (I can specify dep=...). But that seems restrictive, since currently I can't specify instance-wise dependencies?

Another approach is to let the user specify the schedule (which seems to be what map_domain is doing, to some extent).

It's unclear which approach loopy is taking -- my intuition is that it is doing something of a mix of the two, yet undocumented what the tradeoffs are?

thisiscam avatar Oct 12 '21 17:10 thisiscam

But there is still an implicitly assumed schedule by loopy?

When a kernel is under-constrained, then the code is generated for one of the possible schedules. This was taken to favor better user experience.

Another approach is to let the user specify the schedule (which seems to be what map_domain is doing, to some extent).

The user could further influence the scheduling by using transformations.

Another approach is to let the user specify the schedule (which seems to be what map_domain is doing, to some extent).

Loopy's code-generator always emits loops from 'start' to 'end' in unit increments, 'start' and 'end' are piecewise affine expressions in terms of the kernel arguments and the inames that are nested outside the loop. So using that and lp.map_domain, the instances of each statement could be ordered to follow certain access pattern. (But I agree, this should be engraved in gold somewhere in the docs. Let's keep this issue open until the docs includes this)

kaushikcfd avatar Oct 12 '21 17:10 kaushikcfd