lpython icon indicating copy to clipboard operation
lpython copied to clipboard

Vectorization

Open certik opened this issue 3 years ago • 2 comments

We need to implement ASR->ASR passes that implement vectorization. It seems we write a pragma (comment) which will direct the optimizer to apply a given transformation and specify exactly which one and which parameters for the given loop. Much later we can think about creating some heuristic optimizer (possibly profile driven). In this issue we simply have to implement the mechanism and implement all options. So the design space is relatively limited. The hard part is to figure out how to best represent the vector operations in ASR.

Examples of loops:

Array copy

a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
# Equivalently: A[:] = B[:]
for i in range(n):
    A[i] = B[i]

->

a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
for i in range(0,n,1024):
    A[i*1024:(i+1)*1024] = B[i*1024:(i+1)*1024]
# plus a remainder loop

Possibly this can be transformed into:

a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
for i in range(0,n,1024):
    vector_copy(1024, A[i*1024:(i+1)*1024], B[i*1024:(i+1)*1024])
# plus a remainder loop

This all happens at the ASR level. Then the backend very straighforwardly transforms vector_copy(n, A, B) with n=1024 (or other well defined vector lengths) into a loop over AVX-512 or ARM vector instructions. Or generates LLVM code that LLVM itself knows how to efficiently vectorize.

Matrix-vector multiply

for i in range(n):
    y[i] = 0
    for j in range(n):
        y[i] += A[i,j] * x[j]

We can vectorize the inner loop:

for i in range(n):
    y[i] = 0
    for j in range(0,n,1024):
        y[i] += A[i,j*1024:(j+1)*1024] * x[j*1024:(j+1)*1024]

Or the outer loop:

for i in range(0,n,1024):
    y[i*1024:(i+1)*1024] = 0
    for j in range(n):
        y[i*1024:(i+1)*1024] += A[i*1024:(i+1)*1024,j] * x[j]

We can also permute the loop order, such as:

y[:] = 0
for j in range(n):
    for i in range(n):
        y[i] += A[i,j] * x[j]

And again we can vectorize the inner and outer loop.

So there are at least 4 possibilities and we have to implement all 4 and allow the user to select using a pragma. There might be more options (I can't remember, but one might perhaps unroll the outer loop but vectorize the inner loop, or something like that).

For the remainder loop there are again at least two options: either iterate exactly over the correct length and not vectorize, or pad it to the nearest vector length and also vectorize.

certik avatar Jun 19 '22 18:06 certik

Will start working on this after completing #651.

czgdp1807 avatar Jun 20 '22 14:06 czgdp1807

I would like to implement this. Can anyone provide resources to get started on? Is this similar to loop_vectorise in libasr/pass/ ?

tanay-man avatar Feb 10 '24 16:02 tanay-man