loopy lp.buffer_array allocates more memory than necessary

Open kaushikcfd opened this issue 3 years ago • 1 comment

Consider the kernel:

import loopy as lp


knl = lp.make_kernel(
    "{[e, i, j]: 0<=e<100000 and 0<=i, j<32}",
    """
    y[e, i] = sum(j, A[i, j] * x[e, j])
    """)

knl = lp.split_iname(knl, "i", 8, inner_tag="l.0")
knl = lp.buffer_array(knl, "y", "i_outer",
                      default_tag=None, temporary_is_local=False)

This generates the following kernel:

---------------------------------------------------------------------------
KERNEL: loopy_kernel
---------------------------------------------------------------------------
ARGUMENTS:
A: type: <auto/runtime>, shape: (32, 32), dim_tags: (N1:stride:32, N0:stride:1) aspace: global
x: type: <auto/runtime>, shape: (100000, 32), dim_tags: (N1:stride:32, N0:stride:1) aspace: global
y: type: <auto/runtime>, shape: (100000, 32), dim_tags: (N1:stride:32, N0:stride:1) aspace: global
---------------------------------------------------------------------------
DOMAINS:
{ [e, j, i_outer, i_inner, y_init_1, y_store_1] : (y_init_1) mod 8 = 0 and (y_store_1) mod 8 = 0 and 0 <= e <= 99999 and 0 <= j <= 31 and i_inner >= 0 and -8i_outer <= i_inner <= 31 - 8i_outer and i_inner <= 7 and -i_inner <= y_init_1 <= 31 - i_inner and -i_inner <= y_store_1 <= 31 - i_inner }
---------------------------------------------------------------------------
INAME TAGS:
e: None
i_inner: l.0
i_outer: None
j: None
y_init_1: None
y_store_1: None
---------------------------------------------------------------------------
TEMPORARIES:
y_buf: type: <auto/runtime>, shape: (25), dim_tags: (N0:stride:1) scope:private
---------------------------------------------------------------------------
INSTRUCTIONS:
   for i_inner, e, y_init_1
↱        y_buf[y_init_1] = y[e, i_inner + y_init_1]  {id=init_y}
│      end y_init_1
│      for i_outer
└↱       y_buf[8*i_outer] = reduce(sum, [j], A[i_inner + i_outer*8, j]*x[e, j])  {id=insn}
 │     end i_outer
 │     for y_store_1
 └       y[e, i_inner + y_store_1] = y_buf[y_store_1]  {id=store_y, no_sync_with=init_y@any}
   end i_inner, e, y_store_1
---------------------------------------------------------------------------

I was expecting y_buf to be a 4-long private array with unit-stride access. However, it comes out as a 25-long array with an 8-stride access :astonished:. Is anything preventing us from emitting the former?
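To make the 25-vs-4 discrepancy concrete, here is a minimal sketch in plain Python (no loopy or isl involved) that enumerates the `y_init_1` values permitted by the printed domain. It assumes the shape of `y_buf` comes from a bounding box over the touched indices, which matches the `(25)` shown above but is my reading of the output rather than something confirmed from loopy's source.

```python
# Enumerate the buffer indices permitted by the printed domain:
#   y_init_1 mod 8 = 0  and  -i_inner <= y_init_1 <= 31 - i_inner,
#   with 0 <= i_inner <= 7.
touched = sorted({
    y_init_1
    for i_inner in range(8)
    for y_init_1 in range(-i_inner, 32 - i_inner)
    if y_init_1 % 8 == 0
})

# Length implied by a bounding box over the touched indices (assumption:
# this is how the temporary's shape is derived).
bounding_box_len = max(touched) - min(touched) + 1

# Number of entries that are actually read or written.
distinct_len = len(touched)

print(touched, bounding_box_len, distinct_len)
```

Only every eighth entry of the buffer is touched, so a dense layout would need `distinct_len` entries rather than `bounding_box_len`.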

Feb 05 '22 19:02 kaushikcfd

I think what you're encountering is that buffer_array (along with precompute) currently lays out the buffer in the same way as the underlying array. Hermite normal form is, AFAIK, the correct way to compute a compression mapping. I think I looked at one point, and there is an implementation of that in isl.
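As a rough illustration of what such a compression mapping would do for this kernel: in one dimension, the Hermite normal form of a single row of integers reduces to their gcd, so the mapping can be sketched with the standard library alone. `compression_map` below is a hypothetical helper written for this example; it is not part of loopy or isl, and the general multi-dimensional case would need a real lattice computation.

```python
from functools import reduce
from math import gcd


def compression_map(indices):
    """Return (base, stride, length) such that idx -> (idx - base) // stride
    maps the given 1D indices injectively into range(length).

    Hypothetical helper for illustration; expects a non-empty iterable.
    """
    indices = sorted(set(indices))
    base = indices[0]
    # gcd of all offsets from the base: the 1D analogue of the HNF.
    # Fall back to stride 1 when there is only a single index.
    stride = reduce(gcd, (idx - base for idx in indices), 0) or 1
    length = (indices[-1] - base) // stride + 1
    return base, stride, length


# Indices written by the reduction instruction: y_buf[8*i_outer], i_outer in 0..3.
base, stride, length = compression_map(8 * i_outer for i_outer in range(4))

# Compressed (unit-stride) index for each i_outer.
compressed = [(8 * i_outer - base) // stride for i_outer in range(4)]
```

For the accesses in this issue, the sketch yields a stride of 8 and a length of 4, i.e. the 4-long unit-stride buffer the original post expected.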

Feb 06 '22 21:02 inducer
