lp.buffer_array allocates more memory than necessary
Consider the kernel:
import loopy as lp

knl = lp.make_kernel(
    "{[e, i, j]: 0<=e<100000 and 0<=i, j<32}",
    """
    y[e, i] = sum(j, A[i, j] * x[e, j])
    """)
knl = lp.split_iname(knl, "i", 8, inner_tag="l.0")
knl = lp.buffer_array(knl, "y", "i_outer",
                      default_tag=None, temporary_is_local=False)
that generates the following kernel:
---------------------------------------------------------------------------
KERNEL: loopy_kernel
---------------------------------------------------------------------------
ARGUMENTS:
A: type: <auto/runtime>, shape: (32, 32), dim_tags: (N1:stride:32, N0:stride:1) aspace: global
x: type: <auto/runtime>, shape: (100000, 32), dim_tags: (N1:stride:32, N0:stride:1) aspace: global
y: type: <auto/runtime>, shape: (100000, 32), dim_tags: (N1:stride:32, N0:stride:1) aspace: global
---------------------------------------------------------------------------
DOMAINS:
{ [e, j, i_outer, i_inner, y_init_1, y_store_1] : (y_init_1) mod 8 = 0 and (y_store_1) mod 8 = 0 and 0 <= e <= 99999 and 0 <= j <= 31 and i_inner >= 0 and -8i_outer <= i_inner <= 31 - 8i_outer and i_inner <= 7 and -i_inner <= y_init_1 <= 31 - i_inner and -i_inner <= y_store_1 <= 31 - i_inner }
---------------------------------------------------------------------------
INAME TAGS:
e: None
i_inner: l.0
i_outer: None
j: None
y_init_1: None
y_store_1: None
---------------------------------------------------------------------------
TEMPORARIES:
y_buf: type: <auto/runtime>, shape: (25), dim_tags: (N0:stride:1) scope:private
---------------------------------------------------------------------------
INSTRUCTIONS:
for i_inner, e, y_init_1
↱ y_buf[y_init_1] = y[e, i_inner + y_init_1] {id=init_y}
│ end y_init_1
│ for i_outer
└↱ y_buf[8*i_outer] = reduce(sum, [j], A[i_inner + i_outer*8, j]*x[e, j]) {id=insn}
│ end i_outer
│ for y_store_1
└ y[e, i_inner + y_store_1] = y_buf[y_store_1] {id=store_y, no_sync_with=init_y@any}
end i_inner, e, y_store_1
---------------------------------------------------------------------------
I expected y_buf to be a private array of length 4 with unit-stride access. Instead, it comes out as a length-25 array accessed with a stride of 8 :astonished:. Is anything preventing us from emitting the former?
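To make the size mismatch concrete, here is a small plain-Python sketch (no loopy calls) that enumerates the y_buf subscripts from the `insn` instruction in the printed kernel. It assumes a fixed `e` and `i_inner`, and the variable names are only illustrative.

```python
# Subscripts of y_buf written by `insn` for one fixed (e, i_inner):
# the index is 8*i_outer, and i_outer runs over 0..3 (32 rows split by 8).
accessed = [8 * i_outer for i_outer in range(4)]

# A layout that mirrors the underlying array spans the whole index range.
bounding_box_len = max(accessed) - min(accessed) + 1

# A dense layout needs only one slot per distinct index.
dense_len = len(set(accessed))

print(accessed)          # [0, 8, 16, 24]
print(bounding_box_len)  # 25
print(dense_len)         # 4
```

The 25 entries reported under TEMPORARIES match the span of the index range, while only 4 of them are ever touched.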
I think what you're encountering is that buffer_array (along with precompute) currently lays out the buffer in the same way as the underlying array. Here the accessed indices are 0, 8, 16 and 24, so the buffer spans 25 entries instead of being compressed to 4. Hermite normal form is, AFAIK, the correct way to compute a compression mapping. I believe I looked into this at one point, and isl has an implementation of it.
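As a rough illustration of what such a compression mapping would do in this one-dimensional case: the lattice of accessed indices is generated by a single stride, which is the gcd of the index differences. The sketch below is plain Python, not loopy's or isl's API, and `compress_1d` is a hypothetical helper name. The general multi-dimensional case would need an actual Hermite normal form computation.

```python
from functools import reduce
from math import gcd


def compress_1d(indices):
    """Return (offset, stride, length) for a 1D strided index set.

    The compressed index of `idx` is (idx - offset) // stride.
    """
    offset = min(indices)
    # gcd of all differences from the smallest index; 0 if only one index.
    stride = reduce(gcd, (idx - offset for idx in indices), 0)
    if stride == 0:
        stride = 1
    length = (max(indices) - offset) // stride + 1
    return offset, stride, length


# Indices of y_buf touched in the kernel above: 8*i_outer for i_outer in 0..3.
accessed = [0, 8, 16, 24]
offset, stride, length = compress_1d(accessed)
compressed = [(idx - offset) // stride for idx in accessed]

print(offset, stride, length)  # 0 8 4
print(compressed)              # [0, 1, 2, 3]
```

With this mapping the buffer would have 4 entries and be indexed by i_outer directly, which is the layout the question asks for.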