occa
Dynamic @exclusive sizes
`@exclusive` array sizes for CPU modes are hard-coded to 256.
We should allocate the size depending on the full `@inner` loop size.
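As a rough illustration of the proposal, the sketch below emulates in plain C++ how a CPU mode might expand one `@exclusive` scalar into one slot per `@inner` iteration, sized from the loop extent rather than a fixed 256. This is not OCCA's generated code; `runOuterIteration` and `innerSize` are hypothetical names chosen for the example.

```cpp
#include <cstddef>
#include <vector>

// Emulates one @outer iteration in a CPU mode. The @exclusive scalar becomes
// an array with one slot per @inner iteration, so a value written in the first
// @inner loop is still available to the same iteration in the second one.
// Hypothetical sketch: names and layout are assumptions, not OCCA internals.
std::vector<int> runOuterIteration(std::size_t innerSize) {
  // Sized from the @inner loop extent instead of a hard-coded 256 entries.
  std::vector<int> exclusiveValue(innerSize);
  std::vector<int> out(innerSize);

  // First @inner loop: each iteration writes its own slot.
  for (std::size_t i = 0; i < innerSize; ++i) {
    exclusiveValue[i] = static_cast<int>(2 * i);
  }

  // Second @inner loop: each iteration reads back the slot it wrote.
  for (std::size_t i = 0; i < innerSize; ++i) {
    out[i] = exclusiveValue[i] + 1;
  }
  return out;
}
```

With a fixed 256-entry array, calling this with `innerSize = 300` would write past the end of the storage; sizing from the loop extent avoids that.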
What do you do when there is more than one @inner loop sharing the same @exclusive variable, but those @inner loops have different sizes? Do all @inner loops within the same @outer loop have to have the same size? If so, I suppose there is no issue, except you really ought to verify that either statically at compile time if possible, or dynamically at runtime. But if they can have different sizes, then it brings into question what @exclusive means logically in terms of usage.
I am wondering if @exclusive is an idea that is best associated with per-thread use cases, rather than per-loop-iteration use cases? The former can be accommodated readily since the max number of threads can be determined and used to size an @exclusive shared array variable.
Right now, `@inner` loops need to share the same number of iterations. However, `@exclusive` arrays are hard-coded to 256 entries, which creates hard-to-debug bugs.
I agree there should be a runtime check before launching kernels to verify the sizes
There isn't any thread local storage-like memory in OCCA, not sure how that would map to GPU memory but maybe something to keep in mind :)
Maybe only require (emit verification code) that all @inner loops have the same size when @exclusive is present. If @exclusive is not present within the @outer loop, then allow @inner loops to have different sizes.
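A minimal sketch of what such a pre-launch verification could look like, assuming the iteration count of every `@inner` loop inside one `@outer` loop is known before launch. `verifyInnerSizes`, `innerSizes`, and `usesExclusive` are hypothetical names for illustration, not part of the OCCA API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical pre-launch check. `innerSizes` holds the iteration count of
// each @inner loop inside one @outer loop; `usesExclusive` says whether an
// @exclusive variable is declared in that @outer loop.
// Throws std::runtime_error when @exclusive is present and the sizes differ.
void verifyInnerSizes(const std::vector<std::size_t>& innerSizes,
                      bool usesExclusive) {
  // Under this proposal, differing sizes are only rejected with @exclusive.
  if (!usesExclusive || innerSizes.empty()) {
    return;
  }
  const std::size_t expected = innerSizes.front();
  for (std::size_t size : innerSizes) {
    if (size != expected) {
      throw std::runtime_error(
          "@inner loop sizes differ (" + std::to_string(expected) + " vs " +
          std::to_string(size) + ") but @exclusive is in use");
    }
  }
}
```

The check only covers the proposed relaxation on the `@exclusive` side; any other constraint on `@inner` sizes would still need its own check.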
@pdhahn Make sure to use backticks to escape `@attributes` to prevent emailing random people
oops :-)
np, I was accidentally emailing `@dim` a lot and he kindly replied with that info
The `@inner` loop size restriction is due to GPU blocks/work groups needing to be the same size :(
Yes that is partly the motivation for my earlier allusion to a thread-oriented meaning for exclusive vs. an iteration-oriented meaning. But the loop variable lower and upper bounds, at least as specified by the OKL programmer, are arbitrary and are what is proposed to determine the size of the exclusive variable array, correct?
Yeah, based on the number of iterations and like the docs say
> The concept of exclusive memory is similar to thread-local storage, where a single variable actually has one value per thread.
so it's more like TLS than iterations
OK. I think I misinterpreted your first comment about allocating the exclusive memory array based on "full inner loop size", where I thought you meant the latter was defined by the loop index variable bounds at the logical OKL program code level, as specified by the OKL programmer, so there would always be one array element per iteration (unrelated to threads). But one element per thread (TLS) makes total sense, at least when the inner loop index variable does not exceed the max. number of threads per block. Like you said, the latter can be readily determined, e.g. as device work group size.
BTW it would be ideal if the OKL programmer did not have to consider any issues related to physical device constraints on granularity of the parallelization in the outer/inner loops (i.e., how computationally, for the ubiquitous block-oriented topology assumed by OCCA, the device hardware dimensions map to logical dimensions), such as max threads per work group. Ideally, that is all abstracted away for him completely, and he is free to specify outer/inner dimensions based on the raw, ungrouped extent of the data to be processed (e.g., like we can do using OpenMP `parallel for`). Or practically, abstract away at least as much as possible. OCCA/OKL goes a really long way in this regard, but may not be at the ideal point quite yet.
I think your first interpretation was right, I meant the concept was similar
- TLS: 1 thread - 1 value
- `@exclusive`: 1 iteration - 1 value
> it would be ideal if the OKL programmer did not have to consider any issues related to physical device constraints on granularity of the parallelization in the outer/inner loops
👍 I agree
It might mean OKL auto-tiles outer and inner loops if the loops go out of the device bounds (like too many threads or too many iterations for exclusives)
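To make the auto-tiling idea concrete, here is a small sketch that splits a loop extent into tiles no larger than a device limit. It only shows the index arithmetic; `tileRange` and `maxTile` are hypothetical names, and how OKL would actually rewrite the loops (including any synchronization between `@inner` loops) is not addressed here.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Splits `total` iterations into half-open [begin, end) tiles that each hold
// at most `maxTile` iterations. Hypothetical helper for illustration only.
std::vector<std::pair<std::size_t, std::size_t>> tileRange(std::size_t total,
                                                           std::size_t maxTile) {
  std::vector<std::pair<std::size_t, std::size_t>> tiles;
  if (maxTile == 0) {
    return tiles;  // No valid tiling; the caller decides how to report this.
  }
  for (std::size_t begin = 0; begin < total; begin += maxTile) {
    // The last tile may be shorter than maxTile.
    tiles.emplace_back(begin, std::min(begin + maxTile, total));
  }
  return tiles;
}
```

For example, 600 iterations with a limit of 256 yield the tiles [0, 256), [256, 512), and [512, 600).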
A note - I've run into memory errors related to this limitation when the size of inner(0) > 256.
@jlchan sorry about that! Maybe we should increase the number as a temporary fix
no worries. I don't need it at the moment, but would it be useful to just add a warning flag during the OKL build?
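A warning along these lines could be sketched as follows, assuming the `@inner` loop size is known at the point where the check runs (which may only hold when the bound is a compile-time constant). `exclusiveSizeWarning` and `kExclusiveCapacity` are hypothetical names; only the value 256 comes from this thread.

```cpp
#include <cstddef>
#include <string>

// Capacity of @exclusive arrays in CPU modes, as reported in this thread.
constexpr std::size_t kExclusiveCapacity = 256;

// Hypothetical check: returns a warning message when `innerSize` exceeds the
// @exclusive capacity, or an empty string when the size fits.
std::string exclusiveSizeWarning(std::size_t innerSize) {
  if (innerSize <= kExclusiveCapacity) {
    return "";
  }
  return "warning: @inner loop size " + std::to_string(innerSize) +
         " exceeds the @exclusive array capacity of " +
         std::to_string(kExclusiveCapacity);
}
```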