
Dynamic @exclusive sizes

Open dmed256 opened this issue 6 years ago • 14 comments

@exclusive array sizes for CPU modes are hard-coded to 256

We should allocate the size based on the full @inner loop size
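A minimal sketch of the idea, assuming a CPU-mode expansion in which an @exclusive scalar becomes an array with one slot per @inner iteration. The function `runOuterIteration` and its arithmetic are hypothetical and only illustrate sizing the storage from the loop extent; this is not the code OCCA generates.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU-mode expansion of one @outer iteration that contains two
// @inner loops sharing an @exclusive variable `r`.
// In OKL source `r` looks like a scalar; on the CPU it has to become an
// array so each @inner iteration keeps its own value between the two loops.
std::vector<float> runOuterIteration(const std::vector<float> &input) {
  const std::size_t innerSize = input.size();

  // Sized from the @inner loop extent instead of a fixed 256 entries.
  std::vector<float> r(innerSize);
  std::vector<float> output(innerSize);

  // First @inner loop: each iteration writes its own slot.
  for (std::size_t i = 0; i < innerSize; ++i) {
    r[i] = 2.0f * input[i];
  }

  // Second @inner loop: each iteration reads back the slot it wrote.
  for (std::size_t i = 0; i < innerSize; ++i) {
    output[i] = r[i] + 1.0f;
  }
  return output;
}
```

With a fixed 256-entry array, any `innerSize` above 256 would write past the end of `r`; sizing from the loop extent avoids that.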

dmed256 avatar Jun 03 '18 14:06 dmed256

What do you do when more than one @inner loop shares the same @exclusive variable, but those @inner loops have different sizes? Do all @inner loops within the same @outer loop have to have the same size? If so, I suppose there is no issue, except that you really ought to verify it, either statically at compile time if possible or dynamically at runtime. But if they can have different sizes, that calls into question what @exclusive means logically in terms of usage.

I am wondering if @exclusive is an idea that is best associated with per-thread use cases, rather than per-loop-iteration use cases? The former can be accommodated readily, since the max number of threads can be determined and used to size an @exclusive shared array variable.

pdhahn avatar Jul 25 '18 14:07 pdhahn

Right now, @inner loops need to share the same number of iterations. However, @exclusive arrays are hard-coded to 256 entries, which creates hard-to-debug bugs.

I agree there should be a runtime check before launching kernels to verify the sizes

There isn't any thread local storage-like memory in OCCA, not sure how that would map to GPU memory but maybe something to keep in mind :)

dmed256 avatar Jul 25 '18 14:07 dmed256

Maybe only require (emit verification code) that all @inner loops have the same size when @exclusive is present. If @exclusive is not present within the @outer loop, then allow @inner loops to have different sizes.
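A minimal sketch of such a verification step, assuming the iteration count of every @inner loop in an @outer loop is known before launch. `checkInnerSizes` and its parameters are hypothetical names for illustration, not part of the OCCA API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical pre-launch check; not an OCCA function.
// innerSizes holds the iteration count of each @inner loop inside one
// @outer loop; usesExclusive says whether that @outer loop declares any
// @exclusive variable.
void checkInnerSizes(const std::vector<std::size_t> &innerSizes,
                     bool usesExclusive) {
  if (!usesExclusive || innerSizes.empty()) {
    return;  // Proposed relaxation: without @exclusive, sizes may differ.
  }
  for (std::size_t k = 1; k < innerSizes.size(); ++k) {
    if (innerSizes[k] != innerSizes[0]) {
      throw std::invalid_argument(
          "@inner loop " + std::to_string(k) + " has " +
          std::to_string(innerSizes[k]) + " iterations, expected " +
          std::to_string(innerSizes[0]));
    }
  }
}
```

Throwing is just one option here; emitted verification code could equally report the mismatch some other way.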

pdhahn avatar Jul 25 '18 15:07 pdhahn

@pdhahn Make sure to use backticks (`) to escape @attributes, to prevent emailing random people

dmed256 avatar Jul 25 '18 15:07 dmed256

oops :-)

pdhahn avatar Jul 25 '18 15:07 pdhahn

np, I was accidentally emailing @dim a lot and he kindly replied with that info

dmed256 avatar Jul 25 '18 15:07 dmed256

The @inner loop size restriction is due to GPU blocks/work groups needing to be the same size :(

dmed256 avatar Jul 25 '18 15:07 dmed256

Yes that is partly the motivation for my earlier allusion to a thread-oriented meaning for exclusive vs. an iteration-oriented meaning. But the loop variable lower and upper bounds, at least as specified by the OKL programmer, are arbitrary and are what is proposed to determine the size of the exclusive variable array, correct?

pdhahn avatar Jul 25 '18 15:07 pdhahn

Yeah, based on the number of iterations, and like the docs say:

"The concept of exclusive memory is similar to thread-local storage, where a single variable actually has one value per thread."

so it's more like TLS than iterations

dmed256 avatar Jul 25 '18 20:07 dmed256

OK. I think I misinterpreted your first comment about allocating the exclusive memory array based on "full inner loop size", where I thought you meant the latter was defined by the loop index variable bounds at the logical OKL program code level, as specified by the OKL programmer, so there would always be one array element per iteration (unrelated to threads). But one element per thread (TLS) makes total sense, at least when the inner loop index variable does not exceed the max. number of threads per block. Like you said, the latter can be readily determined, e.g. as device work group size.

BTW, it would be ideal if the OKL programmer did not have to consider any issues related to physical device constraints on granularity of the parallelization in the outer/inner loops (i.e., how computationally, for the ubiquitous block-oriented topology assumed by OCCA, the device hardware dimensions map to logical dimensions), such as max threads per work group. Ideally, that is all abstracted away for him completely, and he is free to specify outer/inner dimensions based on the raw, ungrouped extent of the data to be processed (e.g., like we can do using OpenMP parallel for). Or practically, abstract away at least as much as possible. OCCA/OKL goes a really long way in this regard, but may not be at the ideal point quite yet.

pdhahn avatar Jul 25 '18 21:07 pdhahn

I think your first interpretation was right, I meant the concept was similar

  • TLS: 1 thread - 1 value
  • @exclusive: 1 iteration - 1 value

"it would be ideal if the OKL programmer did not have to consider any issues related to physical device constraints on granularity of the parallelization in the outer/inner loops"

👍 I agree

It might mean OKL auto-tiles outer and inner loops if the loops go out of the device bounds (like too many threads or too many iterations for exclusives)
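A rough sketch of the tiling arithmetic this would involve, assuming a single device limit on iterations per tile. `tileIterations` is a hypothetical helper, not an existing OKL feature, and it only computes the split; it does not address how @shared memory or barriers would behave across tiles.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split `total` loop iterations into consecutive tiles
// of at most `maxTile` iterations each (for example, a device's maximum
// work-group size). Returns (start, count) pairs in order.
std::vector<std::pair<std::size_t, std::size_t>> tileIterations(
    std::size_t total, std::size_t maxTile) {
  std::vector<std::pair<std::size_t, std::size_t>> tiles;
  if (maxTile == 0) {
    return tiles;  // No valid tiling for a zero-sized limit.
  }
  for (std::size_t start = 0; start < total; start += maxTile) {
    // The last tile may be shorter than maxTile.
    tiles.emplace_back(start, std::min(maxTile, total - start));
  }
  return tiles;
}
```

For example, 600 iterations with a limit of 256 would split into tiles of 256, 256, and 88 iterations.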

dmed256 avatar Jul 25 '18 22:07 dmed256

A note - I've run into memory errors related to this limitation when the size of inner(0) > 256.

jlchan avatar Feb 14 '19 20:02 jlchan

@jlchan sorry about that! Maybe we should increase the number as a temporary fix

dmed256 avatar Feb 14 '19 23:02 dmed256

no worries. I don't need it at the moment, but would it be useful to just add a warning flag during the OKL build?
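A minimal sketch of what such a warning check could look like, assuming the @inner loop size is known at build time (it often is not, in which case the check would have to run at launch instead). `exclusiveSizeWarning` and `kExclusiveCapacity` are hypothetical names; only the value 256 comes from this issue.

```cpp
#include <cstddef>
#include <string>

// 256 is the hard-coded CPU-mode @exclusive capacity described in this issue.
constexpr std::size_t kExclusiveCapacity = 256;

// Hypothetical build-time diagnostic: returns a warning message when the
// @inner loop size exceeds the @exclusive capacity, or an empty string
// when the size is safe.
std::string exclusiveSizeWarning(std::size_t innerSize) {
  if (innerSize <= kExclusiveCapacity) {
    return std::string();
  }
  return "warning: @inner loop has " + std::to_string(innerSize) +
         " iterations, but @exclusive storage holds only " +
         std::to_string(kExclusiveCapacity) + " entries";
}
```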

jlchan avatar Feb 14 '19 23:02 jlchan