RAJA
RAJA copied to clipboard
Added Static Versions of TensorRef, TensorTileRef, VectorIndex, etc
Summary
This PR adds alternative types and specializations to structs and templates that involve the description of tensor bounds and strides. These alternatives and specializations represent these offsets, sizes, and strides statically though template arguments and static const values. This representation allows the calculation of these values to be inlined by the compiler when indexing tensors within a view.
The Problem being Addressed
When indexing tensors from a view, even if the layout and indexes are predetermined at compile time, the compiler uses the member arrays of TensorRef
and TensorTileRef
to calculate the offsets of the indexed data within the view. This is the case even with aggressive optimizations, and so leaves performance on the table by forcing the values in the TensorRef
's arrays to be loaded to calculate each index.
To eliminate this additional loading, the information contained within these arrays (strides,offsets,sizes) would need to be represented statically not only within TensorRef
, but within every function, type, and template used to get from an index operation all the way to its return. The types included in this PR serve this purpose by providing alternatives along each link in this chain of abstractions, allowing for both indexes and layouts to be represented statically or non-statically in any combination (including within expression templates) while preserving functionality.
Added Types
-
StaticTensorIndex
: Allows users to provide a compile-time representation of tensor indexes - [helper type]
StaticTensorIndexInner
: Used byStaticTensorIndex
so that templates which specialize based off ofStaticTensorIndex
but not the internal types/values can simply insert atypename
parameter when referring to its specialization -
StaticTensorRef
: The return type when indexing tensors in a view with aStaticLayout
usingStaticTensorIndex
s - [helper type]
RefBridge
: Prior to this PR, specializations ofTensorRegister
only provide implementations ofload_ref
andstore_ref
toTensorRefs
of the appropriate dimensionality. To extend these methods to includeStaticTensorRef
while maintaining this constraint, this inner type is required. -
StaticTensorTileRef
: Used internally byStaticTensorRef
and behaves as if it was aTensorTileRef
, apart from not having modifiable begin and size arrays - [helper type]
StaticIndexArray
: Used internally byStaticTensorRef
andStaticTensorTileRef
it is a wrapper type around acamp::int_seq
that pretends to be an array containing its argument list as values. By using this type, functions originally meant to work withTensorRef
andTensorTileRef
(and which don't need to modify the underlying data) can treat their static alternatives identically. In cases where array data does need to be altered, helper types such asSetStaticIndexArray
allow for the creation of new specializations ofStaticIndexArray
with certain elements set to different values. - [specialization]
TensorIndexTraits<StaticTensorIndex<...>>
: Required for other code to treatStaticTensorIndex
as though it wereTensorIndex
- [specialization]
MergeRefTile<StaticTensorRef<...>,StaticTensorTile<...>,camp::int_seq<...>>
: Implements compile-time merging of refs and tiles only if both are static -
StaticTensorTileExec
: Unrolls the iteration that comes with executing a body upon a tile, used when the tile is aStaticTensorTile
- [specialization]
ViewReturnHelper
: Peforms indexing of views withStaticLayoutBase_impl
as a layout type andStaticTensorIndex
as the index type at compile time
Testing
These changes have been tested - and completely passed - for the following platforms:
- toss3(quartz) : gcc, clang, icpc
- blueos(lassen): nvcc_gcc, nvcc_xl
Performance
Preliminary testing has shown performance approaching 0.70 to 0.85 times the performance achieved with intrinsics (as opposed to the ~0.5 times performance seen without the use of these static facilities).
It seems likely that further development could produce additional benefits. This is because the 0.85-times performance corresponds to the use of static types for all layouts and indexes except for the layout of the assigned view in an expression template. Making this last layout static decreases the performance to 0.70 times the intrinsic performance. However, making any other value non-static provides even worse performance - though nothing worse than the base non-static performance.
This seems to indicate some optimization is yet required. Nonetheless - even without this optimization - a 1.4-to-1.7 times speedup is an improvement.
Design review
Most of the types added are for internal use, as they mirror the functionality of internal types themselves. However, the StaticTensorIndex
template struct should be scrutinized for usability. Currently, to aid in the construction of StaticTensorIndex
specializations, the static template methods static_range
and static_all
have been added to the TensorIndex
type. These template methods mirror the usage of the range
and all
methods, but their inputs are provided as template arguments.
An alternative implementation would be to make range
and all
template methods, give them normal behavior when template arguments are not provided, and have them act like static_range
and static_all
if they are. While this is feasible with the use of arrow notation to change the return type based off of the specialization of the method, this change may make debugging more difficult for developers if the function is not used properly. Particularly, making these methods templates increases the likelihood developers would run into unfriendly mountains of template error messages if they do not use such methods properly.
I'll try to give this a more in-depth review, but as a start we definitely need some tests for this code. You mention that it has been tested, but it doesn't seem to be hooked into any of the raja tests yet.
sigh sorry about that, misclick 🤦
I'll try to give this a more in-depth review, but as a start we definitely need some tests for this code. You mention that it has been tested, but it doesn't seem to be hooked into any of the raja tests yet.
@braxtoncuneo and I will go over these changes at the next RAJA meeting 7/5. Hopefully you and @ajkunen can make it, and give us some feedback.
We still need to add tests, but Braxton had checked that the changes do not break existing tests.
That sounds great, I intend to be around for that meeting. The main question I’ll be thinking about when looking at this is how much of this we can fold out so users get this functionality automatically from the existing types when specifying static bounds. I don’t think we’ll get all of it, so the added types are probably a good idea, but I think some of the duplication might be able to fold out with some judicious use of C++14-extended constexpr.
This latest push rebases off of the most recent version of develop
and folds some minor commits into larger ones.
This latest push rebases off of the most recent version of
develop
and folds some minor commits into larger ones.
@braxtoncuneo Is it safe for me to make more test modifications to this branch now?
@ajkunen @trws Existing tests have been updated to use the new StaticLayout with vectorization. Please take a look at this PR when you have a chance.
@ajkunen @trws @davidbeckingsale Would you mind taking another look at this? Thanks!
This looks OK to me. I would like to get one or two more approvals before merging. Please take a look and approve or comment on changes you think should be made. Thank you.
@MrBurmark @ajkunen @trws do you want to take another look before merging this?