VectorRegister load_packed Only Loading First Value
The VectorRegister implementations of load_packed and load_packed_n load only the first value of the array. Surprisingly, assignment with the = operator works as expected and performs the load and store properly. This matters because load_packed is required for sum/dot/min/max operations on a VectorRegister.
Does not work:
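// Assumes N == 4, b filled with 5.0, and c zero-initialized (inferred from the output below).
using vec_b = RAJA::expt::VectorRegister<double, RAJA::expt::scalar_register>;
vec_b vec_b_test;
// Load N contiguous values from b into the register, then store them to c.
vec_b_test.load_packed_n(&b[0], N);
vec_b_test.store_packed_n(&c[0], N);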
Bad output:
b: 5 5 5 5
c: 5 0 0 0
Works:
using vec_b = RAJA::expt::VectorRegister<double, RAJA::expt::scalar_register>;
using idx_b = RAJA::expt::VectorIndex<int, vec_b>;
using VecB = RAJA::View<RAJA::Real_type,RAJA::StaticLayout<RAJA::PERM_I,N>>;
auto bV = VecB(b);
auto cV = VecB(v_c);
auto vec_all = idx_b::static_all();
cV(vec_all) = bV(vec_all);
Good output:
b: 5 5 5 5
c: 5 5 5 5
Standard Layout with the assignment operator also works.
This behavior is exactly what RAJA::expt::scalar_register is expected to do: because it is a scalar register, it loads only the first value of the array into the VectorRegister. At the time, we were using scalar_register on a BlueOS system, where we should have tried cuda_warp_register instead. The more typical usage is on a TOSS system with one of the AVX register types, which correctly loads all elements of the array into the vector register.
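To make the width argument concrete, here is a minimal standalone sketch. It does not use RAJA and is not RAJA's implementation: ToyRegister, its Width parameter, and copy_through_register are hypothetical names invented for illustration, and the clamping of the element count to the register width is an assumption about how a fixed-width register would behave.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy stand-in for a register with a fixed number of lanes.
// Hypothetical type for illustration only; this is NOT RAJA code.
template <typename T, std::size_t Width>
struct ToyRegister {
  std::array<T, Width> lanes{};  // lanes start zero-initialized

  // Load at most Width contiguous elements; elements beyond Width are ignored.
  void load_packed_n(const T* src, std::size_t n) {
    assert(src != nullptr);
    const std::size_t count = n < Width ? n : Width;
    for (std::size_t i = 0; i < count; ++i) lanes[i] = src[i];
  }

  // Store at most Width contiguous elements back to memory.
  void store_packed_n(T* dst, std::size_t n) const {
    assert(dst != nullptr);
    const std::size_t count = n < Width ? n : Width;
    for (std::size_t i = 0; i < count; ++i) dst[i] = lanes[i];
  }
};

// Copy b into a zero-filled array through a single register of the given width.
template <std::size_t Width, std::size_t N>
std::array<double, N> copy_through_register(const std::array<double, N>& b) {
  std::array<double, N> c{};  // zero-filled destination
  ToyRegister<double, Width> reg;
  reg.load_packed_n(b.data(), N);
  reg.store_packed_n(c.data(), N);
  return c;
}
```

In this toy model, copy_through_register<1> applied to {5, 5, 5, 5} gives 5 0 0 0, the same shape as the bad output above, while copy_through_register<4> gives 5 5 5 5. A one-lane register is standing in for scalar_register here and a four-lane register for an AVX-style type holding four doubles.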
We should document the behavior of scalar_register more clearly and then close this issue.