std-simd
Missing documentation about how to use the library.
Hi there, I am an experienced C++ programmer, but I'm completely lost when it comes to SIMD operations. I have been trying your library for over a week now and I still cannot figure out how to make it more performant than the straightforward way.
In my particular case, I am trying to implement a SAXPY operation according to the BLAS standard using SIMD operations. My vectors are huge, and still the straightforward way is much faster. I appended the two examples with my performance measurements at the bottom.
The copy operations to the buffer array are the most time-consuming part. My suspicion is that these copies are not needed and that filling the native_simd<float> is supposed to happen in a much more implicit way. However, I haven't figured out how to do it yet, and searching the source code confuses me even more.
By the way, I already tried copying the values directly into the native_simd<float> object, with very similar results.
Could you please provide an example of how to properly feed data into a native_simd<float> vector? I would really appreciate it and will gladly contribute some examples on how to use the library, since I think documentation is the key to getting this neat library into the C++ standard library.
Obvious way
template<class numeric_t>
void normal_axpy(const int n, const numeric_t a,
                 const numeric_t* x, const int inc_x, numeric_t* y, const int inc_y) {
    for (auto i = 0, i_x = 0, i_y = 0; i < n; ++i, i_x += inc_x, i_y += inc_y) {
        y[i_y] = a * x[i_x] + y[i_y];
    }
}
Results
Results for 'Conventional'; Vector Dimension: 1000; Iterations: 10000
-- Min: 3 (3.00 µs)
-- Max: 45 (45.00 µs)
-- Average: 5.0503 (5.05 µs)
-- Median: 5 (5.00 µs)
-- Lower Quartile: 4 (4.00 µs)
-- Upper Quartile: 5 (5.00 µs)
Most probably wrong SIMD way
template<class numeric_t>
void axpy(const int n, const numeric_t a,
          const numeric_t* x, const int inc_x, numeric_t* y, const int inc_y) {
    // assumes #include <experimental/simd>, #include <algorithm>, and
    // using namespace std::experimental; (for native_simd and element_aligned)
    const size_t _chunk_size = native_simd<numeric_t>::size();
    native_simd<numeric_t> aa, xx, yy;
    size_t i_x = 0, i_y = 0, i_yr = 0;
    for (size_t j = 0; j < _chunk_size; ++j)
        aa[j] = a;
    numeric_t buffer[_chunk_size];
    for (size_t i = 0; i < size_t(n); i += _chunk_size) {
        const size_t num_elements = std::min(_chunk_size, size_t(n) - i);
        for (size_t j = 0; j < num_elements; ++j, i_x += inc_x)
            buffer[j] = x[i_x];
        xx.copy_from(buffer, element_aligned);
        for (size_t j = 0; j < num_elements; ++j, i_y += inc_y)
            buffer[j] = y[i_y];
        yy.copy_from(buffer, element_aligned);
        yy = aa * xx + yy;
        yy.copy_to(buffer, element_aligned);
        for (size_t j = 0; j < num_elements; ++j, i_yr += inc_y)
            y[i_yr] = buffer[j];
    }
}
Results
Results for 'SIMD'; Vector Dimension: 1000; Iterations: 10000
-- Min: 18 (18.00 µs)
-- Max: 118 (118.00 µs)
-- Average: 26.0297 (26.03 µs)
-- Median: 27 (27.00 µs)
-- Lower Quartile: 21 (21.00 µs)
-- Upper Quartile: 28 (28.00 µs)
The problem seems to be inc_x and inc_y. If these were known to be 1, then you could simply write:
using V = native_simd<numeric_t>;
for (int i = 0; i < n; i += V::size()) {
    V yv = a * V(x + i, element_aligned) + V(y + i, element_aligned);
    yv.copy_to(y + i, element_aligned);
}
This gives you two loads, one FMA, and one store instruction per iteration (the same as in the scalar case, except that in the SIMD case the loads, stores, and FMAs do way more work in one go). This perfectly fills up the execution ports on x86. But once the strides are larger than one, the CPU has to do 2*V::size() scalar load instructions and V::size() scalar store instructions per single SIMD FMA instruction. Consequently, the whole time is spent in loads and stores (and shuffles, to create SIMD registers out of the scalar loads), and the FMA unit is mostly idle. Worse, AVX or even AVX-512 FMAs will reduce the CPU clock and thus make the loads and stores even slower.
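For what it's worth, one way to express a strided load without the intermediate buffer is the generator constructor of simd. The sketch below (strided_axpy_chunk is just an illustrative helper name, and stdx is assumed to alias std::experimental) still compiles down to scalar loads and stores, so it does not fix the throughput problem described above; it only removes the explicit buffer copy:

#include <cstddef>
#include <experimental/simd>
namespace stdx = std::experimental;

// Processes exactly one full SIMD chunk; partial chunks at the end of the
// vectors would still need a scalar or masked epilogue.
template <class numeric_t>
void strided_axpy_chunk(const numeric_t a,
                        const numeric_t* x, const int inc_x,
                        numeric_t* y, const int inc_y) {
    using V = stdx::native_simd<numeric_t>;
    // generator constructor: element j is initialized with the j-th strided value
    const V xx([&](auto j) { return x[j * inc_x]; });
    const V yy([&](auto j) { return y[j * inc_y]; });
    const V result = a * xx + yy;
    for (std::size_t j = 0; j < V::size(); ++j)  // element-wise scatter back
        y[j * inc_y] = result[j];
}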
You could have a runtime condition on the strides to use vector loads and stores. If you often have strides of 1, this should improve your results. Otherwise, reorganize your matrices so that the stride is statically 1.
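For concreteness, here is a rough sketch of such a dispatch (untested; axpy_dispatch is just an illustrative name, stdx is assumed to alias std::experimental, and normal_axpy is the scalar routine from above, used as the fallback for non-unit strides):

#include <experimental/simd>
namespace stdx = std::experimental;

template <class numeric_t>
void axpy_dispatch(const int n, const numeric_t a,
                   const numeric_t* x, const int inc_x,
                   numeric_t* y, const int inc_y) {
    using V = stdx::native_simd<numeric_t>;
    if (inc_x == 1 && inc_y == 1) {
        int i = 0;
        // main loop: two contiguous vector loads, one FMA, one vector store
        for (; i + int(V::size()) <= n; i += int(V::size())) {
            V yv = a * V(x + i, stdx::element_aligned) + V(y + i, stdx::element_aligned);
            yv.copy_to(y + i, stdx::element_aligned);
        }
        for (; i < n; ++i)  // scalar tail for the last n % V::size() elements
            y[i] = a * x[i] + y[i];
    } else {
        normal_axpy(n, a, x, inc_x, y, inc_y);  // strided fallback stays scalar
    }
}

The important part is that the common stride-1 case gets the vector loads and stores; what exactly the strided fallback does matters much less, since it is bound by the scalar memory accesses either way.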