libsimdpp
[Load/Store] Very slow
Hi,

While benchmarking the std-like algorithms (see #107 and #115), I ran into surprising behavior: the std::transform algorithm is faster than its SIMD counterpart, much to my surprise. I profiled the code, and the suspects are the generalized load/store functions. I wrote a dedicated benchmark suite for them, and yes, load/store seems to be slow.

Environment: AVX2 architecture, g++ 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.9).
I compare a plain C++ for loop with a SIMD-style for loop:

```cpp
for (size_t i = 0; i < size; ++i) { *ptrout++ = *ptrin++; }
```

Versus

```cpp
for (size_t i = 0; i < size; i += simd_size)
{
    simd_type_T element = simdpp::load(ptrin);
    simdpp::store(ptrout, element);
    ptrin += simd_size;
    ptrout += simd_size;
}
```

Results for uint8_t and various sizes:

```
Benchmark                                                 Time      CPU  Iterations
LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/64       72 ns    72 ns     9921739
LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/128     146 ns   146 ns     4629647
LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/256     288 ns   288 ns     2460696
LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/512     573 ns   573 ns     1200729
LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/1024   1116 ns  1116 ns      634524
LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/64        30 ns    30 ns    23157949
LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/128       56 ns    56 ns    12685794
LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/256      104 ns   104 ns     6773528
LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/512      205 ns   205 ns     3477179
LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/1024     393 ns   393 ns     1775957
```
As you can see, loading/storing through SIMD is roughly 2.5 to 3 times slower than the basic for loop (72 ns vs 30 ns at size 64, 1116 ns vs 393 ns at size 1024), and it penalizes all the std-like algorithms.

I reproduced this behavior with MSVC 2017.

Did I make a mistake with the compilation flags? How can I go further?
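One possible way to go further is a standalone sanity check that does not depend on libsimdpp or Google Benchmark at all. The sketch below is only an illustration under stated assumptions: `kChunk`, `copy_scalar`, `copy_chunked`, `is_aligned`, `time_ns` and `copies_match` are hypothetical names, and a `memcpy` of 32 bytes stands in for a load/store pair (it is not the libsimdpp implementation). It covers three things worth ruling out: that the size is a multiple of the chunk width, that the buffers meet the alignment an aligned vector load would presumably need, and that both loops really produce the same output.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the SIMD register width in bytes (32 for AVX2).
constexpr std::size_t kChunk = 32;

// Plain element-by-element copy, as in the first loop of the report.
// An optimizing compiler may auto-vectorize this or turn it into memcpy.
void copy_scalar(const std::uint8_t* in, std::uint8_t* out, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) out[i] = in[i];
}

// Chunked copy: a memcpy of kChunk bytes stands in for a load/store pair.
// This is NOT libsimdpp code. Returns false if size is not a multiple of kChunk.
bool copy_chunked(const std::uint8_t* in, std::uint8_t* out, std::size_t size) {
    if (size % kChunk != 0) return false;
    for (std::size_t i = 0; i < size; i += kChunk) {
        std::uint8_t reg[kChunk];
        std::memcpy(reg, in + i, kChunk);   // "load"
        std::memcpy(out + i, reg, kChunk);  // "store"
    }
    return true;
}

// True if p is aligned to kChunk bytes.
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kChunk == 0;
}

// Average wall-clock nanoseconds per call of f over reps repetitions.
template <class F>
double time_ns(F f, int reps) {
    assert(reps > 0);
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / reps;
}

// Checks that both copies reproduce the input for a given size.
bool copies_match(std::size_t size) {
    std::vector<std::uint8_t> in(size), a(size, 0), b(size, 0);
    for (std::size_t i = 0; i < size; ++i)
        in[i] = static_cast<std::uint8_t>(i * 7 + 1);
    copy_scalar(in.data(), a.data(), size);
    if (!copy_chunked(in.data(), b.data(), size)) return false;
    return a == in && b == in;
}
```

Calling `copies_match` for each benchmarked size, then wrapping each copy in `time_ns`, gives a rough baseline. If the scalar loop is still much faster here, the compiler has probably vectorized it already, and comparing the generated assembly of both loops would be the next step.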
Regards TR
Hey, thanks for the bug report. Could you share a minimal reproduction case? I haven't seen any performance issues on GCC, so I'm really curious why you get poor performance.
Hi,

I hope the problem is between my chair and my keyboard. Anyway, if you have a little time to investigate this with me, I would really appreciate it. I think the simplest approach is to use the benchmark I added on my fork:

- Clone https://github.com/ThomasRetornaz/libsimdpp.git
- Enable the benchmark (in a build directory somewhere): `cmake -DENABLE_BENCH=ON path_to_src`. This will download and compile googlebenchmark.
- Compile one possible target in (build_directory)/bench/insn, let's say `make bench_insn_-x86_avx`.
- You can launch the whole benchmark suite or select only the load/store tests: `(build_directory)/bench/insn$ ./bench_insn_-x86_avx --benchmark_filter=.*LoadStore.*`

The code involved is here: https://github.com/ThomasRetornaz/libsimdpp/blob/dev/bench/insn/load_store.cc

I may have made a mistake with CMake and the compilation flags here:

- https://github.com/ThomasRetornaz/libsimdpp/blob/dev/bench/insn/CMakeLists.txt#L16

or misused Google Benchmark in some way.

Thanks for your help.

Regards, TR