libsimdpp

[Load/Store] Very slow

Open ThomasRetornaz opened this issue 6 years ago • 2 comments

Hi. While benchmarking the std-like algorithms (see #107 and #115), I ran into surprising behavior: the std::transform algorithm is faster than its SIMD counterpart. "WTF", I said at my desk. I profiled the code, and the suspects are the generalized load/store functions. I wrote a dedicated benchmark suite for them and, yes, load/store seems to be slow.

Setup: AVX2 architecture, g++ 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.9).

I compared a plain C++ for loop with a SIMD-style for loop:

    for (size_t i = 0; i < size; ++i) {
        *ptrout++ = *ptrin++;
    }

versus

    for (size_t i = 0; i < size; i += simd_size) {
        simd_type_T element = simdpp::load(ptrin);
        simdpp::store(ptrout, element);
        ptrin += simd_size;
        ptrout += simd_size;
    }

Results for uint8_t and various sizes:

    Benchmark                                                 Time       CPU  Iterations
    LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/64       72 ns     72 ns     9921739
    LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/128     146 ns    146 ns     4629647
    LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/256     288 ns    288 ns     2460696
    LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/512     573 ns    573 ns     1200729
    LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/1024   1116 ns   1116 ns      634524
    LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/64        30 ns     30 ns    23157949
    LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/128       56 ns     56 ns    12685794
    LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/256      104 ns    104 ns     6773528
    LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/512      205 ns    205 ns     3477179
    LoadStoreFixture<uint8_t>/UnaryUNINT8_STD_Test/1024     393 ns    393 ns     1775957

As you can see, loading/storing through SIMD is roughly 2.5 to 3 times slower than the basic for loop, and it penalizes all the std-like algorithms.
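To put a number on the gap per size, here is a minimal sketch that divides the SIMD timing by the STD timing for each row reported above. The names `Row`, `kRows`, `slowdown` and `print_slowdowns` are only illustrative and are not part of libsimdpp or the benchmark suite.

```cpp
#include <array>
#include <cassert>
#include <cstdio>

// One row per buffer size; timings (ns) are copied from the results above.
struct Row {
    int size;
    double simd_ns;  // UnaryUNINT8_SIMD_Test
    double std_ns;   // UnaryUNINT8_STD_Test
};

constexpr std::array<Row, 5> kRows = {{
    {64, 72, 30},
    {128, 146, 56},
    {256, 288, 104},
    {512, 573, 205},
    {1024, 1116, 393},
}};

// A ratio above 1 means the SIMD loop is slower than the scalar loop.
constexpr double slowdown(const Row& r) { return r.simd_ns / r.std_ns; }

// Print one ratio per buffer size.
void print_slowdowns() {
    for (const Row& r : kRows) {
        std::printf("size %4d: SIMD/STD = %.2f\n", r.size, slowdown(r));
    }
}
```

Calling `print_slowdowns()` from `main` gives ratios between 2.4 (size 64) and about 2.84 (size 1024), so the gap grows slightly with the buffer size.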

I reproduced this behavior with MSVC 2017.

Maybe I made a mistake with the compilation flags? How can I investigate further?
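One quick way to rule out the flags would be to compile a small check into the benchmark binary and print what the preprocessor actually sees. This is only a sketch: `build_summary` is a hypothetical helper, and it relies on the predefined macros `__OPTIMIZE__` (set by GCC and Clang when optimization is enabled, not by MSVC), `NDEBUG`, and `__AVX2__` (set by GCC/Clang with `-mavx2` and by MSVC with `/arch:AVX2`).

```cpp
#include <cassert>
#include <string>

// Summarise the build configuration as seen by the preprocessor.
// Hypothetical helper for illustration; not part of libsimdpp.
std::string build_summary() {
    std::string s;
#if defined(__OPTIMIZE__)
    s += "optimized=yes ";        // GCC/Clang with optimization enabled
#else
    s += "optimized=no-or-unknown ";  // -O0, or a compiler without this macro (e.g. MSVC)
#endif
#if defined(NDEBUG)
    s += "ndebug=yes ";           // assertions disabled, typical for release builds
#else
    s += "ndebug=no ";
#endif
#if defined(__AVX2__)
    s += "avx2=yes";              // compiler is allowed to emit AVX2 instructions
#else
    s += "avx2=no";
#endif
    return s;
}
```

If this reports `optimized=no-or-unknown` under GCC, the SIMD wrappers are probably not being inlined, which alone could explain a gap of this size; that is a guess to verify, not a diagnosis.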

Regards TR

ThomasRetornaz, Jul 20 '18 13:07

Hey, thanks for the bug report. Could you share a minimal reproduction case? I haven't seen any performance issues on GCC, so I'm really curious why you get poor performance.

p12tic, Jul 20 '18 15:07

Hi

I hope the problem is between my chair and my keyboard. But anyway, if you have a little time to investigate this with me, I would really appreciate it. I think the simplest way is to use the benchmark I added on my fork, so clone https://github.com/ThomasRetornaz/libsimdpp.git

  • Enable the benchmarks (from a build directory somewhere): cmake -DENABLE_BENCH=ON path_to_src. This will download and compile googlebenchmark.

  • Compile one of the targets in (build_directory)/bench/insn, let's say: make bench_insn_-x86_avx

  • You can launch the whole benchmark suite or select only the load/store tests:

  • (build_directory)/bench/insn$ ./bench_insn_-x86_avx --benchmark_filter=.*LoadStore.*

  • The code involved is here: https://github.com/ThomasRetornaz/libsimdpp/blob/dev/bench/insn/load_store.cc
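For what the `--benchmark_filter=.*LoadStore.*` option in the steps above selects, here is a rough sketch using `std::regex`. It assumes the filter is applied as a regular-expression search over the full benchmark name; the `filter_names` helper and the `TransformFixture` name are made up for illustration, and googlebenchmark's own regex backend may differ in details.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Return the benchmark names that a filter pattern would select,
// assuming regex-search semantics (hypothetical helper).
std::vector<std::string> filter_names(const std::vector<std::string>& names,
                                      const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> selected;
    for (const std::string& name : names) {
        if (std::regex_search(name, re)) {
            selected.push_back(name);
        }
    }
    return selected;
}
```

With the names `LoadStoreFixture<uint8_t>/UnaryUNINT8_SIMD_Test/64` and a hypothetical `TransformFixture<uint8_t>/Unary_Test/64`, the pattern `.*LoadStore.*` keeps only the first. On the command line it is safer to quote the pattern so the shell does not try to expand the `*` characters.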

Maybe I made a mistake in the CMake configuration and compilation flags here:

  • https://github.com/ThomasRetornaz/libsimdpp/blob/dev/bench/insn/CMakeLists.txt#L16

or misused googlebenchmark in some way.

Thanks for your help. Regards, TR

ThomasRetornaz, Jul 20 '18 17:07