quick-bench-back-end
target sse4.1
Hey,
It would be really handy for me if this could support SIMD extensions such as AVX, and specifically SSE4.1. I'm hoping to benchmark simple SIMD math techniques so I can make informed decisions about SIMD performance.
Here's a little test I did to see if quick-bench would help me do what I'm trying to do:
#include <x86intrin.h>
static void DPPS(benchmark::State& state) {
  __m128 left, right;
  left = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
  right = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
  for (auto _ : state) {
    __m128 dotted = _mm_dp_ps(left, right, 0xff);
    benchmark::DoNotOptimize(dotted);
  }
  benchmark::DoNotOptimize(left);
  benchmark::DoNotOptimize(right);
}
// Register the function as a benchmark
BENCHMARK(DPPS);
static void MULHADD(benchmark::State& state) {
  __m128 left, right;
  left = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
  right = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
  for (auto _ : state) {
    __m128 dotted = _mm_mul_ps(left, right);
    dotted = _mm_hadd_ps(dotted, dotted);
    dotted = _mm_hadd_ps(dotted, dotted);
    benchmark::DoNotOptimize(dotted);
  }
  benchmark::DoNotOptimize(left);
  benchmark::DoNotOptimize(right);
}
BENCHMARK(MULHADD);
The errors generated:
Error or timeout
bench-file.cpp:9:21: error: '__builtin_ia32_dpps' needs target feature sse4.1
    __m128 dotted = _mm_dp_ps(left, right, 0xff);
                    ^
/usr/lib/clang/5.0.0/include/smmintrin.h:620:12: note: expanded from macro '_mm_dp_ps'
  (__m128) __builtin_ia32_dpps((__v4sf)(__m128)(X), \
           ^
bench-file.cpp:26:14: error: always_inline function '_mm_hadd_ps' requires target feature 'sse3', but would be inlined into function 'MULHADD' that is compiled without support for 'sse3'
    dotted = _mm_hadd_ps(dotted, dotted);
             ^
bench-file.cpp:27:14: error: always_inline function '_mm_hadd_ps' requires target feature 'sse3', but would be inlined into function 'MULHADD' that is compiled without support for 'sse3'
    dotted = _mm_hadd_ps(dotted, dotted);
             ^
3 errors generated.
Cheers
Hi, Quick Bench runs on AWS, which doesn't guarantee any particular architecture or CPU for the kind of machines the project can afford. Thus it is not possible to target a given architecture. Cheers!
Are you sure that's true, @FredTingaud? What's the instance type?
I'm looking at the link mentioned in this excerpt...
Amazon EC2 instances run on 64-bit virtual Intel processors as specified in the instance type product pages. For more information about the hardware specifications for each Amazon EC2 instance type, see Amazon EC2 Instance Types.
It looks like specific chipsets are used for given instance types. Considering SSE4 was introduced just over 10 years ago, I'd be surprised to see 4.1 not supported on your particular instance...
Maybe I'm not seeing the bigger picture here, I'm no AWS expert after all... but if it is guaranteed to support some vector extension set I think it would be really valuable to support benchmarking vectorized code. It's a very common and very misunderstood optimization technique, after all.
I'm reopening the issue. I'll look into it more.
Thanks @FredTingaud.
Taking a little dive into it myself, it seems that if you're on a T2 and not a compute instance, you won't know for sure what chipset you're on. But I'm not sure what a "64-bit virtual Intel processor" even means, much less what limitations or allowances it creates.
Looking forward to hearing what you find on your end. Thanks for looking into this.
@FredTingaud You can try adding "-march=native" to the compiler options.
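If changing the global compiler options turns out not to be an option, GCC and Clang also accept a per-function target attribute, which addresses the "needs target feature" errors above for just the functions that use the intrinsics. The following is a minimal sketch, assuming an x86-64 host and GCC or Clang. The names dot_dpps, dot_mul_hadd and cpu_has_sse41 are made up for illustration, and nothing in this thread establishes that the Quick Bench machines actually support SSE4.1, hence the runtime check.

```cpp
#include <immintrin.h>  // x86 SIMD intrinsics (GCC/Clang)

// Hypothetical helper: build only this function with SSE4.1 enabled, so the
// rest of the file still compiles with the default target flags.
__attribute__((target("sse4.1")))
float dot_dpps(const float* a, const float* b) {
  __m128 left = _mm_loadu_ps(a);
  __m128 right = _mm_loadu_ps(b);
  // 0xff: multiply all four lanes and write the sum to every output lane.
  return _mm_cvtss_f32(_mm_dp_ps(left, right, 0xff));
}

// Hypothetical helper: same dot product with a multiply and two horizontal
// adds, which only needs SSE3.
__attribute__((target("sse3")))
float dot_mul_hadd(const float* a, const float* b) {
  __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
  prod = _mm_hadd_ps(prod, prod);
  prod = _mm_hadd_ps(prod, prod);
  return _mm_cvtss_f32(prod);
}

// Runtime guard: ask the CPU whether it reports SSE4.1 before calling
// dot_dpps, since the host CPU is not known ahead of time.
bool cpu_has_sse41() {
  return __builtin_cpu_supports("sse4.1");
}
```

In a benchmark, the body of the `for (auto _ : state)` loop would call one of these helpers and pass the result to benchmark::DoNotOptimize. Calling dot_dpps on a CPU without SSE4.1 would fault, so the guard should be checked first.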
Running into this again 4 years later, so I'm back to +1 my own issue. :)
This time I'm trying to benchmark __popcnt against other methods of counting bits in an integer.
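For reference, here is roughly the kind of comparison I mean. This is only a sketch, assuming GCC or Clang: __builtin_popcount stands in for __popcnt, and the function names are made up for illustration. Whether the builtin compiles to a hardware POPCNT instruction depends on the target flags, which is the same limitation as above.

```cpp
#include <cstdint>

// Portable baseline: clear the lowest set bit until none remain
// (Kernighan's method). Loop count equals the number of set bits.
inline int popcount_loop(std::uint32_t x) {
  int count = 0;
  while (x != 0) {
    x &= x - 1;  // drops the lowest set bit
    ++count;
  }
  return count;
}

// Branch-free bit trick: sum bits in 2-, 4-, then 8-bit groups, and use a
// multiply to add the four byte counts into the top byte.
inline int popcount_swar(std::uint32_t x) {
  x = x - ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x + (x >> 4)) & 0x0F0F0F0Fu;
  return static_cast<int>((x * 0x01010101u) >> 24);
}

// Compiler builtin (GCC/Clang). Without a flag such as -mpopcnt the compiler
// may emit a software fallback rather than the POPCNT instruction.
inline int popcount_builtin(std::uint32_t x) {
  return __builtin_popcount(x);
}
```

Each of these would go inside its own `for (auto _ : state)` loop, with the result passed to benchmark::DoNotOptimize.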