quick-bench-back-end

target sse4.1

Open xoorath opened this issue 7 years ago • 6 comments

Hey,

It would be really handy for me if this could support SIMD instruction set extensions such as AVX, and specifically SSE4.1. I'm hoping to benchmark simple SIMD math techniques so I can make informed decisions about SIMD performance.

Here's a little test I did to see if quick-bench would help me do what I'm trying to do:

#include <x86intrin.h>

static void DPPS(benchmark::State& state) {
  __m128 left, right;
  left = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
  right = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
  for (auto _ : state) {
    __m128 dotted = _mm_dp_ps(left, right, 0xff);
    
    benchmark::DoNotOptimize(dotted);  
  }
  benchmark::DoNotOptimize(left);
  benchmark::DoNotOptimize(right);
}
// Register the function as a benchmark
BENCHMARK(DPPS);

static void MULHADD(benchmark::State& state) {
  __m128 left, right;
  left = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
  right = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
  for (auto _ : state) {
    __m128 dotted = _mm_mul_ps(left, right);
    dotted = _mm_hadd_ps(dotted, dotted);
    dotted = _mm_hadd_ps(dotted, dotted);
    
    benchmark::DoNotOptimize(dotted);  
  }
  benchmark::DoNotOptimize(left);
  benchmark::DoNotOptimize(right);
}
BENCHMARK(MULHADD);

The errors generated:

Error or timeout
bench-file.cpp:9:21: error: '__builtin_ia32_dpps' needs target feature sse4.1
    __m128 dotted = _mm_dp_ps(left, right, 0xff);
                    ^
/usr/lib/clang/5.0.0/include/smmintrin.h:620:12: note: expanded from macro '_mm_dp_ps'
  (__m128) __builtin_ia32_dpps((__v4sf)(__m128)(X), \
           ^
bench-file.cpp:26:14: error: always_inline function '_mm_hadd_ps' requires target feature 'sse3', but would be inlined into function 'MULHADD' that is compiled without support for 'sse3'
    dotted = _mm_hadd_ps(dotted, dotted);
             ^
bench-file.cpp:27:14: error: always_inline function '_mm_hadd_ps' requires target feature 'sse3', but would be inlined into function 'MULHADD' that is compiled without support for 'sse3'
    dotted = _mm_hadd_ps(dotted, dotted);
             ^
3 errors generated.

Cheers

xoorath avatar Jan 08 '18 23:01 xoorath
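
Both errors above mean the same thing: the translation unit is compiled without the SSE3/SSE4.1 target features, so the compiler refuses to emit those instructions. As a rough sketch, assuming GCC or Clang on x86, the features can be enabled for individual functions with the `target` function attribute instead of a global flag. The helper names `dot_dpps` and `dot_mul_hadd` are made up for illustration, and whether the Quick Bench machines can actually execute these instructions is a separate question.

```cpp
#include <immintrin.h>  // SSE3 / SSE4.1 intrinsics on GCC and Clang

// Assumption: GCC or Clang targeting x86. The attribute enables SSE4.1
// code generation for this one function only.
__attribute__((target("sse4.1")))
float dot_dpps(const float* a, const float* b) {
  __m128 left = _mm_loadu_ps(a);
  __m128 right = _mm_loadu_ps(b);
  // Mask 0xff: multiply all four lanes and broadcast the sum to every lane.
  return _mm_cvtss_f32(_mm_dp_ps(left, right, 0xff));
}

// Same dot product using a multiply followed by two horizontal adds (SSE3).
__attribute__((target("sse3")))
float dot_mul_hadd(const float* a, const float* b) {
  __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
  prod = _mm_hadd_ps(prod, prod);  // pairwise sums
  prod = _mm_hadd_ps(prod, prod);  // full sum in every lane
  return _mm_cvtss_f32(prod);
}
```

This only makes the code compile. Calling either helper on a CPU that lacks the instruction would still fault at run time, so a runtime feature check before the call is advisable.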

Hi, Quick Bench runs on AWS, which doesn't guarantee any particular architecture or CPU for the kind of machines the project can afford. It is therefore not possible to target a given architecture. Cheers!

FredTingaud avatar Jan 16 '18 16:01 FredTingaud

Are you sure that's true, @FredTingaud? What's the instance type?

I'm looking at the link mentioned in this excerpt...

Amazon EC2 instances run on 64-bit virtual Intel processors as specified in the instance type product pages. For more information about the hardware specifications for each Amazon EC2 instance type, see Amazon EC2 Instance Types.

*source

It looks like specific chipsets are used for given instance types. Considering SSE4 was introduced just over 10 years ago, I'd be surprised to see 4.1 not supported on your particular instance...

Maybe I'm not seeing the bigger picture here, I'm no AWS expert after all... but if it is guaranteed to support some vector extension set I think it would be really valuable to support benchmarking vectorized code. It's a very common and very misunderstood optimization technique, after all.

xoorath avatar Jan 16 '18 20:01 xoorath
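
One way to settle which extensions a given machine really supports is to ask the CPU at run time. The sketch below assumes GCC or Clang on x86 and uses the compiler builtin `__builtin_cpu_supports`. The function name `supported_features` is hypothetical, and nothing here is part of Quick Bench or Google Benchmark.

```cpp
#include <string>

// Assumption: GCC or Clang on x86. Returns a space-separated list of the
// checked extensions that the CPU executing this code reports as available.
std::string supported_features() {
  std::string out;
  // __builtin_cpu_supports requires a string literal, so each feature is
  // queried explicitly rather than in a loop.
  if (__builtin_cpu_supports("sse2"))   out += "sse2 ";
  if (__builtin_cpu_supports("sse3"))   out += "sse3 ";
  if (__builtin_cpu_supports("sse4.1")) out += "sse4.1 ";
  if (__builtin_cpu_supports("popcnt")) out += "popcnt ";
  if (__builtin_cpu_supports("avx"))    out += "avx ";
  return out;
}
```

Printing the result from inside a benchmark run would show what the host reports, although a fleet with mixed instance types could give different answers from one run to the next.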

I'm reopening the issue. I'll look into it more.

FredTingaud avatar Jan 16 '18 20:01 FredTingaud

Thanks @FredTingaud.

Just taking a little dive into it myself, it seems that if you're on a T2 and not a compute instance, you won't know for sure what chipset you're on... But I'm not sure what a "64-bit virtual Intel processor" even means, much less what limitations or allowances it creates.

Looking forward to hearing what you find on your end. Thanks for looking into this.

xoorath avatar Jan 16 '18 20:01 xoorath

@FredTingaud You can try adding "-march=native" to the compiler options.

ZongyiZhou avatar Jun 12 '20 18:06 ZongyiZhou
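
With GCC and Clang, `-march=native` enables every extension that the machine running the compiler supports, so the resulting binary may fault if it is later run on a different, older CPU. As a small sketch, the predefined feature macros show what a given set of flags actually enabled. The function name `compile_time_features` is an illustrative assumption.

```cpp
#include <string>

// Reports which SIMD-related feature macros the compiler predefined for this
// build. The list changes with flags such as -msse4.1 or -march=native.
std::string compile_time_features() {
  std::string out;
#ifdef __SSE2__
  out += "sse2 ";
#endif
#ifdef __SSE3__
  out += "sse3 ";
#endif
#ifdef __SSE4_1__
  out += "sse4.1 ";
#endif
#ifdef __POPCNT__
  out += "popcnt ";
#endif
#ifdef __AVX__
  out += "avx ";
#endif
  return out;
}
```

Building this once with default flags and once with `-march=native` and comparing the two strings makes the effect of the option visible.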

Running into this again 4 years later, so I'm back to +1 my own issue. :)

This time I'm trying to benchmark __popcnt against other methods of counting bits in an integer.

xoorath avatar Feb 01 '22 18:02 xoorath
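
For reference, a bit-counting comparison of that kind might look roughly like the sketch below. It assumes GCC or Clang, where `__builtin_popcount` takes the place of MSVC's `__popcnt`. The builtin only compiles to the POPCNT instruction when the target enables it (for example with `-mpopcnt`); otherwise the compiler falls back to a software routine, which is the same target-feature limitation discussed in this issue. The function names are made up for illustration.

```cpp
#include <cstdint>

// Compiler builtin: becomes a single POPCNT instruction only if the target
// feature is enabled, otherwise a software fallback.
inline int popcount_builtin(std::uint32_t x) {
  return __builtin_popcount(x);
}

// Kernighan's method: each iteration clears the lowest set bit.
inline int popcount_kernighan(std::uint32_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;
    ++n;
  }
  return n;
}

// SWAR (parallel bit summing): adds bits in 2-, 4-, then 8-bit groups.
inline int popcount_swar(std::uint32_t x) {
  x = x - ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x + (x >> 4)) & 0x0f0f0f0fu;
  // The multiply accumulates the four byte counts into the top byte.
  return static_cast<int>((x * 0x01010101u) >> 24);
}
```

Each function could then be wrapped in its own `BENCHMARK`, in the same way as the `DPPS` and `MULHADD` examples at the top of this issue.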