fbarchard

Results 83 comments of fbarchard

The ADDTO macro seems like it would work OK on Intel, but the above code is smaller/faster on x86 as well. So I'd suggest removing that macro and using C_ADD,...

Benchmark of an end-to-end application, where kiss fft is about 60% of the profile:
32 bit Cortex A53
Original 53.3 us
4 loads 45.8 us
neon 41.9 us...

There are 2 cases to test performance on: 1. On CPUs and data where prefetch is a win. In general, larger data, older CPUs, and more complex functions benefit the...

Here's an assembly function where I add a proposed WASM prefetch instruction. I picked a place in the inner loop where the instruction should issue for free without stalls and...

In libyuv I inserted 223 prfm instructions into all aarch64 functions, tried versions with each offset, and then ran them on all 64 bit CPUs. 448 was the fastest overall. For...
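To make the idea of a fixed prefetch offset concrete, here is a minimal C sketch. It is not libyuv code: the function name `sum_row_prefetch`, the 64-byte step, and the use of the GCC/Clang builtin `__builtin_prefetch` (in place of a hand-written prfm) are assumptions for illustration. Only the 448 figure comes from the comment above.

```c
#include <stddef.h>
#include <stdint.h>

/* Prefetch distance in bytes; 448 is the value quoted in the comment above. */
enum { PREFETCH_AHEAD_BYTES = 448 };

/* Hypothetical example kernel: sums a row of bytes, issuing one prefetch
 * hint per 64-byte step (64 is an assumed cache-line size). */
uint32_t sum_row_prefetch(const uint8_t *src, size_t width) {
  uint32_t sum = 0;
  size_t x = 0;
  for (; x + 64 <= width; x += 64) {
    /* Form the address as an integer so we never create an out-of-bounds
     * pointer with pointer arithmetic; the prefetch is only a hint and may
     * target memory past the end of the row. */
    uintptr_t ahead = (uintptr_t)(src + x) + PREFETCH_AHEAD_BYTES;
    __builtin_prefetch((const void *)ahead, 0 /* read */, 0 /* low locality */);
    for (size_t i = 0; i < 64; ++i) {
      sum += src[x + i];
    }
  }
  /* Remainder that does not fill a whole 64-byte step. */
  for (; x < width; ++x) {
    sum += src[x];
  }
  return sum;
}
```

The result is the same with or without the hint. Only timing changes, so the offset has to be chosen by measuring on each target CPU.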

To work with FIXED_POINT==32 as well as FIXED_POINT==16 this is the change ``` ==== kissfft/_kiss_fft_guts.h#1 - kissfft/_kiss_fft_guts.h ==== 74c74 < (x) = sround( smul( x, SAMP_MAX/k ) ) --- >...

I've narrowed it down to the params pointer. Here is a godbolt reproducer: https://godbolt.org/z/aY4cx8vv3 The function is to 2 channels. The reason that matters is vmultiplier will have the wrong...

The test can be simplified to: https://godbolt.org/z/Yz44hncq7
```
#include <arm_neon.h>
struct f16_params {
  __attribute__((__aligned__(16))) __fp16 multiplier;
};
float16x4_t f16_dup(const struct f16_params params[static 1]) {
  return vld1_dup_f16(&params->multiplier);
}
```
Which produces: ``` vld2.16...

godbolt with clang trunk is doing the correct sized dup now:
```
vldr d16, [r0]
vdup.16 d0, d16[0]
```
so the fix to the dup works, but it would be...

Yes, this is a small change with no impact on quality. I only did bfly4... it should be done for bfly2 and the others. And then we can remove C_ADDTO...
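For readers who don't have the kiss fft macros in front of them, the sketch below shows why C_ADDTO can be replaced by C_ADD without changing results. The macros here are simplified stand-ins (the overflow checks used in fixed-point builds are omitted), and the `cpx` type and the two helper functions are hypothetical names for illustration, not the actual bfly4 change.

```c
/* Simplified stand-in for the kiss fft complex type. */
typedef struct {
  float r;
  float i;
} cpx;

/* Three-operand add: res = a + b. */
#define C_ADD(res, a, b)         \
  do {                           \
    (res).r = (a).r + (b).r;     \
    (res).i = (a).i + (b).i;     \
  } while (0)

/* Accumulating add: res += a. */
#define C_ADDTO(res, a)          \
  do {                           \
    (res).r += (a).r;            \
    (res).i += (a).i;            \
  } while (0)

/* Copy first, then accumulate in place. */
cpx sum_with_addto(cpx a, cpx b) {
  cpx out = a;
  C_ADDTO(out, b);
  return out;
}

/* Write the sum straight into the destination. */
cpx sum_with_add(cpx a, cpx b) {
  cpx out;
  C_ADD(out, a, b);
  return out;
}
```

Both helpers return the same value. The three-operand form skips the separate copy into the destination, so whether it ends up smaller or faster on a given target is something to measure, not assume.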