fbarchard comments

Results: 83 comments of fbarchard

bfly load/stores

The ADDTO macro seems like it would work OK on Intel, but the above code is smaller/faster on x86 as well. So I'd suggest removing that macro and using C_ADD,...

Neon

Benchmark of an end-to-end application, where kiss fft is about 60% of the profile. 32 bit Cortex A53: Original 53.3 us, 4 loads 45.8 us, neon 41.9 us...

prefetch instruction

There are 2 cases to test performance on: 1. cpus and data where prefetch is a win. In general, larger data, older cpus, and more complex functions benefit the...

prefetch instruction

Here's an assembly function where I add a proposed WASM prefetch instruction. I picked a place in the inner loop where the instruction should issue for free without stalls and...

prefetch instruction

In libyuv I inserted 223 prfm instructions into all aarch64 functions, tried versions with each offset, and then ran on all 64 bit cpus. 448 was the fastest overall. For...
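
As a rough illustration of prefetching a fixed distance ahead of the read pointer, here is a minimal C sketch. It is not libyuv code: `average_rows`, `prefetch_ahead`, and the one-hint-per-64-bytes cadence are hypothetical, and only the 448 byte offset comes from the comment above. On aarch64, GCC and clang typically lower a read `__builtin_prefetch` to a `prfm` instruction, but that is compiler behavior to verify, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes ahead of the current read position to prefetch. 448 is the offset
 * the comment reports as fastest overall; treat it as a tunable. */
#define PREFETCH_OFFSET 448

/* Hypothetical helper: hint that the byte at `p + offset` will be read soon.
 * The address is formed with integer arithmetic so no out-of-bounds pointer
 * arithmetic is performed; a prefetch hint itself does not fault. */
static inline void prefetch_ahead(const uint8_t *p, size_t offset) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch((const void *)((uintptr_t)p + offset), 0, 3);
#else
    (void)p;
    (void)offset;
#endif
}

/* Hypothetical row function: rounded average of two byte rows. */
void average_rows(const uint8_t *src_a, const uint8_t *src_b,
                  uint8_t *dst, size_t width) {
    assert(src_a != NULL && src_b != NULL && dst != NULL);
    for (size_t i = 0; i < width; ++i) {
        if ((i % 64) == 0) { /* one hint per 64 bytes, an assumed line size */
            prefetch_ahead(src_a + i, PREFETCH_OFFSET);
            prefetch_ahead(src_b + i, PREFETCH_OFFSET);
        }
        dst[i] = (uint8_t)((src_a[i] + src_b[i] + 1) >> 1);
    }
}
```

The prefetch only affects timing, never results, so the output can be checked against a plain loop while the offset is tuned per cpu.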

DIVSCALAR off by 1

To work with FIXED_POINT==32 as well as FIXED_POINT==16 this is the change ``` ==== kissfft/_kiss_fft_guts.h#1 - kissfft/_kiss_fft_guts.h ==== 74c74 < (x) = sround( smul( x, SAMP_MAX/k ) ) --- >...
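
To make the rounding behavior of the original (`<`) line concrete, here is a small C sketch for the 16-bit case. The replacement line is truncated in the preview above and is not reproduced here. `divscalar16` is a hypothetical function form of the macro, and the `FRACBITS`, `SAMP_MAX`, `smul`, and `sround` behavior is an assumption modeled on a `FIXED_POINT==16` build.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 16-bit fixed-point settings (Q15). */
#define FRACBITS 15
#define SAMP_MAX 32767

/* Hypothetical function form of the original macro line:
 *   (x) = sround( smul( x, SAMP_MAX/k ) )
 * Negative x relies on an arithmetic right shift, which is
 * implementation-defined in C. */
int16_t divscalar16(int16_t x, int k) {
    assert(k > 0 && k <= SAMP_MAX);
    /* smul: widen to 32 bits and multiply by the Q15 reciprocal of k */
    int32_t prod = (int32_t)x * (SAMP_MAX / k);
    /* sround: add half an LSB, then drop the fractional bits */
    return (int16_t)((prod + (1 << (FRACBITS - 1))) >> FRACBITS);
}
```

Under these assumptions `divscalar16(100, 2)` gives 50, while `divscalar16(32766, 2)` gives 16382 rather than 16383, because `SAMP_MAX/2` truncates to 16383 and so slightly underestimates one half in Q15.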

AArch32 FP16 neon average function produces incorrect result when optimized

I've narrowed it down to the params pointer. Here is a godbolt reproducer: https://godbolt.org/z/aY4cx8vv3 The function is to 2 channels. The reason that matters is vmultiplier will have the wrong...

AArch32 FP16 neon average function produces incorrect result when optimized

The test can be simplified to: https://godbolt.org/z/Yz44hncq7 ``` #include <arm_neon.h> struct f16_params{ __attribute__((__aligned__(16))) __fp16 multiplier; }; float16x4_t f16_dup(const struct f16_params params[static 1]) { return vld1_dup_f16(&params->multiplier); } ``` Which produces: ``` vld2.16...

AArch32 FP16 neon average function produces incorrect result when optimized

godbolt with clang trunk is doing the correct sized dup now: ``` vldr d16, [r0] vdup.16 d0, d16[0] ``` so the fix to the dup works, but it would be...

bfly load/stores

Yes, this is a small change with no impact on quality. I only did bfly4... it should be done for the bfly2 and others. And then we can remove C_ADDTO...
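
A minimal sketch of what the same load/compute/store pattern might look like for a radix-2 butterfly, assuming a float build. This is not the actual patch: `cpx` and `bfly2_local` are hypothetical, and the macros are only modeled on kissfft's `C_MUL`, `C_ADD`, and `C_SUB`. Both inputs are loaded into locals, combined with `C_ADD`/`C_SUB`, and each output is stored once, so no in-place `C_ADDTO` is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the complex sample type (float build). */
typedef struct { float r, i; } cpx;

/* Macros modeled on kissfft's complex helpers. */
#define C_MUL(m, a, b) \
    do { (m).r = (a).r * (b).r - (a).i * (b).i; \
         (m).i = (a).r * (b).i + (a).i * (b).r; } while (0)
#define C_ADD(res, a, b) \
    do { (res).r = (a).r + (b).r; (res).i = (a).i + (b).i; } while (0)
#define C_SUB(res, a, b) \
    do { (res).r = (a).r - (b).r; (res).i = (a).i - (b).i; } while (0)

/* Radix-2 butterfly over m pairs: Fout[k] and Fout[k + m]. */
void bfly2_local(cpx *Fout, const cpx *twiddles, size_t fstride, size_t m) {
    assert(Fout != NULL && twiddles != NULL && m > 0);
    cpx *Fout2 = Fout + m;
    for (size_t k = 0; k < m; ++k) {
        cpx a = Fout[k];            /* load both inputs into locals */
        cpx b = Fout2[k];
        cpx t, sum, diff;
        C_MUL(t, b, twiddles[k * fstride]);
        C_ADD(sum, a, t);           /* replaces C_ADDTO(*Fout, t) */
        C_SUB(diff, a, t);
        Fout[k] = sum;              /* one store per output */
        Fout2[k] = diff;
    }
}
```

Whether this form is smaller or faster on a given target would need to be measured; the sketch only shows the shape of the change.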