Arseny Kapoulkine
To add to the list, compare for inequality takes 3 instructions, but compare for equality takes 1. I've run into this in the context of LLVM strength-reducing one to the other: it...
FWIW from my point of view, this is a good addition. In native code [I am used to], the non-determinism *between* different compilers / architectures is a given - different...
> In my tests, FMA instructions are always slower under x64. This obviously depends on the workload. In (some) floating point heavy code that I see, FMA results in performance...
Here's a motivating example from my experiments (caveats apply, ymmv, etc.): This is an implementation of a slerp optimization from https://zeux.io/2015/07/23/approximating-slerp/ + https://zeux.io/2016/05/05/optimizing-slerp/ for fitted nlerp (the middle ground...
> Beware that your baseline is in SSE while fast-math+fma is in AVX2. Thanks! Good catch, I forgot about this. I've updated the post to include fast-math avx2 (55.82 cycles)....
With fmadd having a latency of 5 on older CPUs, you still save latency, because the mul+add pair has a combined latency of 6 cycles due to the dependency. You...
(sorry, I know you're all tired of me and my "constants are awesome" comments) fabsf/fneg are a single instruction (two uops) with a memory load. You only need a single constant...
If you don't like constants, alternative lowering options: fneg: xorps (to produce zero) + subps; 2 insns if you can't find a ready-to-use zero register, 1 if you...
FWIW something along the lines of i8x16.byteshuffle is a feature I'd love to see in wasm. As mentioned in the OP, it is supported by SSSE3 on the x86 side via...
In the linked PR, @tlively and @dtig requested more precise benchmark numbers that are more specific to byte shuffles. To that end, I've implemented an SSE2 fallback for the decoder...