Nicholas Frechette comments

Results: 89 comments of Nicholas Frechette

Add XMVectorRound half away from zero alternative

This is also called symmetric rounding:

* 0.5 -> 1.0
* -0.5 -> -1.0
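As a minimal scalar sketch of the behavior described above (not the DirectXMath implementation; both helper names are hypothetical), the standard library already exposes the two tie-breaking rules. `std::round` rounds halfway cases away from zero, while `std::nearbyint` follows the current rounding mode, which is round-half-to-even by default.

```cpp
#include <cmath>

// Hypothetical helper: round half away from zero (symmetric rounding).
// std::round always rounds halfway cases away from zero, regardless of
// the current floating-point rounding mode.
inline float round_half_away_from_zero(float x)
{
    return std::round(x);
}

// Hypothetical helper, shown for contrast: round half to even.
// Assumes the default rounding mode (FE_TONEAREST) is in effect.
inline float round_half_to_even(float x)
{
    return std::nearbyint(x);
}
```

With these, `round_half_away_from_zero(0.5f)` gives `1.0f` and `round_half_away_from_zero(-0.5f)` gives `-1.0f`, whereas `round_half_to_even(0.5f)` gives `0.0f`.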

v128.const and v128.shuffle blow up code size

Shuffles are very common in realtime 3D applications. They show up everywhere, from regular 3x3, 3x4, and 4x4 matrix math to quaternion math. Over a whole application it might not contribute all that...

Add Quasi-Fused Multiply-Add/Subtract instructions

My experience runs contrary to yours, @zeux . In my tests, FMA instructions are always slower under x64. I did notice fast-math automatically generating them and it is one of...

Add Quasi-Fused Multiply-Add/Subtract instructions

FMA is definitely not free on Ryzen. The picture isn't clear-cut. For example, Ryzen executes up to 5 instructions per cycle. addps takes 1 op, has a latency of...

Inefficient x64 codegen for bitselect

Even if the instructions dispatch in the same cycle with and/andnot/or, they will not retire in the same cycle due to the dependencies. Depending on the surrounding code, it's possible...
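To make the dependency concrete, here is a minimal scalar sketch of the and/andnot/or sequence, assuming the usual bitselect semantics (take bits from `a` where the mask is 1, from `b` where it is 0). The function name and the 32-bit width are illustrative assumptions; real codegen operates on 128-bit registers.

```cpp
#include <cstdint>

// Hypothetical scalar model of bitselect built from and/andnot/or.
inline std::uint32_t bitselect(std::uint32_t a, std::uint32_t b, std::uint32_t mask)
{
    const std::uint32_t from_a = a & mask;   // and: independent of from_b
    const std::uint32_t from_b = ~mask & b;  // andnot: independent of from_a
    return from_a | from_b;                  // or: must wait for both results
}
```

The first two operations do not depend on each other, but the final `or` consumes both of their results, so the sequence cannot complete in a single step.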

Inefficient x64 codegen for 8x16 shifts

Shifts are commonly used with fixed-point arithmetic, but they aren't as common on 8-bit values (16/32/64-bit being the most common). I also imagine that it might be...
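For context on why these shifts need extra instructions: SSE2 has no shift that operates on 8-bit lanes, so one common emulation shifts wider lanes and then masks off the bits that crossed a lane boundary. The sketch below models that idea on two 8-bit lanes packed into a 16-bit word. It is an illustration under that assumption, with a hypothetical function name, and not the code any particular engine emits.

```cpp
#include <cstdint>

// Hypothetical scalar model: shift two packed 8-bit lanes left by n
// using a single 16-bit shift followed by a mask.
inline std::uint16_t shl_2x8(std::uint16_t packed, unsigned n)
{
    n &= 7; // shift counts are taken modulo the 8-bit lane width

    // Shift the whole 16-bit word; bits from the low lane leak into the high lane.
    const std::uint16_t shifted = static_cast<std::uint16_t>(packed << n);

    // Per-lane mask that clears the low n bits of each lane, i.e. the
    // positions that received bits from the neighboring lane.
    const std::uint16_t lane_mask = static_cast<std::uint16_t>((0xFFu << n) & 0xFFu);
    const std::uint16_t mask = static_cast<std::uint16_t>(lane_mask | (lane_mask << 8));

    return static_cast<std::uint16_t>(shifted & mask);
}
```

The mask is a constant when the shift amount is known at compile time, which is where the cost of loading constants comes into play.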

Inefficient x64 codegen for 8x16 shifts

I agree with @zeux that loading constants with fewer instructions is generally the way to go. It improves the code density and by reducing the number of instructions and registers...

Documenting performance tradeoffs

Great stuff! I will definitely keep it open in a tab as a reference when porting my code. It is worth noting that on ARMv7 and ARM64, it...

Documenting performance tradeoffs

@dtig As the discussion around shuffles has shown, some mappings will have poor or unexpected performance on some platforms and this isn't easily avoided without more platform-specific intrinsics (which...

Implement spline key reduction algorithm

Cody Jones has begun this work [here](https://github.com/CodyDWJones/acl).