sse2neon
Add performance measurement of each intrinsic function
The conversion of an intrinsic function may be rewritten from time to time. A performance measurement for each intrinsic would make it easy to check whether a rewritten conversion actually performs better.
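A rough sketch of what such a measurement could look like, written in portable C so it does not depend on sse2neon itself. Everything here is an assumption made for illustration: `add4_current` and `add4_rewritten` are scalar stand-ins for the old and new conversion of one intrinsic, and `time_op` is a hypothetical helper, not an existing sse2neon API. It uses `clock()` from the C standard library, which reports CPU time at fairly coarse resolution, so a real benchmark would need many iterations and probably a better timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by both candidates: a 4-lane float
 * operation standing in for one converted intrinsic. */
typedef void (*lane_op_fn)(const float *a, const float *b, float *out);

/* Stand-in for the current conversion (element-wise add, loop form). */
static void add4_current(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}

/* Stand-in for a rewritten conversion (same result, unrolled form). */
static void add4_rewritten(const float *a, const float *b, float *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}

/* Calls fn `iterations` times and returns the elapsed CPU time in seconds.
 * The accumulated result is written to *sink so the compiler cannot
 * discard the calls as dead code. */
static double time_op(lane_op_fn fn, size_t iterations, float *sink)
{
    assert(fn != NULL && sink != NULL && iterations > 0);

    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {0.5f, 0.5f, 0.5f, 0.5f};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    volatile float acc = 0.0f;

    clock_t start = clock();
    for (size_t i = 0; i < iterations; i++) {
        a[0] = (float)(i & 7); /* vary the input between iterations */
        fn(a, b, out);
        acc += out[0];
    }
    clock_t end = clock();

    *sink = acc;
    return (double)(end - start) / (double)CLOCKS_PER_SEC;
}
```

Calling `time_op(add4_current, n, &sink)` and `time_op(add4_rewritten, n, &sink)` with the same `n` gives two numbers to compare. The candidates should also be checked for identical output before their timings are trusted.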
Somewhat related, but not exactly the same:
It would be nice to have a list of all the implemented intrinsics, along with how many NEON intrinsics each one needs. While this doesn't map 1:1 to how fast something is, it still gives the user an idea of what might be good to avoid.
In a similar vein, the Emscripten SIMD page (https://emscripten.org/docs/porting/simd.html) has an emoji for each instruction:
✅ Wasm SIMD has a native opcode that matches the x86 SSE instruction, should yield native performance
💡 while the Wasm SIMD spec does not provide a proper performance guarantee, given a suitably smart enough compiler and a runtime VM path, this intrinsic should be able to generate the identical native SSE instruction.
🟡 there is some information missing (e.g. type or alignment information) for a Wasm VM to be guaranteed to be able to reconstruct the intended x86 SSE opcode. This might cause a penalty depending on the target CPU hardware family, especially on older CPU generations.
⚠️ the underlying x86 SSE instruction is not available, but it is emulated via at most few other Wasm SIMD instructions, causing a small penalty.
❌ the underlying x86 SSE instruction is not exposed by the Wasm SIMD specification, so it must be emulated via a slow path, e.g. a sequence of several slower SIMD instructions, or a scalar implementation.
💣 the underlying x86 SSE opcode is not available in Wasm SIMD, and the implementation must resort to such a slow emulated path, that a workaround rethinking the algorithm at a higher level is advised.
💭 the given SSE intrinsic is available to let applications compile, but does nothing.
⚫ the given SSE intrinsic is not available. Referencing the intrinsic will cause a compiler error.
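To make the idea of a per-intrinsic list concrete, here is a minimal sketch of how such a table might be represented in C. All of it is hypothetical: the entry names (`_mm_example_a` and so on) and their counts are placeholders rather than real sse2neon data, and the thresholds in `rate_cost` are arbitrary choices loosely modeled on the emoji legend above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Coarse rating, loosely following the emoji legend above. */
typedef enum {
    COST_NATIVE,      /* maps to a single NEON intrinsic */
    COST_SMALL,       /* emulated with a few NEON intrinsics */
    COST_SLOW,        /* long sequence or scalar fallback */
    COST_NOOP,        /* compiles but does nothing */
    COST_UNAVAILABLE  /* not implemented */
} cost_rating;

typedef struct {
    const char *sse_name; /* name of the SSE intrinsic */
    int neon_count;       /* NEON intrinsics used; -1 if not implemented */
} intrinsic_cost;

/* Placeholder entries only; real counts would come from the header. */
static const intrinsic_cost cost_table[] = {
    {"_mm_example_a", 1},
    {"_mm_example_b", 3},
    {"_mm_example_c", 12},
    {"_mm_example_d", 0},
    {"_mm_example_e", -1},
};

/* Maps a NEON intrinsic count to a rating. The cut-off of 4 is an
 * assumption made for this sketch. */
static cost_rating rate_cost(int neon_count)
{
    if (neon_count < 0)
        return COST_UNAVAILABLE;
    if (neon_count == 0)
        return COST_NOOP;
    if (neon_count == 1)
        return COST_NATIVE;
    if (neon_count <= 4)
        return COST_SMALL;
    return COST_SLOW;
}

/* Linear lookup by name; returns NULL when the name is not listed. */
static const intrinsic_cost *find_cost(const char *sse_name)
{
    size_t n = sizeof(cost_table) / sizeof(cost_table[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(cost_table[i].sse_name, sse_name) == 0)
            return &cost_table[i];
    }
    return NULL;
}
```

As noted above, the count is only a hint and not a 1:1 measure of speed, so a table like this would complement actual timing results rather than replace them.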