simd
Branch of the spec repo scoped to discussion of SIMD in WebAssembly
```
#include
#define SIMDPP_ARCH_X86_SSE4_1 1
#include "simdpp/simd.h"
#include
#include
#include
#include
#include
#include
#include
#include
// SWIZZLE constants
const static __u32x4 sw1 __attribute__((require_constant_initialization)) = {0xffffff00, 0xffffff01, 0xffffff02, 0xffffff03};
const static...
```
Packed horizontal arithmetic is reasonably performant on SSE3+ and Neon. These would be useful for complex multiplications, and in the absence of the opcodes below, these would need to be...
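To make the complex-multiplication use case concrete, here is a minimal scalar sketch in C of how two interleaved complex products could be built from packed horizontal add and subtract. The helper names (`hadd_f32x4`, `hsub_f32x4`, `cmul_f32x2`) are hypothetical, and the lane order follows x86 `HADDPS`/`HSUBPS`, which is an assumption rather than something the proposal fixes.

```c
/* Scalar model of packed horizontal add, using the x86 HADDPS lane order
   (an assumption): out = {a0+a1, a2+a3, b0+b1, b2+b3}. */
static void hadd_f32x4(const float a[4], const float b[4], float out[4]) {
    float r[4] = { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    for (int i = 0; i < 4; i++) out[i] = r[i];
}

/* Scalar model of packed horizontal subtract (HSUBPS lane order, assumed):
   out = {a0-a1, a2-a3, b0-b1, b2-b3}. */
static void hsub_f32x4(const float a[4], const float b[4], float out[4]) {
    float r[4] = { a[0] - a[1], a[2] - a[3], b[0] - b[1], b[2] - b[3] };
    for (int i = 0; i < 4; i++) out[i] = r[i];
}

/* Multiply two pairs of complex numbers stored interleaved as
   {re0, im0, re1, im1}. Hypothetical helper, for illustration only. */
void cmul_f32x2(const float x[4], const float y[4], float out[4]) {
    float rr[4], ri[4], re[4], im[4];
    for (int i = 0; i < 4; i++) {
        rr[i] = x[i] * y[i];      /* {xr*yr, xi*yi, ...} */
        ri[i] = x[i] * y[i ^ 1];  /* {xr*yi, xi*yr, ...}: y with re/im swapped */
    }
    hsub_f32x4(rr, rr, re);       /* re[k] = xr*yr - xi*yi */
    hadd_f32x4(ri, ri, im);       /* im[k] = xr*yi + xi*yr */
    out[0] = re[0]; out[1] = im[0];
    out[2] = re[1]; out[3] = im[1];
}
```

With `x = {1, 2, 0, 1}` and `y = {3, 4, 0, 1}`, the two products are (1+2i)(3+4i) and (i)(i).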
`v128.const` and `v128.shuffle` instructions are always 18 bytes long, which is excessive in the majority of cases. For comparison, native shuffle instructions on x86-64 are at most five bytes long,...
I've taken a stab at documenting the performance tradeoffs between various instructions and collected the information in this repository: https://github.com/zeux/wasm-simd
Obviously this is far from a normative reference, and is...
Hi, add, sub, div, sqrt, mul, and float->integer conversions should all have a third, optional argument to allow changing the rounding mode (nearest, toward plus infinity, toward minus infinity, to...
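As a rough illustration of what an explicit rounding-mode argument could mean for the float->integer case, the sketch below models one lane in portable C. The enum, its member names, and `f32_to_i32` are all hypothetical. The fourth mode (toward zero) is an assumption, since the list above is cut off, and so is the choice of ties-to-even for "nearest".

```c
#include <stdint.h>

/* Hypothetical rounding-mode argument. RM_TOWARD_ZERO is an assumption. */
typedef enum {
    RM_NEAREST_EVEN,
    RM_TOWARD_POS_INF,
    RM_TOWARD_NEG_INF,
    RM_TOWARD_ZERO
} rounding_mode;

/* Convert one lane to int32 under an explicit rounding mode.
   Assumes x is finite and its rounded value fits in int32_t. */
int32_t f32_to_i32(float x, rounding_mode rm) {
    int32_t t = (int32_t)x;            /* C casts truncate toward zero */
    double frac = (double)x - (double)t; /* exact; same sign as x, |frac| < 1 */
    switch (rm) {
    case RM_TOWARD_ZERO:
        return t;
    case RM_TOWARD_POS_INF:
        return frac > 0.0 ? t + 1 : t;
    case RM_TOWARD_NEG_INF:
        return frac < 0.0 ? t - 1 : t;
    case RM_NEAREST_EVEN:
    default:
        if (frac > 0.5 || (frac == 0.5 && (t % 2) != 0)) return t + 1;
        if (frac < -0.5 || (frac == -0.5 && (t % 2) != 0)) return t - 1;
        return t;
    }
}

/* Lane-wise wrapper: the same mode applies to all four lanes. */
void f32x4_to_i32x4(const float in[4], rounding_mode rm, int32_t out[4]) {
    for (int i = 0; i < 4; i++) out[i] = f32_to_i32(in[i], rm);
}
```

The point of the sketch is that the mode is carried by the operation itself and not by global floating-point state, which is what an optional third argument would amount to.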
This is open-ended. The problem is that many key use cases, such as matrix multiplication kernels, need to know the number of SIMD vector registers that they can count on...
Suppose you have two vectors u and v, and you want to multiply all elements of the vector u by a single lane of the vector v, e.g. v[0]. This...
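The operation described here can be modelled in scalar C as a lane broadcast (splat) followed by an ordinary lane-wise multiply. This is only a sketch of the semantics. The function name `f32x4_mul_lane` and the choice of `float` lanes are assumptions made for illustration.

```c
/* Multiply every element of u by a single lane of v.
   Hypothetical helper; lane is taken modulo 4 to stay in bounds. */
void f32x4_mul_lane(const float u[4], const float v[4], int lane,
                    float out[4]) {
    float s = v[lane & 3];        /* step 1: pick the lane to broadcast */
    for (int i = 0; i < 4; i++) {
        out[i] = u[i] * s;        /* step 2: lane-wise multiply by the splat */
    }
}
```

For example, with `u = {1, 2, 3, 4}` and `v = {10, 20, 30, 40}`, lane 0 scales `u` by 10 and lane 2 scales it by 30.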
In SIMD instruction sets where integer arithmetic is really first-class, support for different bit widths in different operands is not just an ad-hoc addition for a few instructions. Instead, generically...
Unlike the float case, where the fused-vs-unfused issue creates complications (PR #79), in the integer case there is no downside to using single-instruction multiply-add. These are vital to getting above...
I looked into single-precision Mersenne Twister after @ngzhian did a double-precision port, and its core function relies on "shift bytes" behavior, represented by `PSLLDQ`/`PSRLDQ` on x86 and `VEXT` on Arm: https://github.com/penzn/SFMT-wasm/blob/master/SFMT-sse2.h#L37...
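For readers unfamiliar with the "shift bytes" behavior, here is a minimal scalar model in C of a whole-vector byte shift with zero fill, in the style of `PSLLDQ`/`PSRLDQ`. The function names are hypothetical, and treating byte 0 as the least significant byte (little-endian lane order) is an assumption. This is not the code from the linked file.

```c
#include <stdint.h>
#include <string.h>

/* Shift the 16-byte vector toward higher byte indices by n bytes,
   filling with zeros (PSLLDQ-style). n >= 16 yields all zeros. */
void v128_shl_bytes(const uint8_t in[16], unsigned n, uint8_t out[16]) {
    uint8_t r[16] = {0};
    if (n < 16) memcpy(r + n, in, 16 - n);
    memcpy(out, r, 16);           /* temp buffer keeps in == out safe */
}

/* Shift the 16-byte vector toward lower byte indices by n bytes,
   filling with zeros (PSRLDQ-style). n >= 16 yields all zeros. */
void v128_shr_bytes(const uint8_t in[16], unsigned n, uint8_t out[16]) {
    uint8_t r[16] = {0};
    if (n < 16) memcpy(r, in + n, 16 - n);
    memcpy(out, r, 16);
}
```

Each output byte is either an input byte at a fixed offset or zero, so for a constant `n` the same result can be described as a byte shuffle of the input with an all-zero vector.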