WASM SIMD - Single instruction shuffle
The current implementation of the wasm SIMD shuffle produces a scalar loop over the bytes of the vector. That is a leftover from the time when there were multiple shuffles with different lane widths, and the byte shuffle was considered lower priority than the other kinds.
To solve this, Chakra would need to emit byte shuffle instructions (pshufb on x86).
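As a rough sketch of what the lowering could compute (not Chakra's actual code): pshufb permutes a single source register, while the wasm shuffle picks bytes from two inputs, so one possible scheme is to split the 16 indices into two control masks, shuffle each input separately, and OR the results. The code below models pshufb in portable scalar C++ only to illustrate the mask math. The names `ModelPshufb`, `SplitShuffleMask`, and `ShuffleViaTwoPshufb` are hypothetical, and the indices are assumed to be already validated to the range 0..31.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Scalar model of x86 pshufb: a result byte is zero when the high bit of its
// control byte is set, otherwise it is the source byte selected by the low
// four bits of the control byte.
Bytes16 ModelPshufb(const Bytes16& src, const Bytes16& control)
{
    Bytes16 result{};
    for (std::size_t i = 0; i < 16; ++i)
    {
        result[i] = (control[i] & 0x80) ? 0 : src[control[i] & 0x0F];
    }
    return result;
}

// Hypothetical helper: the two pshufb control masks for one wasm shuffle.
struct ShuffleControls
{
    Bytes16 forA;
    Bytes16 forB;
};

// Assumption: indices are in 0..31, where 0..15 select from the first input
// and 16..31 select from the second input.
ShuffleControls SplitShuffleMask(const Bytes16& indices)
{
    ShuffleControls controls{};
    for (std::size_t i = 0; i < 16; ++i)
    {
        const std::uint8_t index = indices[i];
        // 0x80 makes pshufb write zero, so the OR below keeps the other side.
        controls.forA[i] = (index < 16) ? index : 0x80;
        controls.forB[i] = (index < 16) ? 0x80 : static_cast<std::uint8_t>(index - 16);
    }
    return controls;
}

// Two byte shuffles plus one OR replace the per-byte scalar loop.
Bytes16 ShuffleViaTwoPshufb(const Bytes16& a, const Bytes16& b, const Bytes16& indices)
{
    const ShuffleControls controls = SplitShuffleMask(indices);
    const Bytes16 fromA = ModelPshufb(a, controls.forA);
    const Bytes16 fromB = ModelPshufb(b, controls.forB);
    Bytes16 result{};
    for (std::size_t i = 0; i < 16; ++i)
    {
        result[i] = fromA[i] | fromB[i];
    }
    return result;
}
```

Since the shuffle indices are immediates, both control masks could be computed at JIT time and loaded as constants, leaving only the two shuffles and the OR in the generated code.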
Extra credit: detect shuffle masks that could be lowered to hardware instructions other than byte shuffles. Getting rid of the scalar loop alone would already be a significant improvement.
One mask-detection case would be pretty straightforward to implement: opcodes that normally result in Simd128LowerShuffle but whose mask could instead be handled by Simd128LowerShuffle_4.
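A minimal sketch of such a check, assuming that the four-lane path applies when the byte mask moves whole, aligned 32-bit lanes: each group of four mask bytes must be consecutive and start at a multiple of four. `TryGetLaneMask4` is a hypothetical helper, not an existing Chakra function, and how Simd128LowerShuffle_4 actually consumes lane indices is not shown here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using ByteMask = std::array<std::uint8_t, 16>;
using LaneMask = std::array<std::uint8_t, 4>;

// Returns true and fills `lanes` with 32-bit lane indices (0..7, where 4..7
// refer to the second input) when `bytes` only moves whole, aligned 4-byte
// lanes. Returns false for any other byte mask.
bool TryGetLaneMask4(const ByteMask& bytes, LaneMask& lanes)
{
    for (std::size_t lane = 0; lane < 4; ++lane)
    {
        const std::uint8_t first = bytes[lane * 4];
        if (first % 4 != 0)
        {
            return false; // the lane does not start on a 4-byte boundary
        }
        for (std::size_t k = 1; k < 4; ++k)
        {
            if (bytes[lane * 4 + k] != first + k)
            {
                return false; // bytes within the lane are not consecutive
            }
        }
        lanes[lane] = static_cast<std::uint8_t>(first / 4);
    }
    return true;
}
```

Masks that fail this check would fall back to the general byte shuffle path.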