
v128.const and v128.shuffle blow up code size

Open Maratyszcza opened this issue 4 years ago • 5 comments

v128.const and v128.shuffle instructions are always 18 bytes long, which is excessive in the majority of cases. For comparison, native shuffle instructions on x86-64 are at most five bytes long, even though they are more restricted in shuffle patterns. As an example, here's an implementation of a 4x4 matrix transpose in WAsm SIMD and SSE intrinsics:

#include <wasm_simd128.h>

void transpose4x4(float matrix[16]) {
	const v128_t row0 = wasm_v128_load(matrix);
	const v128_t row1 = wasm_v128_load(matrix + 4);
	const v128_t row2 = wasm_v128_load(matrix + 8);
	const v128_t row3 = wasm_v128_load(matrix + 12);

	const v128_t tmp0 = wasm_v32x4_shuffle(row0, row1, 0, 4, 1, 5);
	const v128_t tmp1 = wasm_v32x4_shuffle(row2, row3, 0, 4, 1, 5);
	const v128_t tmp2 = wasm_v32x4_shuffle(row0, row1, 2, 6, 3, 7);
	const v128_t tmp3 = wasm_v32x4_shuffle(row2, row3, 2, 6, 3, 7);

	const v128_t col0 = wasm_v32x4_shuffle(tmp0, tmp1, 0, 1, 4, 5);
	const v128_t col1 = wasm_v32x4_shuffle(tmp0, tmp1, 2, 3, 6, 7);
	const v128_t col2 = wasm_v32x4_shuffle(tmp2, tmp3, 0, 1, 4, 5);
	const v128_t col3 = wasm_v32x4_shuffle(tmp2, tmp3, 2, 3, 6, 7);

	wasm_v128_store(matrix, col0);
	wasm_v128_store(matrix + 4, col1);
	wasm_v128_store(matrix + 8, col2);
	wasm_v128_store(matrix + 12, col3);
}

#include <xmmintrin.h>

void transpose(float matrix[16]) {
	const __m128 row0 = _mm_loadu_ps(matrix);
	const __m128 row1 = _mm_loadu_ps(matrix + 4);
	const __m128 row2 = _mm_loadu_ps(matrix + 8);
	const __m128 row3 = _mm_loadu_ps(matrix + 12);

	const __m128 tmp0 = _mm_unpacklo_ps(row0, row1);
	const __m128 tmp2 = _mm_unpacklo_ps(row2, row3);
	const __m128 tmp1 = _mm_unpackhi_ps(row0, row1);
	const __m128 tmp3 = _mm_unpackhi_ps(row2, row3);
	
	const __m128 col0 = _mm_movelh_ps(tmp0, tmp2);
	const __m128 col1 = _mm_movehl_ps(tmp2, tmp0);
	const __m128 col2 = _mm_movelh_ps(tmp1, tmp3);
	const __m128 col3 = _mm_movehl_ps(tmp3, tmp1);

	_mm_storeu_ps(matrix, col0);
	_mm_storeu_ps(matrix + 4, col1);
	_mm_storeu_ps(matrix + 8, col2);
	_mm_storeu_ps(matrix + 12, col3);
}

The x86-64 SSE version compiles to 67 bytes of code, but the WAsm SIMD version produces 231 bytes of code, 3.5x larger.
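For a concrete sense of where those bytes go, here is a rough byte-level sketch of one shuffle from each version; the Wasm opcode value assumes the proposal's 0xfd-prefix numbering, and the lane indices are simply the byte expansion of the first wasm_v32x4_shuffle above:

/* Rough sketch: the same "interleave low 32-bit lanes" operation, byte for byte. */
static const unsigned char wasm_shuffle_encoding[18] = {
	0xfd, 0x0d,                    /* prefix + v8x16.shuffle opcode */
	0, 1, 2, 3, 16, 17, 18, 19,    /* 16 immediate byte lane indices, */
	4, 5, 6, 7, 20, 21, 22, 23,    /* i.e. 32-bit lanes 0, 4, 1, 5 */
};

static const unsigned char sse_unpcklps_encoding[3] = {
	0x0f, 0x14, 0xc1,              /* unpcklps xmm0, xmm1 */
};

Eight such shuffles account for 144 of the 231 bytes on the Wasm side, while the eight unpck/movlh/movhl instructions in the SSE version take roughly 3 bytes each.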

I suggest that WAsm SIMD define more restricted variants of v128.shuffle and v128.const that could be given a more concise encoding.
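To illustrate the v128.const half of this, even trivial constants pay the full 16-byte immediate. A minimal sketch (the native lowerings mentioned in the comment are typical x86-64 choices, not something the spec requires):

#include <wasm_simd128.h>

/* Each of these typically compiles to an 18-byte v128.const instruction
   (prefix, opcode, 16 immediate bytes), even though a native code generator
   would usually materialize the zero vector with xorps xmm, xmm (3 bytes)
   and the all-ones mask with pcmpeqd xmm, xmm (4 bytes). */
v128_t zero_vec(void)  { return wasm_f32x4_const(0.0f, 0.0f, 0.0f, 0.0f); }
v128_t ones_mask(void) { return wasm_i32x4_const(-1, -1, -1, -1); }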

Maratyszcza · Jun 12 '20 19:06

Agreed that these instructions are very large, but they're likely to be relatively rare in the module. I'd be curious to see how often they occur in real modules.

binji · Jun 12 '20 19:06

Shuffles are very common in realtime 3D applications, from regular 3x3, 3x4, and 4x4 matrix math to quaternion math. Over a whole application it might not contribute all that much, but in a hot code path it definitely could add significant overhead. As long as the bytecode can be compiled into efficient machine code, it should go down in asm size and not be a huge deal IMO.
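For concreteness, a minimal sketch of the kind of hot-path primitive being described: a 3-component cross product (w lane ignored), written with the same intrinsics as the transpose example; the helper name is illustrative rather than taken from any particular library.

#include <wasm_simd128.h>

/* Illustrative helper: cross(a, b) via the usual yzx-rotation trick.
   Three shuffles per call, each an 18-byte shuffle instruction in the
   encoding, so shuffle immediates dominate the size of small routines
   like this. */
static inline v128_t cross3(v128_t a, v128_t b) {
	const v128_t a_yzx = wasm_v32x4_shuffle(a, a, 1, 2, 0, 3);
	const v128_t b_yzx = wasm_v32x4_shuffle(b, b, 1, 2, 0, 3);
	const v128_t c = wasm_f32x4_sub(wasm_f32x4_mul(a, b_yzx),
	                                wasm_f32x4_mul(a_yzx, b));
	return wasm_v32x4_shuffle(c, c, 1, 2, 0, 3);
}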

nfrechette · Jun 12 '20 20:06

@binji just a 4x4 transpose ended up with 8 of them, and they would be present in most code that rearranges elements in its input, for example code dealing with matrices or bitmapped graphics. Free-form shuffles are relatively rare as real-world instructions, but the operations demonstrated in the example above are quite common in SIMD code.

Part of the problem is that we end up using v8x16.shuffle for things that won't be an actual shuffle instruction at runtime (and should not be, for performance reasons). A lot of the common algorithms won't use "true" native shuffles at all and would use more compact instructions instead, so the extra bits we use to encode v8x16.shuffle masks for those won't actually be used by the runtime (aside from detecting which variant it can emit). If we had at least some of the common operations (#199) covered, we wouldn't be using v8x16.shuffle as often.
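To illustrate the lowering point, the same shuffle opcode covers patterns with very different native translations; the lowerings noted in the comments below are typical x86-64 choices, not something the proposal dictates:

#include <wasm_simd128.h>

/* All three are 18-byte shuffle instructions in the encoding, but only the
   last one is likely to become a "true" shuffle (pshufb with a 16-byte mask
   loaded from memory) on x86-64. */
v128_t splat_x(v128_t v) {
	return wasm_v32x4_shuffle(v, v, 0, 0, 0, 0);        /* shufps/pshufd imm8 */
}
v128_t interleave_lo(v128_t a, v128_t b) {
	return wasm_v32x4_shuffle(a, b, 0, 4, 1, 5);        /* unpcklps */
}
v128_t reverse_bytes(v128_t v) {
	return wasm_v8x16_shuffle(v, v, 15, 14, 13, 12, 11, 10, 9, 8,
	                          7, 6, 5, 4, 3, 2, 1, 0);  /* pshufb + constant mask */
}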

@Maratyszcza thank you for the great illustration!

penzn · Jun 13 '20 01:06

Given all the related discussion we've had about shuffles, I would really want to see the code size impact on a whole application (even one constructed to heavily use shuffles) before reconsidering whether we should have more specific shuffle operations.

tlively · Jun 13 '20 01:06

Worth noting is that repeated shuffle masks should compress reasonably well. I've compiled both examples into a binary (extracting just the "machine" code) and used zstd to compress them (zstd over gzip since it has smaller headers; I'm unsure how to coerce gzip to emit raw deflate without the cruft):

/mnt/c/work/shuffles $ zstd transpose-sse.o -9
transpose-sse.o      :118.31%   (    71 =>     84 bytes, transpose-sse.o.zst)
/mnt/c/work/shuffles $ zstd transpose-wasm.o -9
transpose-wasm.o     : 60.79%   (   227 =>    138 bytes, transpose-wasm.o.zst)

So yes, there's a large impact pre-compression, but it's less significant post-compression. On a larger-scale program I feel like the efficiency of LZ matches is going to be even better: here the matches are likely mostly 4 bytes wide and occasionally 8 bytes wide for shuffle masks, whereas in a larger program I'd expect few unique shuffle masks and deduplication (through LZ) across different functions.

zeux · Jun 13 '20 02:06