
Benchmarks

Open htot opened this issue 3 years ago • 4 comments

@aklomp @mayeut Again a draft. Please ignore the Benchmarks patch; I was too far along to drop that and rebase against HEAD.

The interesting one is "codec: add ssse3_atom".

My experience with CRC32C on Silvermont Atom (SLM) processors is that in 64b mode certain combinations of instructions incur a penalty (see the Intel manuals), which in some cases turns the advantage of running in 64b mode into a net loss. On later Atoms (Goldmont, Airmont) this penalty likely does not occur, but I don't have the hardware to test. Running base64 on SLM shows strange performance regressions, while a Core i7 shows improvements.
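As an illustration of the kind of check meant here, a minimal timing sketch comparing CRC32C accumulation with the 32-bit versus 64-bit operand forms of the crc32 instruction might look as follows. This is not the benchmark behind the numbers below; the buffer size, repeat count, and data pattern are arbitrary, and it needs SSE4.2 plus a 64-bit build for the second loop.

```c
/* Sketch only: time CRC32C over a fixed buffer using 32-bit vs. 64-bit
 * operand forms, to expose a possible penalty on the 64-bit form.
 * Build with -msse4.2. */
#include <nmmintrin.h>   /* _mm_crc32_u32, _mm_crc32_u64 */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
	static uint8_t buf[1 << 20];  /* 1 MB of fixed pseudo-data */
	memset(buf, 0xA5, sizeof(buf));

	uint32_t crc32 = ~0u;
	uint64_t crc64 = ~0ull;
	double t;

	/* 32-bit operand form: crc32 r32, r/m32 */
	t = now_sec();
	for (int r = 0; r < 100; r++)
		for (size_t i = 0; i < sizeof(buf); i += 4) {
			uint32_t w;
			memcpy(&w, buf + i, sizeof(w));
			crc32 = _mm_crc32_u32(crc32, w);
		}
	printf("u32: %.3f s (crc %08" PRIx32 ")\n", now_sec() - t, crc32);

	/* 64-bit operand form: REX.W crc32 r64, r/m64 (64-bit mode only) */
	t = now_sec();
	for (int r = 0; r < 100; r++)
		for (size_t i = 0; i < sizeof(buf); i += 8) {
			uint64_t w;
			memcpy(&w, buf + i, sizeof(w));
			crc64 = _mm_crc32_u64(crc64, w);
		}
	printf("u64: %.3f s (crc %016" PRIx64 ")\n", now_sec() - t, crc64);

	return 0;
}
```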

So, I revived the best ssse3 codec as ssse3_atom and tested it on an Intel Edison (dual-core, 500MHz) in 64b/32b mode (because that is easy to do) and on an Intel NUC with a Baytrail Atom in 64b (to show the relevance on a mainstream CPU).

Min speed (MB/sec), per direction and codec:

| Processor | decode, plain | decode, SSSE3 | decode, SSSE3_ATOM | encode, plain | encode, SSSE3 | encode, SSSE3_ATOM |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Atom E3815 @ 1.46GHz (64b) | 326 | 449 | **565** | 441 | 569 | *556* |
| Edison @ 500MHz (32b) | 40 | 102 | 103 | 67 | 111 | 111 |
| Edison @ 500MHz (64b) | 119 | 164 | **206** | 162 | 209 | *204* |
| i7-10700 CPU @ 2.90GHz | 3997 | 9356 | *4685* | 4387 | 8823 | *7593* |

Improvement by going back to the revived codec in bold, degradation in italic.

We see that on the i7 the latest version is indeed the fastest, and on SLM in 32-bit mode there is no difference. But on SLM in 64b, SSSE3_ATOM decodes about 25% faster (565 vs. 449 MB/s on the E3815, 206 vs. 164 MB/s on Edison). Now, having a fast algorithm has a much more noticeable effect on a slow Atom than on a fast i7... So what do you guys think, should we add a specialized SSSE3 codec for SLM?
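For reference, pinning a codec for an A/B comparison like the one above might look like the following minimal sketch, assuming the library's libbase64.h header and its BASE64_FORCE_* flags; BASE64_FORCE_SSSE3_ATOM is hypothetical and only stands in for whatever flag a new SLM codec would get.

```c
/* Sketch: forcing a specific codec through the public API.
 * BASE64_FORCE_SSSE3_ATOM (mentioned in a comment below) is hypothetical. */
#include <libbase64.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char src[] = "Silvermont";
	char enc[64], dec[64];
	size_t enclen, declen;

	/* Encode with the generic SSSE3 codec forced. */
	base64_encode(src, strlen(src), enc, &enclen, BASE64_FORCE_SSSE3);
	printf("encoded: %.*s\n", (int)enclen, enc);

	/* Decode with the plain (scalar) codec; returns 1 on success. */
	if (base64_decode(enc, enclen, dec, &declen, BASE64_FORCE_PLAIN))
		printf("decoded: %.*s\n", (int)declen, dec);

	/* A dedicated SLM codec would presumably be selected the same way:
	 * base64_encode(src, strlen(src), enc, &enclen, BASE64_FORCE_SSSE3_ATOM);
	 */
	return 0;
}
```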

htot · Jun 22 '22 20:06

@aqrit?

htot · Jun 23 '22 22:06

For dec_loop, #46 is probably faster, though it does trade readability for speed.

dec_reshuffle without _mm_madd_epi16 could look like this:

```c
// Pack 16 6-bit values into 12 bytes
// (wasm doesn't have pmaddubsw (but does have pmaddw))
// intrinsics from <wasm_simd128.h>; v is the input vector
const v128_t shuf = wasm_i8x16_const(2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, -1, -1, -1, -1);
v = wasm_v128_or(wasm_u16x8_shr(v, 6), wasm_i16x8_shl(v, 8));    // 00cccccc|dddddd00|00aaaaaa|bbbbbb00
v = wasm_v128_or(wasm_u32x4_shr(v, 18), wasm_i32x4_shl(v, 10));  // dddd0000|aaaaaabb|bbbbcccc|ccdddddd
v = wasm_i8x16_swizzle(v, shuf);                                 //       ..|ccdddddd|bbbbcccc|aaaaaabb
```

I don't know if it has better latency, but it does have fewer instructions and constants... edit: in comparison to dec_reshuffle in this PR.
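For comparison, a rough x86/SSSE3 port of the same three steps might look like the sketch below; this is only an illustrative translation of the wasm snippet above, not code from this PR or from #46.

```c
/* Rough x86 equivalent of the wasm sketch above; same bit layout as in
 * its comments. _mm_shuffle_epi8 (SSSE3) zeroes lanes whose index has
 * the high bit set, so the -1 entries behave like the wasm swizzle here. */
#include <tmmintrin.h>

static __m128i dec_reshuffle_no_madd(__m128i v)
{
	const __m128i shuf = _mm_setr_epi8(2, 1, 0, 6, 5, 4, 10, 9, 8,
	                                   14, 13, 12, -1, -1, -1, -1);

	v = _mm_or_si128(_mm_srli_epi16(v, 6), _mm_slli_epi16(v, 8));
	v = _mm_or_si128(_mm_srli_epi32(v, 18), _mm_slli_epi32(v, 10));
	return _mm_shuffle_epi8(v, shuf);
}
```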

aqrit · Jun 23 '22 23:06

Yeah, this draft PR just revives an older version of the codec which showed better performance than the current one (on SLM). I didn't try to create my own improvement. PR #46 is a bit older; did you benchmark it on Atom at the time?

htot · Jun 24 '22 07:06

@aqrit would you rebase #46 on master? I'd like to run benchmarks on edison/atom

htot · Jun 24 '22 22:06