
Benchmarks

Open htot opened this issue 3 years ago • 4 comments

@aklomp @mayeut Again a draft. Please ignore the Benchmarks patch; I was too far along to drop that and rebase against HEAD.

The interesting one is "codec: add ssse3_atom".

My experience with CRC32C on Silvermont Atom (SLM) processors is that in 64b mode certain combinations of instructions incur a penalty (see the Intel manuals), which in some cases turns the advantage of running in 64b mode into a net loss. On later Atoms (Goldmont, Airmont) this penalty likely does not occur, but I don't have the hardware to test. Running base64 on SLM shows strange performance regressions, while a Core i7 shows improvements.
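As an illustration of the kind of check meant here, a minimal timing sketch comparing CRC32C accumulation with the 32-bit versus 64-bit operand forms of the crc32 instruction might look as follows. This is not the benchmark behind the numbers below; the buffer size, repeat count, and data pattern are arbitrary, and it needs SSE4.2 plus a 64-bit build for the second loop.

```c
/* Sketch only: time CRC32C over a fixed buffer using 32-bit vs. 64-bit
 * operand forms, to expose a possible penalty on the 64-bit form.
 * Build with -msse4.2. */
#include <nmmintrin.h>   /* _mm_crc32_u32, _mm_crc32_u64 */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
	static uint8_t buf[1 << 20];  /* 1 MB of fixed pseudo-data */
	memset(buf, 0xA5, sizeof(buf));

	uint32_t crc32 = ~0u;
	uint64_t crc64 = ~0ull;
	double t;

	/* 32-bit operand form: crc32 r32, r/m32 */
	t = now_sec();
	for (int r = 0; r < 100; r++)
		for (size_t i = 0; i < sizeof(buf); i += 4) {
			uint32_t w;
			memcpy(&w, buf + i, sizeof(w));
			crc32 = _mm_crc32_u32(crc32, w);
		}
	printf("u32: %.3f s (crc %08" PRIx32 ")\n", now_sec() - t, crc32);

	/* 64-bit operand form: REX.W crc32 r64, r/m64 (64-bit mode only) */
	t = now_sec();
	for (int r = 0; r < 100; r++)
		for (size_t i = 0; i < sizeof(buf); i += 8) {
			uint64_t w;
			memcpy(&w, buf + i, sizeof(w));
			crc64 = _mm_crc32_u64(crc64, w);
		}
	printf("u64: %.3f s (crc %016" PRIx64 ")\n", now_sec() - t, crc64);

	return 0;
}
```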

So, I revived the best ssse3 codec as ssse3_atom and tested it on an Intel Edison (dual-core, 500MHz) in 64b/32b mode (because that is easy to do) and on an Intel NUC with a Baytrail Atom in 64b (to show the relevance on a mainstream CPU).

Min speed (MB/sec), per direction and codec:

| Processor | decode, plain | decode, SSSE3 | decode, SSSE3_ATOM | encode, plain | encode, SSSE3 | encode, SSSE3_ATOM |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Atom E3815 @ 1.46GHz (64b) | 326 | 449 | **565** | 441 | 569 | *556* |
| Edison @ 500MHz (32b) | 40 | 102 | 103 | 67 | 111 | 111 |
| Edison @ 500MHz (64b) | 119 | 164 | **206** | 162 | 209 | *204* |
| i7-10700 CPU @ 2.90GHz | 3997 | 9356 | *4685* | 4387 | 8823 | *7593* |

Improvement by going back to the revived codec in bold, degradation in italic.

We see that on the i7 the latest version is indeed the fastest, and on SLM in 32-bit mode there is no difference. But on SLM in 64b, SSSE3_ATOM decodes about 25% faster (565 vs. 449 MB/s on the E3815, 206 vs. 164 MB/s on Edison). Now, having a fast algorithm has a much more noticeable effect on a slow Atom than on a fast i7... So what do you guys think, should we add a specialized SSSE3 codec for SLM?
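For reference, pinning a codec for an A/B comparison like the one above might look like the following minimal sketch, assuming the library's libbase64.h header and its BASE64_FORCE_* flags; BASE64_FORCE_SSSE3_ATOM is hypothetical and only stands in for whatever flag a new SLM codec would get.

```c
/* Sketch: forcing a specific codec through the public API.
 * BASE64_FORCE_SSSE3_ATOM (mentioned in a comment below) is hypothetical. */
#include <libbase64.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char src[] = "Silvermont";
	char enc[64], dec[64];
	size_t enclen, declen;

	/* Encode with the generic SSSE3 codec forced. */
	base64_encode(src, strlen(src), enc, &enclen, BASE64_FORCE_SSSE3);
	printf("encoded: %.*s\n", (int)enclen, enc);

	/* Decode with the plain (scalar) codec; returns 1 on success. */
	if (base64_decode(enc, enclen, dec, &declen, BASE64_FORCE_PLAIN))
		printf("decoded: %.*s\n", (int)declen, dec);

	/* A dedicated SLM codec would presumably be selected the same way:
	 * base64_encode(src, strlen(src), enc, &enclen, BASE64_FORCE_SSSE3_ATOM);
	 */
	return 0;
}
```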

htot · Jun 22 '22 20:06

@aqrit?

htot · Jun 23 '22 22:06

For dec_loop, #46 is probably faster, though it does trade readability for speed.

dec_reshuffle without _mm_madd_epi16 could look like this:

```c
// Pack 16 6-bit values into 12 bytes
// (wasm doesn't have pmaddubsw (but does have pmaddw))
// intrinsics from <wasm_simd128.h>; v is the input vector
const v128_t shuf = wasm_i8x16_const(2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, -1, -1, -1, -1);
v = wasm_v128_or(wasm_u16x8_shr(v, 6), wasm_i16x8_shl(v, 8));    // 00cccccc|dddddd00|00aaaaaa|bbbbbb00
v = wasm_v128_or(wasm_u32x4_shr(v, 18), wasm_i32x4_shl(v, 10));  // dddd0000|aaaaaabb|bbbbcccc|ccdddddd
v = wasm_i8x16_swizzle(v, shuf);                                 //       ..|ccdddddd|bbbbcccc|aaaaaabb
```

I don't know if it has better latency, but it does have fewer instructions and constants... edit: in comparison to dec_reshuffle in this PR.
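For comparison, a rough x86/SSSE3 port of the same three steps might look like the sketch below; this is only an illustrative translation of the wasm snippet above, not code from this PR or from #46.

```c
/* Rough x86 equivalent of the wasm sketch above; same bit layout as in
 * its comments. _mm_shuffle_epi8 (SSSE3) zeroes lanes whose index has
 * the high bit set, so the -1 entries behave like the wasm swizzle here. */
#include <tmmintrin.h>

static __m128i dec_reshuffle_no_madd(__m128i v)
{
	const __m128i shuf = _mm_setr_epi8(2, 1, 0, 6, 5, 4, 10, 9, 8,
	                                   14, 13, 12, -1, -1, -1, -1);

	v = _mm_or_si128(_mm_srli_epi16(v, 6), _mm_slli_epi16(v, 8));
	v = _mm_or_si128(_mm_srli_epi32(v, 18), _mm_slli_epi32(v, 10));
	return _mm_shuffle_epi8(v, shuf);
}
```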

aqrit · Jun 23 '22 23:06

Yeah, this draft PR just revives an older version of the codec which showed better performance than the current one (on SLM). I didn't try to create my own improvement. PR #46 is a bit older; did you benchmark it on Atom at the time?

htot · Jun 24 '22 07:06

@aqrit would you rebase #46 on master? I'd like to run benchmarks on edison/atom

htot · Jun 24 '22 22:06