tlsn
Benchmark block ciphers without hardware acceleration
Let's benchmark Salsa vs AES (or any other candidate ciphers) executed in WASM to see which performs better when HW acceleration is not available.
AES implementation: https://github.com/RustCrypto/block-ciphers/blob/master/aes/src/soft.rs
Ran the tests with this gist https://gist.github.com/themighty1/9feb7ad9d3938cc2b06b0e30286a9c55
The numbers are: wasm AES is 2x faster than JS Salsa; wasm Salsa is 6.5x faster than wasm AES (and consequently 13x faster than JS Salsa).
I noticed that when I load the html in the browser, I get the stats mentioned above. However, when invoking main() from the browser console after the page has loaded, I get ~4x worse speeds for Salsa and ~6x worse speeds for AES. Go figure.
I loaded the wasm inside a Chrome extension as a sanity check; I get the same results in the extension as when I reload the html page in the browser: AES bench - 750 ms, Salsa bench - 150 ms. With JS Salsa, 100K invocations take 1000 ms.
We are looking at a 6x speed-up of wasm Salsa vs JS Salsa (not 13x as I initially thought). Actually, the initial 13x number came about because I was benching in Firefox. Turns out that in FF wasm Salsa is 2x faster than in Chrome.
I discovered that for some reason JS Salsa is faster than wasm Salsa, so I thought maybe we could do gate-by-gate garbling in wasm but call out to JS for the Salsa operations, so I put together this bench.
I had this chunk of code running inside a Chrome extension, which imitates garbling a 6400-AND-gate AES circuit:
console.time('bench');
// fixedKey and Salsa20 are assumed to be in scope (e.g. tweetnacl's Salsa20)
let temp1 = crypto.getRandomValues(new Uint8Array(16));
let temp2 = temp1.slice();
let temp3 = temp1.slice();
let temp4 = temp1.slice();
for (let i = 0; i < 6400; i++) {
  call_wasm();
  temp1 = Salsa20(fixedKey, temp1);
  temp2 = Salsa20(fixedKey, temp2);
  temp3 = Salsa20(fixedKey, temp3);
  temp4 = Salsa20(fixedKey, temp4);
}
console.timeEnd('bench');
console.log(temp1, temp2, temp3, temp4);
call_wasm() calls a wasm function with no arguments and returns nothing (in real life we will pass args as memory pointers to avoid overhead). I got:
main.js:48 bench: 30.68408203125 ms
main.js:48 bench: 19.382080078125 ms
main.js:48 bench: 11.636962890625 ms
main.js:48 bench: 11.719970703125 ms
main.js:48 bench: 11.31787109375 ms
main.js:48 bench: 12.278076171875 ms
main.js:48 bench: 11.51416015625 ms
main.js:48 bench: 12.4267578125 ms
main.js:48 bench: 12.6298828125 ms
main.js:48 bench: 11.842041015625 ms
main.js:48 bench: 12.01611328125 ms
Seems like the first 2 invocations are doing some kind of warm-up, but after that we get a consistent ~12 ms per garbling. Back when I was benching PageSigner, I was always getting ~200 ms for garbling 1 AES circuit. So, in JS the main overhead comes from reading/writing the array of gates. It remains to be seen how much of that overhead a Rust implementation will save.
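For anyone reproducing this without the extension setup: Salsa20 and fixedKey above come from tweetnacl. A rough self-contained sketch of the Salsa20/20 core (for illustration only; this is not tweetnacl's actual code, and key/nonce expansion into the 16-word state is omitted):

```javascript
// Minimal Salsa20/20 core over a 16-word (64-byte) state.
function rotl(x, n) {
  return ((x << n) | (x >>> (32 - n))) >>> 0;
}

// One quarter-round applied in place to words a, b, c, d of x.
function quarterRound(x, a, b, c, d) {
  x[b] = (x[b] ^ rotl((x[a] + x[d]) >>> 0, 7)) >>> 0;
  x[c] = (x[c] ^ rotl((x[b] + x[a]) >>> 0, 9)) >>> 0;
  x[d] = (x[d] ^ rotl((x[c] + x[b]) >>> 0, 13)) >>> 0;
  x[a] = (x[a] ^ rotl((x[d] + x[c]) >>> 0, 18)) >>> 0;
}

// input: Uint32Array(16) holding key/nonce/constants; returns Uint32Array(16).
function salsa20Core(input) {
  const x = Uint32Array.from(input);
  for (let i = 0; i < 10; i++) { // 10 double rounds = 20 rounds
    // column round
    quarterRound(x, 0, 4, 8, 12);
    quarterRound(x, 5, 9, 13, 1);
    quarterRound(x, 10, 14, 2, 6);
    quarterRound(x, 15, 3, 7, 11);
    // row round
    quarterRound(x, 0, 1, 2, 3);
    quarterRound(x, 5, 6, 7, 4);
    quarterRound(x, 10, 11, 8, 9);
    quarterRound(x, 15, 12, 13, 14);
  }
  // final feed-forward: add the original input words back in
  for (let i = 0; i < 16; i++) x[i] = (x[i] + input[i]) >>> 0;
  return x;
}
```

All the work is 32-bit adds, xors, and rotates, which is why a tight JS implementation on a modern JIT can keep up with non-SIMD wasm.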
Would you mind pushing these benchmarks into a repo I can clone? I'm interested in tinkering with this as well.
Sure. I'm seeing some non-deterministic behaviour with my benches; I'll iron out the bugs and publish the tests.
Here it is https://github.com/themighty1/salsabench Just follow the README
Notice that when reloading the page you'll get a certain "tweetnacl Salsa bench" number. This number will usually be 2x larger than the actual number I was getting in the Chrome extension after a 1-2 invocation warm-up (I guess the warm-up primes the caches).
Either way, even without the warm-up, JS Salsa is the clear winner.
I made a separate benchmark of the cost of shared-memory IPC between JS<>wasm. The result on Chrome is +30% to the execution time, which is still tolerable and leaves JS Salsa the winner.
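The IPC being measured is essentially copying blocks into the module's linear memory and reading results back. A minimal sketch of that pattern (no actual wasm module here, just the JS side of the copies; the offsets and the process_block export are hypothetical):

```javascript
// Share data with wasm via linear memory instead of function arguments:
// JS writes input bytes into the module's memory, the wasm side reads and
// writes in place, then JS reads the result back out.
const memory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB
const heap = new Uint8Array(memory.buffer);

const INPUT_PTR = 0;   // hypothetical offsets agreed with the wasm side
const OUTPUT_PTR = 64;

function writeInput(block) {
  heap.set(block, INPUT_PTR);
}

function readOutput(len) {
  // slice() copies out, so the result survives later overwrites of the heap
  return heap.slice(OUTPUT_PTR, OUTPUT_PTR + len);
}

// per garbled gate: write a 16-byte block, call into wasm, read back
const block = new Uint8Array(16).fill(7);
writeInput(block);
// wasmInstance.exports.process_block(INPUT_PTR, OUTPUT_PTR); // real call
heap.copyWithin(OUTPUT_PTR, INPUT_PTR, INPUT_PTR + 16);       // stand-in for wasm
const result = readOutput(16);
```

The +30% above is the cost of exactly these set/slice copies repeated per block, which is why batching them matters.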
Here tweetnacl's author says that his JS lib is faster than wasm: https://github.com/dchest/tweetnacl-js/issues/141 twitter link: "- Salsa20 compiled to WebAssembly is not faster than JS until it gets SIMD."
Updated my salsabench repo with a bench of chacha20 from libsodium.js. It is 10x slower than tweetnacl's salsa20.
This makes sense, I wasn't aware that JS was able to take advantage of hardware acceleration as well as it does. So in browser contexts, calling out to JS for crypto operations will be the way to go. We can work to vectorize as many operations as possible to minimize communication overhead between runtimes.
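One way to "vectorize" the boundary is to batch: instead of one JS<>wasm crossing per gate, pack many 16-byte blocks into one contiguous buffer and hand the whole batch to the crypto side in a single call. A sketch of the idea (cipherOneBlock is a placeholder transform standing in for a real per-block cipher call):

```javascript
// Placeholder for a real per-block cipher (e.g. a Salsa20 call).
function cipherOneBlock(block) {
  return block.map(b => (b ^ 0x5c) & 0xff); // toy transform, illustration only
}

// One call processes `count` contiguous blocks in place, so the runtime
// boundary is crossed once per batch rather than once per block.
function cipherBatch(buf, blockLen, count) {
  for (let i = 0; i < count; i++) {
    const block = buf.subarray(i * blockLen, (i + 1) * blockLen);
    block.set(cipherOneBlock(block));
  }
}

const BLOCK = 16, GATES = 6400;
const batch = new Uint8Array(BLOCK * GATES); // all gates' inputs, packed
cipherBatch(batch, BLOCK, GATES);            // one crossing instead of 6400
```

The per-call overhead is then amortized over the whole circuit rather than paid per gate.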
According to https://webassembly.org/roadmap/ the major browsers have implemented 128-bit SIMD for wasm. I'm not sure if the Rust libraries take advantage of it, but the results below suggest that they do.
I forked your bench and added a rust library for Instant support in wasm for easier timings. Running it on my end shows wasm to be only slightly slower than the JS libraries:
libsodium Chacha: 178.864013671875 ms
tweetnacl Salsa bench: 36.630859375 ms
AES 150ms
Salsa 40ms
Ported Salsa 143ms
Compiling the wasm binary with RUSTFLAGS='-C target-feature=+simd128'
appears to bring it right up to speed with JS tweetnacl.
libsodium Chacha: 178.924072265625 ms
tweetnacl Salsa bench: 37.94091796875 ms
AES 108ms
Salsa 38ms
Ported Salsa 142ms
Jacking up the round count to 10,000,000 helps remove any distortions from timing calls. Doing so shows that tweetnacl Salsa is in fact almost 2x faster. I'm curious why that is the case, though, if wasm has SIMD.
libsodium Chacha: 7127.891845703125 ms
tweetnacl Salsa bench: 2095.0009765625 ms
AES 10.49s
Salsa 3.852s
Thanks. Using your repo with 10 million rounds, I consistently see tweetnacl Salsa 5x faster than Rust native Salsa. I guess it depends on the CPU too; mine is from 2016. I tried the RUSTFLAGS you suggested and immediately saw Rust Salsa speed up; on my machine it ended up only 3.5x slower than tweetnacl Salsa.