tlsn
Benchmark block ciphers without hardware acceleration
Let's benchmark Salsa vs AES (or any other candidate ciphers) executed in WASM to see which performs better when HW acceleration is not available.
AES implementation: https://github.com/RustCrypto/block-ciphers/blob/master/aes/src/soft.rs
Ran the tests with this gist https://gist.github.com/themighty1/9feb7ad9d3938cc2b06b0e30286a9c55
The numbers are: wasm AES is 2x faster than JS Salsa; wasm Salsa is 6.5x faster than wasm AES (and consequently 13x faster than JS Salsa).
I noticed that when I load the html in the browser, I get the stats mentioned above. However, when invoking main() from the browser console after the page has loaded, I get ~4x worse speeds for Salsa and ~6x worse speeds for AES. Go figure.
I loaded the wasm inside a Chrome extension as a sanity check; I get the same results in the extension as when I reload the html page in the browser: AES bench - 750 ms, Salsa bench - 150 ms. With JS Salsa, 100K invocations take 1000 ms.
We are looking at a 6x speed-up of wasm Salsa vs JS Salsa (not 13x as I initially thought). Actually, the initial 13x number came about because I was benching in Firefox. Turns out that in FF wasm Salsa is 2x faster than in Chrome.
I discovered that for some reason JS Salsa is faster than wasm Salsa, so I thought maybe we could do gate-by-gate garbling in wasm but call out to JS for the Salsa operations, so I put together this bench.
I had this chunk of code running inside a Chrome extension, which imitates garbling a 6400-AND-gate AES circuit:
console.time('bench');
// fixedKey and Salsa20 are assumed to be in scope (e.g. tweetnacl's Salsa20)
let temp1 = crypto.getRandomValues(new Uint8Array(16));
let temp2 = temp1.slice();
let temp3 = temp1.slice();
let temp4 = temp1.slice();
for (let i = 0; i < 6400; i++) {
  call_wasm();
  temp1 = Salsa20(fixedKey, temp1);
  temp2 = Salsa20(fixedKey, temp2);
  temp3 = Salsa20(fixedKey, temp3);
  temp4 = Salsa20(fixedKey, temp4);
}
console.timeEnd('bench');
console.log(temp1, temp2, temp3, temp4);
call_wasm() calls a wasm function with no arguments and returns nothing (in real life we will pass args as memory pointers to avoid overhead). I got:
main.js:48 bench: 30.68408203125 ms
main.js:48 bench: 19.382080078125 ms
main.js:48 bench: 11.636962890625 ms
main.js:48 bench: 11.719970703125 ms
main.js:48 bench: 11.31787109375 ms
main.js:48 bench: 12.278076171875 ms
main.js:48 bench: 11.51416015625 ms
main.js:48 bench: 12.4267578125 ms
main.js:48 bench: 12.6298828125 ms
main.js:48 bench: 11.842041015625 ms
main.js:48 bench: 12.01611328125 ms
Seems like the first 2 invocations are doing some kind of warm-up, but after that we get a consistent ~12 ms per garbling. Back when I was benching PageSigner, I was always getting ~200 ms for garbling 1 AES circuit. So, in JS the main overhead comes from reading/writing the array of gates. It remains to be seen how much of that overhead a Rust implementation will save.
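For anyone reproducing this without the extension setup: Salsa20 and fixedKey above come from tweetnacl. A rough self-contained sketch of the Salsa20/20 core (for illustration only; this is not tweetnacl's actual code, and key/nonce expansion into the 16-word state is omitted):

```javascript
// Minimal Salsa20/20 core over a 16-word (64-byte) state.
function rotl(x, n) {
  return ((x << n) | (x >>> (32 - n))) >>> 0;
}

// One quarter-round applied in place to words a, b, c, d of x.
function quarterRound(x, a, b, c, d) {
  x[b] = (x[b] ^ rotl((x[a] + x[d]) >>> 0, 7)) >>> 0;
  x[c] = (x[c] ^ rotl((x[b] + x[a]) >>> 0, 9)) >>> 0;
  x[d] = (x[d] ^ rotl((x[c] + x[b]) >>> 0, 13)) >>> 0;
  x[a] = (x[a] ^ rotl((x[d] + x[c]) >>> 0, 18)) >>> 0;
}

// input: Uint32Array(16) holding key/nonce/constants; returns Uint32Array(16).
function salsa20Core(input) {
  const x = Uint32Array.from(input);
  for (let i = 0; i < 10; i++) { // 10 double rounds = 20 rounds
    // column round
    quarterRound(x, 0, 4, 8, 12);
    quarterRound(x, 5, 9, 13, 1);
    quarterRound(x, 10, 14, 2, 6);
    quarterRound(x, 15, 3, 7, 11);
    // row round
    quarterRound(x, 0, 1, 2, 3);
    quarterRound(x, 5, 6, 7, 4);
    quarterRound(x, 10, 11, 8, 9);
    quarterRound(x, 15, 12, 13, 14);
  }
  // final feed-forward: add the original input words back in
  for (let i = 0; i < 16; i++) x[i] = (x[i] + input[i]) >>> 0;
  return x;
}
```

All the work is 32-bit adds, xors, and rotates, which is why a tight JS implementation on a modern JIT can keep up with non-SIMD wasm.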
Would you mind pushing these benchmarks into a repo I can clone? I'm interested in tinkering with this as well.
Sure. I'm seeing some non-deterministic behaviour with my benches; I'll iron out the bugs and publish the tests.
Here it is https://github.com/themighty1/salsabench Just follow the README
Notice that when reloading the page you'll get a certain "tweetnacl Salsa bench" number. This number will usually be 2x larger than the actual number I was getting in the Chrome extension after a 1-2 invocation warm-up (I guess the warm-up primes the caches).
Either way, even without the warm-up, JS Salsa is the clear winner.
I made a separate benchmark of the cost of shared-memory IPC between JS<>wasm. The result on Chrome is +30% to the execution time, which is still tolerable and leaves JS Salsa the winner.
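The IPC being measured is essentially copying blocks into the module's linear memory and reading results back. A minimal sketch of that pattern (no actual wasm module here, just the JS side of the copies; the offsets and the process_block export are hypothetical):

```javascript
// Share data with wasm via linear memory instead of function arguments:
// JS writes input bytes into the module's memory, the wasm side reads and
// writes in place, then JS reads the result back out.
const memory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB
const heap = new Uint8Array(memory.buffer);

const INPUT_PTR = 0;   // hypothetical offsets agreed with the wasm side
const OUTPUT_PTR = 64;

function writeInput(block) {
  heap.set(block, INPUT_PTR);
}

function readOutput(len) {
  // slice() copies out, so the result survives later overwrites of the heap
  return heap.slice(OUTPUT_PTR, OUTPUT_PTR + len);
}

// per garbled gate: write a 16-byte block, call into wasm, read back
const block = new Uint8Array(16).fill(7);
writeInput(block);
// wasmInstance.exports.process_block(INPUT_PTR, OUTPUT_PTR); // real call
heap.copyWithin(OUTPUT_PTR, INPUT_PTR, INPUT_PTR + 16);       // stand-in for wasm
const result = readOutput(16);
```

The +30% above is the cost of exactly these set/slice copies repeated per block, which is why batching them matters.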
Here tweetnacl's author says that his JS lib is faster than wasm: https://github.com/dchest/tweetnacl-js/issues/141 twitter link: "- Salsa20 compiled to WebAssembly is not faster than JS until it gets SIMD."
Updated my salsabench repo with a bench of chacha20 from libsodium.js. It is 10x slower than tweetnacl's salsa20.
This makes sense, I wasn't aware that JS was able to take advantage of hardware acceleration as well as it does. So in browser contexts, calling out to JS for crypto operations will be the way to go. We can work to vectorize as many operations as possible to minimize communication overhead between runtimes.
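One way to "vectorize" the boundary is to batch: instead of one JS<>wasm crossing per gate, pack many 16-byte blocks into one contiguous buffer and hand the whole batch to the crypto side in a single call. A sketch of the idea (cipherOneBlock is a placeholder transform standing in for a real per-block cipher call):

```javascript
// Placeholder for a real per-block cipher (e.g. a Salsa20 call).
function cipherOneBlock(block) {
  return block.map(b => (b ^ 0x5c) & 0xff); // toy transform, illustration only
}

// One call processes `count` contiguous blocks in place, so the runtime
// boundary is crossed once per batch rather than once per block.
function cipherBatch(buf, blockLen, count) {
  for (let i = 0; i < count; i++) {
    const block = buf.subarray(i * blockLen, (i + 1) * blockLen);
    block.set(cipherOneBlock(block));
  }
}

const BLOCK = 16, GATES = 6400;
const batch = new Uint8Array(BLOCK * GATES); // all gates' inputs, packed
cipherBatch(batch, BLOCK, GATES);            // one crossing instead of 6400
```

The per-call overhead is then amortized over the whole circuit rather than paid per gate.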
According to https://webassembly.org/roadmap/ the major browsers have implemented 128-bit SIMD for wasm. I'm not sure if the Rust libraries take advantage of it, but the results below suggest that they do.
I forked your bench and added a rust library for Instant support in wasm for easier timings. Running it on my end shows wasm to be only slightly slower than the JS libraries:
libsodium Chacha: 178.864013671875 ms
tweetnacl Salsa bench: 36.630859375 ms
AES 150ms
Salsa 40ms
Ported Salsa 143ms
Compiling the wasm binary with RUSTFLAGS='-C target-feature=+simd128'
appears to bring it right up to speed with JS tweetnacl.
libsodium Chacha: 178.924072265625 ms
tweetnacl Salsa bench: 37.94091796875 ms
AES 108ms
Salsa 38ms
Ported Salsa 142ms
Jacking up the round count to 10,000,000 helps remove any distortions from timing calls. Doing so shows that tweetnacl Salsa is in fact almost 2x faster. I'm curious why that is the case, though, if wasm has SIMD.
libsodium Chacha: 7127.891845703125 ms
tweetnacl Salsa bench: 2095.0009765625 ms
AES 10.49s
Salsa 3.852s
Thanks. Using your repo with 10 million rounds, I consistently see tweetnacl Salsa 5x faster than Rust native Salsa. I guess it depends on the CPU too; mine is from 2016. I tried the RUSTFLAGS you suggested and immediately saw Rust Salsa speed up; on my machine it ended up only 3.5x slower than tweetnacl Salsa.