
Performance Optimizations (including Rust/AssemblyScript -> WASM)

Open GavinRay97 opened this issue 6 years ago • 12 comments

This issue will be a work in progress while we figure out a game plan for how and where to focus efforts for performance increases.

I have experimented with benchmarking the squash() and sigmoid()/sigmoidDerivative() functions re-written in both Rust and AssemblyScript compiled to WASM.

Rust -> WASM showed very large (~4.5x) performance increases over JS on Chrome for squash() given 1,000,000 inputs. On Firefox, the performance was roughly the same. I still need to benchmark the Rust WASM on Node.

AssemblyScript -> WASM was ~2.5x as performant for squash() given 1,000,000 inputs on Node. I still need to benchmark the AssemblyScript WASM in browsers.
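For reference, the functions under test are tiny. Below is a plain TypeScript sketch of what squash()/sigmoid()/sigmoidDerivative() as named above generally look like; the library's actual signatures may differ:

```typescript
// Logistic sigmoid and its derivative: the hot functions being benchmarked.
// Names mirror the issue text; the real library API may differ.
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

function sigmoidDerivative(x: number): number {
  const s = sigmoid(x);
  return s * (1 - s); // derivative expressed in terms of the sigmoid itself
}

function squash(x: number, derivate: boolean = false): number {
  return derivate ? sigmoidDerivative(x) : sigmoid(x);
}
```

Because the per-call work is so small, call overhead and the speed of Math.exp dominate, which is why the WASM comparisons below focus on exp().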

I will create a repo for the test functions and post the source code here, along with console screenshots of the benchmark measurements, soon.

@postspectacular gave some feedback and insane performance tuning on the basic implementation I had in AssemblyScript as well:

https://gist.github.com/postspectacular/3dccbfed1b753edadf1b6fee8add4808#file-03-sigmoid-simd-ts-L1

He dropped down to low-level memory/pointers, and even to SIMD instructions (though, from his comment, it seems WASM SIMD is not quite fully supported in the V8 engine yet). So there is obviously much room to improve on these benchmarks -- I am not familiar with Rust, AssemblyScript, or low-level memory concepts in general. Perhaps other users might be able to give feedback or advice here?

The best way to approach this is probably:

  • Profile the project while running and see which functions are hottest. Applying the Pareto Principle (80/20 rule), a small handful of methods likely account for the great majority of runtime; these are where efforts should be focused.
  • Create a way to benchmark these functions with mock data in an isolated environment; testing without this will be really difficult.
  • Experiment with rewriting the most-used functions in Rust/AssemblyScript; benchmark and evaluate the results.
  • Look into tooling from existing libraries to make things easier, especially for math and matrix operations.
  • Read up on using GPU.js and WebGL to process array data as shaders. I believe Umbrella has tooling for this too.
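As a sketch of the second bullet, an isolated micro-benchmark harness with mock data might look like the following. This is plain TypeScript and the helper names (mockInputs, bench) are hypothetical, not library code:

```typescript
// Minimal isolated micro-benchmark harness with mock data.
// `mockInputs` and `bench` are hypothetical helpers, not library code.
function mockInputs(n: number): Float64Array {
  const xs = new Float64Array(n);
  for (let i = 0; i < n; i++) {
    xs[i] = Math.random() * 2_000_000 - 1_000_000; // input range used in the issue
  }
  return xs;
}

function bench(name: string, fn: (x: number) => number, xs: Float64Array): number {
  const start = Date.now();
  let acc = 0; // accumulate results so the JIT cannot dead-code-eliminate the loop
  for (let i = 0; i < xs.length; i++) acc += fn(xs[i]);
  const elapsed = Date.now() - start;
  console.log(`${name}: ${elapsed}ms (checksum ${acc})`);
  return elapsed;
}

bench("sigmoid", (x) => 1 / (1 + Math.exp(-x)), mockInputs(1_000_000));
```

The same harness shape can then be pointed at the WASM exports to get an apples-to-apples comparison.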

GavinRay97 avatar Nov 02 '19 20:11 GavinRay97

Interesting. Btw, I compared fastexp with the native Math.exp built into AssemblyScript, and the builtin seems faster on the latest Firefox & Chrome:

native wasm exp: 152.708984375ms
fastexp: 201.573974609375ms

See benchmark fiddle: https://webassembly.studio/?f=nd524yx5pt

MaxGraey avatar Nov 02 '19 21:11 MaxGraey

@MaxGraey

Huh, interesting. I get 60ms/70ms Native/fastexp on Firefox Nightly with my machine.

The original, very naive AssemblyScript code I wrote for benchmarking against JS is below (x being an array of 1,000,000 floats between -1,000,000 and 1,000,000 with a precision of 5 decimal places, and derivate being a random true/false value): [screenshot of the AssemblyScript source]

GavinRay97 avatar Nov 02 '19 21:11 GavinRay97

@GavinRay97 Also I suggest preallocating the result array:

let results = new Array<f64>(x.length);

instead of

let results: f64[] = [];
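In plain TypeScript terms (AssemblyScript's f64 maps to number), the difference between the two patterns is, as a sketch:

```typescript
// Growing a dynamic array reallocates its backing store as it fills:
function squashPush(x: number[]): number[] {
  const results: number[] = [];
  for (let i = 0; i < x.length; i++) results.push(1 / (1 + Math.exp(-x[i])));
  return results;
}

// Preallocating sizes the backing store once, up front:
function squashPrealloc(x: number[]): number[] {
  const results = new Array<number>(x.length);
  for (let i = 0; i < x.length; i++) results[i] = 1 / (1 + Math.exp(-x[i]));
  return results;
}
```

Both produce identical output; the preallocated version just avoids repeated reallocation and copying inside the hot loop.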

MaxGraey avatar Nov 02 '19 21:11 MaxGraey

Btw, performance for exp/log and pow will increase even more soon. I found new implementations which use lookup tables, but they will be supported only at the -O3 opt level due to size concerns.

MaxGraey avatar Nov 02 '19 21:11 MaxGraey

@GavinRay97 Also I suggest preallocating the result array:

let results = new Array<f64>(x.length);

instead of

let results: f64[] = [];

Definitely. Whatever AssemblyScript or Rust code I write for benchmarking will undoubtedly be poor and sub-optimal. I have limited experience writing in typed languages, and zero experience working with memory and low-level things like WASM.

I have no doubt that better equipped people (like yourself and @postspectacular) will be able to give feedback on how awful it is and provide much, much better implementations :joy:

Btw, performance for exp/log and pow will increase even more soon. I found new implementations which use lookup tables, but they will be supported only at the -O3 opt level

This is exciting! :eyes: :tada:

GavinRay97 avatar Nov 02 '19 21:11 GavinRay97

Also, plz don't use Math.pow(x, 2). Currently we can't optimize this to x * x, though we will in the future. You can absolutely safely replace x ** 2 with x * x, and this increases speed even more.
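As a sketch of the substitution (hypothetical helper functions, plain TypeScript rather than AssemblyScript):

```typescript
// Math.pow / x ** 2 goes through a generic power routine;
// x * x is a single multiply and gives the same result for squaring.
function magnitudeSquaredPow(v: number[]): number {
  let sum = 0;
  for (const x of v) sum += Math.pow(x, 2); // slower generic path
  return sum;
}

function magnitudeSquaredMul(v: number[]): number {
  let sum = 0;
  for (const x of v) sum += x * x; // numerically equivalent, cheaper
  return sum;
}
```

In a hot loop executed a million times per call, even this one-instruction difference is measurable.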

MaxGraey avatar Nov 02 '19 21:11 MaxGraey

hey @GavinRay97 @MaxGraey - someone jumped the gun here a little bit, since I posted that gist earlier in our chat without much further comment (my comments were meant to be posted later).

I'm somewhat surprised to learn that the "native" exp is faster than the approximation now, since in my earlier tests on Node v12.10 (MBP2010) the fastexp version of the large sigmoid loop with 1 million iterations was almost 2x faster than the Math.exp version of the same loop. But I also have to point out (and hoped this was clear) that the approximation was not my own impl but taken from musicdsp.org, and the point of the gist was merely to show that there are lots of possibilities the team could explore further here (incl. some of the ones Max just pointed out above :) )

I'd still recommend using raw pointers over arrays for these hottest/tightest loops, if only to avoid bounds checking and any form of indirection in those places. But again, this is my intuition from working w/ C, and Max will be much better equipped to give perf tips for AssemblyScript!
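AssemblyScript's raw-pointer builtins (load<f64>/store<f64>) have no direct JS equivalent, but as a hedged sketch, the closest plain TypeScript analogue is a flat Float64Array processed in place, which at least removes the boxing and indirection of a generic array:

```typescript
// Flat typed buffer instead of a generic array: contiguous, unboxed f64 storage.
// (In AssemblyScript one would go further, using load<f64>/store<f64> on raw pointers.)
function sigmoidInPlace(xs: Float64Array): void {
  for (let i = 0; i < xs.length; i++) {
    xs[i] = 1 / (1 + Math.exp(-xs[i])); // overwrite in place, no per-iteration allocation
  }
}

const data = Float64Array.from([-1, 0, 1]);
sigmoidInPlace(data);
```

The in-place style also avoids allocating a fresh results array per call, which matters when the function is invoked once per training step.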

postspectacular avatar Nov 02 '19 21:11 postspectacular

@postspectacular

My apologies if you had not intended for that to be linked. The past few days I have done a lot of experimenting with performance and benchmarking, and I wanted to start collecting all of my notes, debug logs, code snippets, and screenshots in one place so as not to forget anything.

Did not expect to get such immediate feedback :sweat_smile:

More than anything I just wanted to make sure to have saved + catalogued what you mentioned so that it did not slip my mind later.

The idea being that @christianechevarria and @luiscarbonell could look over everything and develop a concrete plan or follow-up questions.

I can remove the link to the gist.

GavinRay97 avatar Nov 02 '19 22:11 GavinRay97

@postspectacular fastexp can be faster on the [-1, 1] range:

native wasm exp: 306.030029296875ms
fastexp: 236.511962890625ms

https://webassembly.studio/?f=iwya4jby1wr
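For context, one well-known family of fast exp approximations works by writing an approximated bit pattern directly into a double. The sketch below is Schraudolph's trick in plain TypeScript, not necessarily the musicdsp.org variant used in the gist; it assumes little-endian doubles (true on x86/ARM) and trades roughly 2-4% relative error for speed:

```typescript
// Schraudolph-style fast exp: write an approximated IEEE-754 bit pattern
// directly into the high 32 bits of a double. ~2-4% relative error.
// Assumes little-endian doubles (true on x86/ARM).
const buf = new ArrayBuffer(8);
const f64 = new Float64Array(buf);
const i32 = new Int32Array(buf);

const EXP_A = 1048576 / Math.LN2; // 2^20 / ln(2): scales x into exponent units
const EXP_B = 1072693248 - 60801; // exponent bias for 1.0, minus an error-centering term

function fastExp(x: number): number {
  i32[0] = 0;                       // low word: zeroed mantissa bits
  i32[1] = (EXP_A * x + EXP_B) | 0; // high word: sign/exponent/top mantissa
  return f64[0];
}
```

Whether such an approximation pays off depends on the engine's native Math.exp, which is exactly what the benchmarks in this thread are probing.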

MaxGraey avatar Nov 02 '19 22:11 MaxGraey

@GavinRay97 - no worries, i just felt like I needed to contextualize these snippets some more... all good!

@MaxGraey - thanks for reminding me of wasmstudio :) - I've uploaded some of my code from that gist there too, and the results are still similar to what I found in Node earlier (and contradictory to yours). The only difference between the two benchmarked fns is their version of exp():

https://webassembly.studio/?f=kg1rea6qvbn

(@GavinRay97 et al - once the project sandbox has initialized, first open the browser console, then press "build & run"; timing info will be shown in the console only... the tests take ~12 sec altogether)

The tests are currently set up to work on normalized input data, executing 500 iterations on 1 million values each. The JS glue code is in /src/main.js; the AssemblyScript is in /assembly/main.ts.

On my laptop (MBP2015, Chrome 78) I'm getting:

bench("sigmoidApproxPtr"); // 811.406005859375ms (100 iters) // 4046.489013671875ms (500 iters)

bench("sigmoidNatPtr"); // 1504.90380859375ms (100 iters) // 7569.575927734375ms (500 iters)

I'm getting consistent results after multiple runs... food for thought, I think! And please, this is NO upstaging attempt whatsoever!

postspectacular avatar Nov 03 '19 00:11 postspectacular

@MaxGraey as mentioned above, I did actually use normalized values for this, since I assumed the values for this ML example (an activation function) would be in that range, but also because the approximation has the least error in that interval (reference).

postspectacular avatar Nov 03 '19 00:11 postspectacular

yeah, if you're confident that the argument always stays in [-1, 1] bounds, you can use the faster approximation

MaxGraey avatar Nov 03 '19 00:11 MaxGraey