wasmboy
wasmboy copied to clipboard
Channel3.getSample computes non-integer typed array index, leading to recompilation loop in Firefox
https://bugzilla.mozilla.org/show_bug.cgi?id=1515620 was filed yesterday against SpiderMonkey, based on the benchmark results at https://medium.com/@torch2424/webassembly-is-fast-a-real-world-benchmark-of-webassembly-vs-es6-d85a23f8e193. When investigating why SpiderMonkey was so bad at this particular benchmark, it turned out that the abysmal performance is caused by a recompilation loop in SpiderMonkey's optimizing compiler when inlining typed array accesses. This load
function
https://github.com/torch2424/wasmBoy/blob/693b160ecd9ae084b4459a08a930b2ef9108b255/core/portable/wasmMock.js#L18-L20
is currently called with non-integer inputs, like 59853.5
, where the fractional part .5
is caused by this division (*):
https://github.com/torch2424/wasmBoy/blob/693b160ecd9ae084b4459a08a930b2ef9108b255/core/sound/channel3.ts#L185
While the recompilation loop is certainly an issue which should be fixed in SpiderMonkey, the miscomputed typed array index looks like a bug in wasmBoy to me, too.
(*) I haven't checked if there are additional callers to load
which also pass in non-integer indices.
Oh wow! Super rad the article / project reached you guys at Mozilla! ๐
Ahhh! So to be honest, the Typescript compiled version did have an issue with Sound, and I knew it was because I wasn't sanitizing numbers (I think is the right term?) correctly, but the JS version wasn't meant to be playable as a user, just run the games so I figured it would be fine for now.
But, I'll definitely go through and fix this! I'll also be sure to add this issue to the article! I knew something was funny when I saw Firefox performance, and I was careful not too call out any engine being "better" because of stuff like this.
Thanks for the bug! I'll try to fix this and I'll update the article now. Thanks!
Edit: article updated in multiple places to link the bug, and left an "Edit log" at the bottom of the article ๐
Not sure if this is a 100% correct fix, but coercing the load/store offsets to ints significantly improve the benchmark results in Firefox version 64 for me (running the tobu tobu girl game).
--- index.iife.js.orig 2018-12-21 10:03:22.000000000 +0000
+++ index.iife.js 2018-12-21 19:46:58.428069475 +0000
@@ -1735,11 +1735,11 @@
};
const load = offset => {
- return wasmByteMemory[offset];
+ return wasmByteMemory[offset|0];
};
const store = (offset, value) => {
- wasmByteMemory[offset] = value;
+ wasmByteMemory[offset|0] = value;
};
const abs = value => {
vs
@cosinusoidally Yeah totally! Thanks for looking into this! Currently, I'm not at my home laptop, so I'll probably open up a PR later tonight to do all the coercing and stuff using the "portable" functions I have to handle all of the type differences between running with AssemblyScript and running with JS.
@torch2424 well all credit to @anba for tracking down the issue. I just spotted his comment here https://bugzilla.mozilla.org/show_bug.cgi?id=1515620#c3 and thought "hmm, I wonder if I can just tweak the generated JS to avoid this performance cliff"
Btw, the recommended portable way of writing this with AS is
let positionIndexToAdd = i32(Channel3.waveTablePosition / 2);
which is a nop in WASM and a |0
in JS. Hope this doesn't result in any issues with the i32
function though :)
So I went ahead and opened #218 Which as you can see definitely fixes the FireFox JS performance! But it still seems like something is off after we closure compile it.
In the meantime, I'll go ahead and push up the fix for anyone who happens to run the benchmark. ๐
So I went ahead and opened #218 Which as you can see definitely fixes the FireFox JS performance!
\o/
But it still seems like something is off after we closure compile it.
Hmm, that's strange. The Closure-compiled version is definitely faster than the non-compiled one for me. (Tested in under Windows and Linux, with Release/Beta/Nightly versions of Firefox.)
I did a bit of additional profiling (with the Firefox profiler addon) to see if there are any other easy optimization targets for the JS/TypeScript version. And the profile showed that averageFpsFromTimes
takes a considerable amount of time (about 14.25% of the complete runtime is spent there!), which is mostly caused by stringifying numbers in parseFloat
โ This will be fixed in https://bugzilla.mozilla.org/show_bug.cgi?id=1517235. (Even with that fix averageFpsFromTimes
is still kind of slow and takes ~6.25% of the complete runtime, because it's allocating and extending arrays constantly, but I guess that's more of an issue in the "stats-lite" package and not this project.)
The profile also showed that NumberMod
(the SpiderMonkey implementation function of the %
operator when called with double values) or more specifically Libc's fmod
takes about ~4% of the complete runtime. That struck me as odd, especially when I found where the modulo operator was applied:
https://github.com/torch2424/wasmBoy/blob/806cd8236e16f4a8bbb51c158218ec75a8a0c8de/core/graphics/backgroundWindow.ts#L214
In the TypeScript code pixelXPositionInMap
is typed as i32
, but somehow the optimizing compiler was using a double
here. With a bit of debugging I found out that the double value is generated here
https://github.com/torch2424/wasmBoy/blob/806cd8236e16f4a8bbb51c158218ec75a8a0c8de/core/graphics/backgroundWindow.ts#L63-L64
At one point during the benchmark run, windowX
is 0
and -1 * 0
is... drumroll :drum: -0
(negative zero), so a double value and not an int32. From that point on all computations with windowX
or any derived values need to typed as doubles in a JavaScript engine.
So, I could be interesting to see if properly converting the value back to an int32 has any effects in Safari or Chrome, because when I ran a simple standalone version of that code in the shell, the negative effects of using double instead of int32 were even worse in JavaScriptCore and V8.
I guess these two results were the most interesting ones from the profile. Oh, and there also seems to be an off-by-one error in https://github.com/torch2424/wasmBoy/blob/806cd8236e16f4a8bbb51c158218ec75a8a0c8de/demo/benchmark/loadrom.js#L60
I noticed when looking if there are any frequent bail-outs from the optimizing compiler (which I didn't see).
for (let i = 0; i <= coreObject.core.byteMemory.length; i++) {
coreObject.core.byteMemory[i] = 0;
}
better replace to:
coreObject.core.byteMemory.fill(0);
and yaeh <=
should be <
@anba
Hmm, that's strange. The Closure-compiled version is definitely faster than the non-compiled one for me. (Tested in under Windows and Linux, with Release/Beta/Nightly versions of Firefox.)
Huh, that is strange! I ran the benchmark, but maybe forgot to recompile the closure version that night? I'll try again once I get the chance ๐
I did a bit of additional profiling (with the Firefox profiler addon) to see if there are any other easy optimization targets for the JS/TypeScript version.
Wow! That's such a deep dive thank you so much! I've been fighting with performance on the emulator itself for a while, so I'm super stoked to see this ๐ I'll definitely go ahead and implement all of these things once I get the chance. Thank you again! I'm sure that took a lot of digging to find, and I very much appreciate it!
@MaxGraey
Thanks for the tip! I'll definitely go through and do a bunch of refactoring to use .fill
๐ Thank you for that!
@anba My apologies for falling so far behind on this. If you noticed I've had this issue pinned all this time, and left a lot of references to it in the article.
Just a quick status update,
Regarding the closure compiler thing, yeah it was definitely my fault haha! ๐
And I'm currently taking a break from my roadmap: #197 , so I'll try and go through and implement the lingering issues here ๐
Thanks again for all the help, it is very much appreciated!
So everything here is implemented now. Going to keep this open a little while longer in case any more feedback. And also, as a reminder to update the article.