wasmboy icon indicating copy to clipboard operation
wasmboy copied to clipboard

Channel3.getSample computes non-integer typed array index, leading to recompilation loop in Firefox

Open anba opened this issue 5 years ago โ€ข 11 comments

https://bugzilla.mozilla.org/show_bug.cgi?id=1515620 was filed yesterday against SpiderMonkey, based on the benchmark results at https://medium.com/@torch2424/webassembly-is-fast-a-real-world-benchmark-of-webassembly-vs-es6-d85a23f8e193. When investigating why SpiderMonkey was so bad at this particular benchmark, it turned out that the abysmal performance is caused by a recompilation loop in SpiderMonkey's optimizing compiler when inlining typed array accesses. This load function https://github.com/torch2424/wasmBoy/blob/693b160ecd9ae084b4459a08a930b2ef9108b255/core/portable/wasmMock.js#L18-L20

is currently called with non-integer inputs, like 59853.5, where the fractional part .5 is caused by this division (*): https://github.com/torch2424/wasmBoy/blob/693b160ecd9ae084b4459a08a930b2ef9108b255/core/sound/channel3.ts#L185

While the recompilation loop is certainly an issue which should be fixed in SpiderMonkey, the miscomputed typed array index looks like a bug in wasmBoy to me, too.

(*) I haven't checked if there are additional callers to load which also pass in non-integer indices.

anba avatar Dec 21 '18 14:12 anba

Oh wow! Super rad the article / project reached you guys at Mozilla! ๐Ÿ˜„

Ahhh! So to be honest, the Typescript compiled version did have an issue with Sound, and I knew it was because I wasn't sanitizing numbers (I think is the right term?) correctly, but the JS version wasn't meant to be playable as a user, just run the games so I figured it would be fine for now.

But, I'll definitely go through and fix this! I'll also be sure to add this issue to the article! I knew something was funny when I saw Firefox performance, and I was careful not too call out any engine being "better" because of stuff like this.

Thanks for the bug! I'll try to fix this and I'll update the article now. Thanks!

Edit: article updated in multiple places to link the bug, and left an "Edit log" at the bottom of the article ๐Ÿ˜„

torch2424 avatar Dec 21 '18 19:12 torch2424

Not sure if this is a 100% correct fix, but coercing the load/store offsets to ints significantly improve the benchmark results in Firefox version 64 for me (running the tobu tobu girl game).

--- index.iife.js.orig  2018-12-21 10:03:22.000000000 +0000
+++ index.iife.js       2018-12-21 19:46:58.428069475 +0000
@@ -1735,11 +1735,11 @@
   };
 
   const load = offset => {
-    return wasmByteMemory[offset];
+    return wasmByteMemory[offset|0];
   };
 
   const store = (offset, value) => {
-    wasmByteMemory[offset] = value;
+    wasmByteMemory[offset|0] = value;
   };
 
   const abs = value => {

firefox-64-slow vs firefox-64-fast

cosinusoidally avatar Dec 21 '18 20:12 cosinusoidally

@cosinusoidally Yeah totally! Thanks for looking into this! Currently, I'm not at my home laptop, so I'll probably open up a PR later tonight to do all the coercing and stuff using the "portable" functions I have to handle all of the type differences between running with AssemblyScript and running with JS.

torch2424 avatar Dec 21 '18 20:12 torch2424

@torch2424 well all credit to @anba for tracking down the issue. I just spotted his comment here https://bugzilla.mozilla.org/show_bug.cgi?id=1515620#c3 and thought "hmm, I wonder if I can just tweak the generated JS to avoid this performance cliff"

cosinusoidally avatar Dec 21 '18 21:12 cosinusoidally

Btw, the recommended portable way of writing this with AS is

let positionIndexToAdd = i32(Channel3.waveTablePosition / 2); 

which is a nop in WASM and a |0 in JS. Hope this doesn't result in any issues with the i32 function though :)

dcodeIO avatar Dec 21 '18 22:12 dcodeIO

So I went ahead and opened #218 Which as you can see definitely fixes the FireFox JS performance! But it still seems like something is off after we closure compile it.

In the meantime, I'll go ahead and push up the fix for anyone who happens to run the benchmark. ๐Ÿ˜„

torch2424 avatar Dec 22 '18 07:12 torch2424

So I went ahead and opened #218 Which as you can see definitely fixes the FireFox JS performance!

\o/

But it still seems like something is off after we closure compile it.

Hmm, that's strange. The Closure-compiled version is definitely faster than the non-compiled one for me. (Tested in under Windows and Linux, with Release/Beta/Nightly versions of Firefox.)


I did a bit of additional profiling (with the Firefox profiler addon) to see if there are any other easy optimization targets for the JS/TypeScript version. And the profile showed that averageFpsFromTimes takes a considerable amount of time (about 14.25% of the complete runtime is spent there!), which is mostly caused by stringifying numbers in parseFloat โ†’ This will be fixed in https://bugzilla.mozilla.org/show_bug.cgi?id=1517235. (Even with that fix averageFpsFromTimes is still kind of slow and takes ~6.25% of the complete runtime, because it's allocating and extending arrays constantly, but I guess that's more of an issue in the "stats-lite" package and not this project.)

The profile also showed that NumberMod (the SpiderMonkey implementation function of the % operator when called with double values) or more specifically Libc's fmod takes about ~4% of the complete runtime. That struck me as odd, especially when I found where the modulo operator was applied: https://github.com/torch2424/wasmBoy/blob/806cd8236e16f4a8bbb51c158218ec75a8a0c8de/core/graphics/backgroundWindow.ts#L214

In the TypeScript code pixelXPositionInMap is typed as i32, but somehow the optimizing compiler was using a double here. With a bit of debugging I found out that the double value is generated here https://github.com/torch2424/wasmBoy/blob/806cd8236e16f4a8bbb51c158218ec75a8a0c8de/core/graphics/backgroundWindow.ts#L63-L64

At one point during the benchmark run, windowX is 0 and -1 * 0 is... drumroll :drum: -0 (negative zero), so a double value and not an int32. From that point on all computations with windowX or any derived values need to typed as doubles in a JavaScript engine.

So, I could be interesting to see if properly converting the value back to an int32 has any effects in Safari or Chrome, because when I ran a simple standalone version of that code in the shell, the negative effects of using double instead of int32 were even worse in JavaScriptCore and V8.

I guess these two results were the most interesting ones from the profile. Oh, and there also seems to be an off-by-one error in https://github.com/torch2424/wasmBoy/blob/806cd8236e16f4a8bbb51c158218ec75a8a0c8de/demo/benchmark/loadrom.js#L60

I noticed when looking if there are any frequent bail-outs from the optimizing compiler (which I didn't see).

anba avatar Jan 02 '19 22:01 anba

for (let i = 0; i <= coreObject.core.byteMemory.length; i++) {
   coreObject.core.byteMemory[i] = 0;
}

better replace to:

coreObject.core.byteMemory.fill(0);

and yaeh <= should be <

MaxGraey avatar Jan 03 '19 00:01 MaxGraey

@anba

Hmm, that's strange. The Closure-compiled version is definitely faster than the non-compiled one for me. (Tested in under Windows and Linux, with Release/Beta/Nightly versions of Firefox.)

Huh, that is strange! I ran the benchmark, but maybe forgot to recompile the closure version that night? I'll try again once I get the chance ๐Ÿ˜„

I did a bit of additional profiling (with the Firefox profiler addon) to see if there are any other easy optimization targets for the JS/TypeScript version.

Wow! That's such a deep dive thank you so much! I've been fighting with performance on the emulator itself for a while, so I'm super stoked to see this ๐ŸŽ‰ I'll definitely go ahead and implement all of these things once I get the chance. Thank you again! I'm sure that took a lot of digging to find, and I very much appreciate it!

@MaxGraey

Thanks for the tip! I'll definitely go through and do a bunch of refactoring to use .fill ๐Ÿ™ƒ Thank you for that!

torch2424 avatar Jan 03 '19 01:01 torch2424

@anba My apologies for falling so far behind on this. If you noticed I've had this issue pinned all this time, and left a lot of references to it in the article.

Just a quick status update,

Regarding the closure compiler thing, yeah it was definitely my fault haha! ๐Ÿ™ƒ

And I'm currently taking a break from my roadmap: #197 , so I'll try and go through and implement the lingering issues here ๐Ÿ˜„

Thanks again for all the help, it is very much appreciated!

torch2424 avatar Feb 21 '19 06:02 torch2424

So everything here is implemented now. Going to keep this open a little while longer in case any more feedback. And also, as a reminder to update the article.

torch2424 avatar Feb 22 '19 07:02 torch2424