ChakraCore
[Performance] For-of vs For non-cached vs For-cached
After running a small benchmark on Windows 10 with the x64_debug config, I've got these results:
13996ms (for-of), 90ms (for-non-cached), 63ms (for-cached) (averaged over 10 consecutive runs).
For-of is slower than for-non-cached by 15551% (roughly 155x),
and for-of is slower than for-cached by 22215% (roughly 222x)!
Bench code:
{
const a = Array.from({ length: 1e7 }, _ => 0);
let start = (console.log('Start'), Date.now());
for (const _ of a);
console.log(Date.now() - start + 'ms (for-of)');
start = Date.now();
for (let i = 0; i < a.length; i++) { const _ = a[i]; }
console.log(Date.now() - start + 'ms (for-non-cached)');
start = Date.now();
const length = a.length;
for (let i = 0; i < length; i++) { const _ = a[i]; }
console.log(Date.now() - start + 'ms (for-cached)');
}
I can propose some optimizations for the for-of loop (a user-land sketch of the fast-path check follows the list):
- Array (ArrayIterator):
  1.1. ArrayIterator's builtin functions remain untouched.
  1.2. Cache the length property if (the array is frozen || the array's length is not mutated) && neither its own nor Array.prototype's length property has been changed && the array is not a proxy; otherwise just keep referencing the length property as expected.
  1.3. Do the same as in the for-non-cached or for-cached cases, skipping all the iterator machinery.
- String (StringIterator):
  2.1. StringIterator's builtin functions remain untouched.
  2.2. Cache the length property if the string's object wrapper is not a proxy; otherwise just keep referencing the length property as expected.
  2.3. Same as 1.3.
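Here's a minimal user-land sketch of the Array fast-path check being proposed; forOfFast and pristineNext are hypothetical names, the real check would live inside the engine rather than in script, and script also can't detect proxies or length mutation as 1.2 requires:
// Illustrative only: an engine-side check approximated in user-land JS.
// pristineNext captures the untouched %ArrayIteratorPrototype%.next.
const arrayIterProto = Object.getPrototypeOf([][Symbol.iterator]());
const pristineNext = arrayIterProto.next;
function forOfFast(a, body) {
  const usesDefaultIterator =
    a[Symbol.iterator] === Array.prototype[Symbol.iterator] &&
    arrayIterProto.next === pristineNext;
  if (usesDefaultIterator) {
    // 1.2/1.3: cache length and skip the iterator machinery entirely
    const length = a.length;
    for (let i = 0; i < length; i++) body(a[i]);
  } else {
    // Fall back to the fully spec-compliant for-of path.
    for (const x of a) body(x);
  }
}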
There are a few things to think about here:
- Benchmarking this way, the loop control is almost certainly never jitted, so you're mostly looking at the interpreter's speed; for-of does get better in the JIT (though it doesn't catch up with the others).
- Length caching is sort of done in the jit already.
- Roughly what we're talking about here is introducing an operation at the start of a for...of loop that checks whether the iterator is the "standard" iterator and, if it is, does something different. This could work, particularly since per spec, at the start of a for...of you access the iterator once and then cache the next method, so it doesn't get accessed again and can't be overwritten in the loop. The downside is that we'd potentially have to emit the bytecode of the loop body twice.
Basically I think we'd have to do something like this:
1. check iterator type; if it is "standard", jump to LABEL B
2. do setup for normal for...of
3. normal for...of loop control; when finished, jump to LABEL C
4. loop body - bytecode for everything inside the loop
5. jump to step 3 (normal loop control)
LABEL B:
6. do setup for optimised loop control
7. optimised loop control
8. loop body
9. jump to step 7 (optimised loop control)
LABEL C:
If the loop body has a lot of content, this could mean a large increase in bytecode size - which may be non-optimal for some use cases.
Another key part of the performance difference is that a for...of has an embedded try...catch...finally statement (to trigger iterator.return() in certain circumstances) - obviously unneeded when the iterator is the default.
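For reference, here's a rough user-land desugaring of what the spec requires of a for...of (illustrative only, not ChakraCore's actual bytecode; edge cases such as break/return also closing the iterator are simplified):
// Rough desugaring of: for (const x of a) { /* body */ }
const iterator = a[Symbol.iterator]();
const next = iterator.next; // fetched once and cached, per spec
let step;
while (!(step = next.call(iterator)).done) {
  const x = step.value;
  try {
    // loop body goes here
  } catch (e) {
    // An abrupt completion in the body triggers iterator.return(),
    // with any error from return() itself being swallowed.
    try { if (iterator.return) iterator.return.call(iterator); } catch (_) {}
    throw e;
  }
}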
I know that the previous results are from interpreted mode, but here are the results when the same code is run in dynapogo mode (= with full JIT?):
Start
45390ms (for-of)
12ms (for-cached)
14ms (for-non-cached)
It's pretty bizarre to see the for-of result when you compare it with the result from interpreted mode (a 3x increase in the mode that is meant to be more performant).
Make the code a function and call it - the jit doesn't always work so well on global code; or more specifically, global code not inside a loop isn't jitted at all.
Will an IIFE work, or do I need a named function to do that?
If you're using dynapogo flags you can just put everything inside one function, though if you're trying to do it in one run (and not using profile info) you probably want to use -mic:1 -off:simplejit -bgjit- as the flags AND call the function twice, like:
function test() {
// test code
}
test();
test();
That should give you 1 interpreted run and 1 fully jitted run.
Explanation of flags:
-mic:1 - run each function in the interpreter a maximum of 1 time before trying to jit it
-off:simplejit - skip the intermediate/simple version of the jit
-bgjit- - run the jit on the main thread (so execution pauses while jitting); otherwise the jit runs on a background thread
I still get a 3-4x increase in dynapogo mode using the flags.
Code:
{
const a = Array.from({ length: 1e6 }, _ => 0);
function test(a) {
let start = (console.log('Start'), Date.now());
for (const _ of a);
console.log(Date.now() - start + 'ms (for-of)');
start = Date.now();
const length = a.length;
for (let i = 0; i < length; i++) { const _ = a[i]; }
console.log(Date.now() - start + 'ms (for-cached)');
start = Date.now();
for (let i = 0; i < a.length; i++) { const _ = a[i]; }
console.log(Date.now() - start + 'ms (for-non-cached)');
}
if (Array.isArray(a))
test(a);
}
Dynapogo-mode PowerShell script with which I launch the test code:
rm.exe -rf .\profile.dpl
.\ChakraCore\Build\VcBuild\bin\x64_debug\ch.exe .\test.js -maxInterpretCount:1 -maxSimpleJitRunCount:1 -bgjit- -dynamicprofilecache:profile.dpl -collectGarbage > $null
.\ChakraCore\Build\VcBuild\bin\x64_debug\ch.exe .\test.js -WERExceptionSupport -ExtendedErrorStackForTestHost -BaselineMode -forceNative -off:simpleJit -bgJitDelay:0 -dynamicprofileinput:profile.dpl -collectGarbage
There are 3 ways of building CC: Debug, Test and Release. Debug and Test both have the ability to use flags, Release does not. Debug contains lots of extra internal checking so runs significantly slower than the other two. Test and Release have similar performance profiles.
The upshot is that you probably need to use a test build, not a debug build, to get the most useful results here.
Note: testing this on macOS with the new WScript monotonicNow method and using flags to jit etc., I get these results:
RUN ONE - interpreted
testForOf time: 586.1193810105324
testForNonCached time: 10.010149002075195
testForCached time: 11.063928008079528
RUN TWO - fullJit
testForOf time: 153.81329196691513
testForNonCached time: 8.697234034538269
testForCached time: 8.02989399433136
I still get a 2.5x increase in dynapogo mode, tested on an x64_test build:
======================Interpreted======================
754.0545997619629ms (for-of)
48.02579975128174ms (for-cached)
78.80779981613159ms (for-non-cached)
=======================Dynapogo=======================
1784.9175000190735ms (for-of)
12.22700023651123ms (for-cached)
13.011600017547607ms (for-non-cached)
Is it possible that the time you're recording includes time spent jitting a loop body?
I see the difference - the way I was testing it, the Array iterator methods get jitted during the interpreter run and can then be inlined in the jitted run; whereas with the dynapogo flags (dynamic profile info and forcenative) your test function gets jitted first and the array iterator methods get jitted after, so they can't be inlined.
Putting a single call to a[Symbol.iterator]().next(); before you call test(); gets rid of much of your discrepancy, and using -off:jitloopbody (per @pleath's suggestion) makes the situation more stable still.
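A minimal sketch of that warm-up, assuming the test(a) harness and array a from the code above:
// Calling next() once during the interpreted phase gets the ArrayIterator
// methods jitted early, so the later-jitted test function can inline them.
a[Symbol.iterator]().next(); // warm up the default array iterator
test(a);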
For reference, my test case (with the results above) was this:
function testForOf(arr)
{
const start = WScript.monotonicNow();
for (_ of arr) {
_;
}
return WScript.monotonicNow() - start;
}
function testForNonCached(arr)
{
const start = WScript.monotonicNow();
for (let i = 0; i < arr.length; ++i) {
arr[i];
}
return WScript.monotonicNow() - start;
}
function testForCached(arr)
{
const start = WScript.monotonicNow();
let len = arr.length;
for (let i = 0; i < len; ++i) {
arr[i];
}
return WScript.monotonicNow() - start;
}
const a = Array.from({ length: 1e7 }, _ => 0);
print("RUN ONE - interpreted");
print("testForOf time: " + testForOf(a));
print("testForNonCached time: " + testForNonCached(a));
print("testForCached time: " + testForCached(a));
print("RUN TWO - fullJit");
print("testForOf time: " + testForOf(a));
print("testForNonCached time: " + testForNonCached(a));
print("testForCached time: " + testForCached(a));
And the flags: -mic:1 -bgjit- -off:simplejit
Thinking more on the proposed optimisation, I wonder if we could do something like this (pseudocode):
canOpt = Op_IsDefaultIterator()
if (canOpt)
{
initFasterLoop
fastLabel:
//some kind of op to put the first value into the iterator destination
}
else
{
initSlowLoop
slowLabel:
current logic
}
// loop body here
if (canOpt)
{
goto fastLabel
}
else
{
goto slowLabel
}
This does of course mean an extra condition to check on every iteration of the loop, BUT the jit should be able to hoist it - and when it hits, it would be a relatively massive optimisation - though dealing with the try/catch/finally that is needed on the default path but not the fast path would be awkward.
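A sketch of the hoisted form the jit could produce (loop unswitching, reusing the names from the pseudocode above; purely illustrative):
// canOpt is checked once, outside the loop, and two loop versions are emitted;
// only the slow version carries the try/catch/finally for iterator.return().
canOpt = Op_IsDefaultIterator()
if (canOpt)
{
  initFasterLoop
  fastLabel:
  // loop body here
  goto fastLabel // until iteration is exhausted
}
else
{
  initSlowLoop
  slowLabel:
  // current logic (wrapped in try/catch/finally)
  // loop body here
  goto slowLabel // until iteration is exhausted
}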
Is it possible that the time you're recording includes time spent jitting a loop body?
After adding the -off:jitloopbody flag to the run that executes in interpreted mode, it really does show that the time is spent jitting a loop body.