+= to concat strings is heavily underperforming
Using the concat method or array push, this operation takes around 0.015–0.030 seconds.
const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";
const addStrings = times => {
  let str = "";
  for (let i = 0; i < times; ++i) {
    str += str1 + str2;
  }
  return str;
}
addStrings(30000); // 8 seconds
- This code takes 0.013 seconds to run in Node.
- In QuickJS it takes 8.672 seconds.
If you want to test, you can run the following command:
curl https://raw.githubusercontent.com/StringManolo/javascript-performance-study/main/joinStrings/joinStrings.js -O && node joinStrings.js && qjs joinStrings.js
Notice qjs takes a few seconds to run the code.
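As a sketch of the faster variants mentioned at the top (array push plus a single join), the same output can be produced in linear time. The function name here is mine; timings will vary by machine:

```javascript
const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";

// Collect the pieces in an array and join once at the end,
// so each character is copied a bounded number of times
// instead of the whole accumulated string being copied per iteration.
const addStringsJoin = times => {
  const parts = [];
  for (let i = 0; i < times; ++i) {
    parts.push(str1, str2);
  }
  return parts.join("");
};
```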
I would guess V8 is optimizing the code and executing it at compile time (the same program with a console.log is similarly fast). QuickJS runs it on my machine in 1.071 seconds. An equivalent Ruby program executes in 1.209 seconds, so I think it's more a testament to how good V8 is than to QuickJS being slow: execution speeds are comparable to other non-JIT interpreters.
So this is kind of interesting. At first I thought this was because QuickJS lacks the unique-reference in-place concatenation optimisation. As it turns out, not quite: the optimisation is already there, but a number of implementation flaws (and some design flaws) in QuickJS prevent it from being actually useful, at least in interpreted code.
The problems are as follows:
- No opcode for in-place concatenation of lexically-scoped variables

QuickJS compiles JavaScript functions into bytecode, which is then interpreted by a stack-based virtual machine. The central performance bottleneck is obviously result += str1 + str2;. This is the unoptimised bytecode this statement compiles to:

;; result += str1 + str2;
42: get_loc_check 0: result
    get_var str1
    get_var str2
    add
    add
    dup
    put_loc_check 0: result
    drop

The get_loc_check opcode puts the value of a lexically-scoped variable on the stack, while put_loc_check removes a value from the stack and puts it in a lexically-scoped variable. The equivalents for function-scoped (var) variables are get_loc and put_loc. Values are generally reference-counted (with cycle detection): putting a value on the stack increments its reference count, removing it decrements it. As such, the string cannot be concatenated in place, because there are two references to it: one from the stack, the other from the local variable. We would like to have an opcode which can update the variable in place, without incrementing the reference count as an intermediate step. Such an opcode exists for var variables: its name is add_loc. There is no equivalent add_loc_check for let variables.
- Simplistic optimiser performs peephole analysis only

We can avoid the above flaw by declaring result with the var keyword. But all this does is change get_loc_check and put_loc_check into get_loc and put_loc. An optimisation step is necessary to transform the opcode stream to use add_loc.

QuickJS does, in fact, have an optimiser. It is able to simplify the dup / put_loc / drop sequence into just a put_loc. To be able to use add_loc, in full generality it ought to match on get_loc(X) / {Y} / add / put_loc(X) and transform it to {Y} / add_loc(X), for {Y} an arbitrary sequence of opcodes that touches neither the local variable X itself nor the stack slot in which its value is put by get_loc. The optimiser is nowhere near that sophisticated; it can only perform bounded-lookahead pattern matching against the opcode stream. There is no data flow analysis, just dumb substitution of fixed opcode patterns.

We can avoid this flaw by putting the concatenation of str1 and str2 in an intermediate variable. Adding a variable to another variable generates an opcode sequence get_loc(X) / get_loc(Y) / add / put_loc(X) that QuickJS optimises into get_loc(Y) / add_loc(X):

;; var tmp = str1 + str2;
get_var str1
get_var str2
add
put_loc2 2: tmp
;; result += tmp;
get_loc2 2: tmp
add_loc 0: result
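Putting the two workarounds together at the JavaScript level (var-declared accumulator plus an intermediate variable), the hot loop from the issue can be rewritten as below. This is a sketch based on the analysis above; whether add_loc is actually emitted depends on the QuickJS version and build.

```javascript
const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";

// Rewritten so QuickJS's peephole optimiser can match the
// get_loc / get_loc / add / put_loc pattern and emit add_loc:
// the accumulator is function-scoped (var, since there is no
// add_loc_check for let) and the right-hand side is computed
// into a temporary first.
function addStringsVar(times) {
  var result = "";
  for (var i = 0; i < times; ++i) {
    var tmp = str1 + str2;
    result += tmp;
  }
  return result;
}
```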
- Spurious internal reference-count increments

Using the add_loc opcode avoids putting the string on the stack, so you would think that it would avoid incrementing the reference count. Except that the opcode's implementation internally increments the reference count anyway:

op1 = JS_ConcatString(ctx, JS_DupValue(ctx, *pv), op1);
if (JS_IsException(op1))
    goto exception;
set_value(ctx, pv, op1);

While the call to JS_ConcatString is active, there are two references to the string: one from the temporary (created by JS_DupValue), the other from the variable. JS_ConcatString drops the former, set_value drops the latter.

We can address this with a kludge:

op1 = JS_ConcatString(ctx, *pv, op1);
if (JS_IsException(op1))
    goto exception;
*pv = op1;

Kind of ugly, but it gets the job done. Now the reference count stays at 1 the whole time.
- No memory allocation amortisation

Now the punchline: even if the reference count is 1, this may still not be enough to reuse the allocation of one of the strings.

str = js_malloc_rt(rt, sizeof(JSString) + (max_len << is_wide_char) + 1 - is_wide_char);

When allocating memory for strings, QuickJS requests no more space than required to store the string. The allocator is allowed to provide more storage; when QuickJS detects this, the allocation is reused. But far too often, there simply isn't enough room, and the opportunity never arises. This triggers the slow path where a new string is allocated and the contents are copied manually: not even js_realloc is used to reuse the old object's data.

By adding js_realloc on top of the above patches, I managed to get execution time to drop on my machine from nearly 6 seconds to about 300 ms. (Your performance may vary: I compile QuickJS either with TinyCC or with ASan and UBSan enabled, either of which may slow things down.)
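The missing amortisation described above is the classic dynamic-array growth trade-off. Here is a toy JavaScript model of the difference (nothing here is QuickJS code; the functions just count how many characters get copied under the two growth policies):

```javascript
// Exact-fit growth: every append reallocates and copies the whole
// string, so total copying is quadratic in the final length.
function copiesExactFit(appends, chunk) {
  let copied = 0, len = 0;
  for (let i = 0; i < appends; ++i) {
    copied += len;   // old contents copied into a fresh allocation
    len += chunk;
  }
  return copied;
}

// Doubling growth: copy only when capacity runs out, so total
// copying is linear in the final length (amortised O(1) per append).
function copiesDoubling(appends, chunk) {
  let copied = 0, len = 0, cap = 0;
  for (let i = 0; i < appends; ++i) {
    if (len + chunk > cap) {
      copied += len;   // copy happens only on capacity exhaustion
      cap = Math.max(cap * 2, len + chunk);
    }
    len += chunk;
  }
  return copied;
}
```

For 1000 one-character appends, exact-fit growth copies 499500 characters in total, while doubling copies only about a thousand.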
Good analysis. Our conclusion at the time was that it was not worth continuing on this track because there will still be corner cases which give a large slowdown. Instead, a rope-based string representation may be a simpler solution.
On 8/11/21 1:40 PM, Felix S. wrote:
I've been in awe at the excellent performance of quickjs; I really appreciate when performance emerges from simplicity and lightness, as opposed to complex layers upon layers of JIT and runtime analysis.
The only issue I'm running into is the fact that a good number of JS programmers assume that string concatenation is optimized. An example of this is the official Pug parser. A 1,000-line Pug file takes over 55 seconds to parse in quickjs versus ~100 ms in Node/V8.
Consider this a "+1" for this issue, I suppose! I love quickjs and I'd love to be able to trust it will work on any codebase, but until someone has time to implement the rope-based string representation I'll have to restrict my usage to the programs that won't exhibit problematic slowdowns. Thanks again for quickjs!
Doing tests, I found that using += individually is faster; see the changes below and the output:
/usr/bin/time qjs joinStrings.js
Config:
- Running 30000 times
Results:
strPlusStr -> 3732ms
strPlusStr2 -> 1689ms
strToArrayPush -> 8ms
strPush -> 5ms
strConcatStr -> 8ms
2.41user 3.03system 0:05.44elapsed 99%CPU (0avgtext+0avgdata 6880maxresident)k
0inputs+0outputs (0major+2113473minor)pagefaults 0swaps
joinStrings.js:
const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";

const strPlusStr = times => {
  let str = "";
  for (let i = 0; i < times; ++i) {
    str += str1 + str2;
  }
  return str;
}

const strPlusStr2 = times => {
  let str = "";
  for (let i = 0; i < times; ++i) {
    str += str1;
    str += str2;
  }
  return str;
}

const strToArrayPush = times => {
  let str = [];
  for (let i = 0; i < times; ++i) {
    str.push(str1);
    str.push(str2);
  }
  return str.join("");
}

const strPush = times => {
  let str = [];
  for (let i = 0; i < times; ++i) {
    str.push(str1, str2);
  }
  return str.join("");
}

const strConcatStr = times => {
  let str = [];
  for (let i = 0; i < times; ++i) {
    str.push(str1.concat(str2));
  }
  return str.join("");
}
const _ = console.log;
//const _ = alert; // For Android Browser Testing

const test = times => {
  let time;
  let results = [];

  time = new Date();
  strPlusStr(times);
  results.push(new Date() - time);

  time = new Date();
  strPlusStr2(times);
  results.push(new Date() - time);

  time = new Date();
  strToArrayPush(times);
  results.push(new Date() - time);

  time = new Date();
  strPush(times);
  results.push(new Date() - time);

  time = new Date();
  strConcatStr(times);
  results.push(new Date() - time);

  _(`
Config:
- Running ${times} times
Results:
strPlusStr -> ${results[0]}ms
strPlusStr2 -> ${results[1]}ms
strToArrayPush -> ${results[2]}ms
strPush -> ${results[3]}ms
strConcatStr -> ${results[4]}ms
`);
}
/*
test(1)
test(10)
test(100)
test(1000)
test(10000)
test(100000) */
//test(1000000)
test(30000); // qjs bug on strPlusStr function
/* > qjs joinStrings.js
Config:
- Running 30000 times
Results:
strPlusStr -> 8100ms HERE
strToArrayPush -> 19ms
strConcatStr -> 24ms
> node joinStrings.js
Config:
- Running 30000 times
Results:
strPlusStr -> 13ms THIS IS ALL GOOD
strToArrayPush -> 14ms
strConcatStr -> 13ms
*/
Hello @fstirlitz, I didn't see your changes in your fork at https://github.com/fstirlitz/quickjs. Could you please attach or copy and paste your diff/patch here?
@bellard was it fixed in 2024.Jan release?
I tested with the Hermes engine. It does optimize += on strings. However, as @bellard said, there are corner cases where the += optimization won't work. For example, after changing str = str + str1 to str = str1 + str (a rarely seen case in a loop, I think), Hermes took 16778 ms to complete vs 2407 ms for qjs on my PC.
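For reference, a sketch of the prepend pattern described above (the function name is mine, and the exact timings will of course differ per machine and engine):

```javascript
// Prepending defeats the in-place += optimization: the result of
// "abc" + str cannot reuse str's allocation, so every iteration
// copies the whole accumulated string.
const prepend = times => {
  let str = "";
  for (let i = 0; i < times; ++i) {
    str = "abc" + str;
  }
  return str;
};
```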
I agree the rope based string approach may be a correct direction for general string optimization.
The optimization is undoubtedly required, for example:
let longText;
...
// To append a newline to the end of a long text, we have to duplicate the whole text.
// The time complexity is O(n) for appending one character, which is unacceptable!
longText += '\n';

// The example below takes 66531ms on my PC!
// Hermes finishes in 142ms
// And I think quickjs will behave better with a similar optimization
let str = '';
for (let i = 0; i < 1000_000; i++) {
  str += 'a';
}
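For this particular loop there are workarounds that avoid per-iteration concatenation entirely; a sketch (function names are mine):

```javascript
// Build the pieces in an array and join once, or use
// String.prototype.repeat for a single repeated character.
// Both do O(n) total work instead of O(n^2).
const viaJoin = n => new Array(n).fill("a").join("");
const viaRepeat = n => "a".repeat(n);
```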
But do we need a general string optimization, and does it perform better in practice?
Here are my thoughts:
- I think str = str + str1 or str += str1 is the more commonly seen case. Now that string concat is used almost everywhere, I think it would greatly improve performance even if we just optimize for this particular case.
- It's easier to implement an optimization for the string += operation than rope strings.
- Rope strings may perform well for frequent string editing, but they may not perform as well as a linear string structure for read operations because of the locality principle (memory caches).
So in my opinion:
WRITE: rope-based strings are a general optimization approach for write operations, but appending a value to another string's tail is more commonly seen.
READ: a linear memory layout performs better for read operations.
I just learned that V8 doesn't adopt rope strings either. Maybe we can implement a similar optimization in quickjs too.
Here are my brief thoughts:
- For tiny strings, store them inline to make use of the memory cache.
- For large strings, use the heap.
- For string concats, use a concat list to store references.
- For string slices, use a slice reference.
Four string formats in total.
Let's discuss the operations on these formats:

Type 1
Read: straightforward.
Concat:
- In place if still tiny.
- Reform to Type 3 otherwise.
Slice: copy the slice range to a new string.

Type 2
Read: straightforward.
Concat:
- In place if appending a Type 1 string.
- Reform to Type 3 otherwise.
Slice:
- Copy the slice range to a new string if tiny.
- New string with Type 4 otherwise.

Type 3
Read: estimate the cost to read; when it's cheaper to reform to Type 2, do so. For example:
- Reading the first or last block is cheap.
- Frequently reading across inner blocks is expensive.
Concat: remain Type 3.
Slice: new string with Type 4.

Type 4
Read: do the read through its reference (this may cause the reference to reform, but no change to itself).
Concat: reform to Type 3.
Slice: new string with Type 4.
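To make the four formats concrete, here is a toy sketch as plain tagged JS objects. This is illustrative only: the names and the tiny-size threshold are invented for this example, and it is not QuickJS code.

```javascript
const TINY_MAX = 16;  // invented threshold for "tiny"

// Type 1: tiny inline string; Type 2: heap string;
// Type 3: concat list of references; Type 4: slice reference.
const tiny = s => ({ type: 1, data: s });
const heap = s => ({ type: 2, data: s });
const concatNode = (a, b) => ({ type: 3, parts: [a, b] });
const sliceRef = (ref, start, end) => ({ type: 4, ref, start, end });

// Flatten any representation back into a linear string
// (the "reform" step in the tables above).
function flatten(v) {
  switch (v.type) {
    case 1:
    case 2: return v.data;
    case 3: return v.parts.map(flatten).join("");
    case 4: return flatten(v.ref).slice(v.start, v.end);
  }
}

// Concat stays in place while tiny; otherwise it defers the copy
// by reforming to Type 3 in O(1).
function concat(a, b) {
  if (a.type === 1 && b.type === 1 && a.data.length + b.data.length <= TINY_MAX)
    return tiny(a.data + b.data);
  return concatNode(a, b);
}
```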
Thank you for your summary. Do string slices really deserve optimizing? I am not sure real-world code would benefit from such an optimisation; adding more string types is somewhat painful and error-prone.
Thanks for the reply. I think slices do have some weight in string operations: for example, split and regexp, which I think are commonly used.
I think you are right that it may introduce more difficulty and complexity in the implementation. For example, we need to consider restricting the depth of references in both the concat list and slices.
But I think they can be done progressively. We can do it more simply at first. For example, before reading u.str8 and u.str16 of the JSString struct, we do a preread operation, which is responsible for turning a concat list or slice into a normal string, so that we can directly access u.str8 and u.str16. For an inline string or heap string, we can use the pointer directly. This way, we don't introduce too many changes to the read process. As for the write process, we can do it both the old way and the new way at first: if we do it the old way, no changes are needed; if we do it the new way, preread will make it right for reading. Either way, it is progressive.
If simple optimizations are not good enough, we can progressively introduce deeper optimizations.
Actually, I'm implementing them on my own when I have free time. It's a rough implementation, but I will post a benchmark comparison here after I finish. We will see if it is worth it.