+= to concat strings is heavily underperforming
Using the concat method or array push, this operation takes around 0.015–0.030 seconds.
const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";
const addStrings = times => {
  let str = "";
  for (let i = 0; i < times; ++i) {
    str += str1 + str2;
  }
  return str;
}
addStrings(30000); // 8 seconds
- This code takes 0.013 seconds to run in Node.
- In QuickJS it takes 8.672 seconds.
If you want to test, you can run the following command:
curl https://raw.githubusercontent.com/StringManolo/javascript-performance-study/main/joinStrings/joinStrings.js -O && node joinStrings.js && qjs joinStrings.js
Notice qjs takes a few seconds to run the code.
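As a sketch of the faster variants mentioned at the top (array push plus a single join), the same output can be produced in linear time. The function name here is mine; timings will vary by machine:

```javascript
const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";

// Collect the pieces in an array and join once at the end,
// so each character is copied a bounded number of times
// instead of the whole accumulated string being copied per iteration.
const addStringsJoin = times => {
  const parts = [];
  for (let i = 0; i < times; ++i) {
    parts.push(str1, str2);
  }
  return parts.join("");
};
```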
I would guess V8 is optimizing the code and executing it at compile time (the same program with a console.log is similarly fast). QuickJS runs it on my machine in 1.071 seconds. An equivalent Ruby program executes in 1.209 seconds, so I think it's more a testament to how good V8 is than to QuickJS being slow: execution speeds are comparable to other non-JIT interpreters.
So this is kind of interesting. At first I thought this was because QuickJS lacks the unique-reference in-place concatenation optimisation. As it turns out, not quite: the optimisation is already there, but a number of implementation flaws (and some design flaws) in QuickJS prevent it from being actually useful, at least in interpreted code.
The problems are as follows:
- No opcode for in-place concatenation of lexically-scoped variables

QuickJS compiles JavaScript functions into bytecode, which is then interpreted by a stack-based virtual machine. The central performance bottleneck is obviously result += str1 + str2;. This is the unoptimised bytecode this statement compiles to:

;; result += str1 + str2;
42: get_loc_check 0: result
    get_var str1
    get_var str2
    add
    add
    dup
    put_loc_check 0: result
    drop

The get_loc_check opcode puts the value of a lexically-scoped variable on the stack, while put_loc_check removes a value from the stack and puts it in a lexically-scoped variable. The equivalents for function-scoped (var) variables are get_loc and put_loc. Values are generally reference-counted (with cycle detection): putting a value on the stack increments its reference count, removing it decrements it. As such, the string cannot be concatenated in place, because there are two references to it: one from the stack, the other from the local variable. We would like to have an opcode which can update the variable in place, without incrementing the reference count as an intermediate step. Such an opcode exists for var variables: its name is add_loc. There is no equivalent add_loc_check for let variables.
- Simplistic optimiser performs peephole analysis only

We can avoid the above flaw by declaring result with the var keyword. But all this does is change get_loc_check and put_loc_check into get_loc and put_loc. An optimisation step is necessary to transform the opcode stream to use add_loc.

QuickJS does, in fact, have an optimiser. It is able to simplify the dup / put_loc / drop sequence into just a put_loc. To be able to use add_loc, in full generality it ought to match on get_loc(X) / {Y} / add / put_loc(X) and transform it to {Y} / add_loc(X), for {Y} an arbitrary sequence of opcodes that touches neither the local variable X itself nor the stack slot in which its value is put by get_loc. The optimiser is nowhere near that sophisticated; it can only perform bounded-lookahead pattern matching against the opcode stream. There is no data flow analysis, just dumb substitution of fixed opcode patterns.

We can avoid this flaw by putting the concatenation of str1 and str2 in an intermediate variable. Adding a variable to another variable generates an opcode sequence get_loc(X) / get_loc(Y) / add / put_loc(X) that QuickJS optimises into get_loc(Y) / add_loc(X):

;; var tmp = str1 + str2;
get_var str1
get_var str2
add
put_loc2 2: tmp
;; result += tmp;
get_loc2 2: tmp
add_loc 0: result
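Putting the two workarounds together at the JavaScript level (var-declared accumulator plus an intermediate variable), the hot loop from the issue can be rewritten as below. This is a sketch based on the analysis above; whether add_loc is actually emitted depends on the QuickJS version and build.

```javascript
const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";

// Rewritten so QuickJS's peephole optimiser can match the
// get_loc / get_loc / add / put_loc pattern and emit add_loc:
// the accumulator is function-scoped (var, since there is no
// add_loc_check for let) and the right-hand side is computed
// into a temporary first.
function addStringsVar(times) {
  var result = "";
  for (var i = 0; i < times; ++i) {
    var tmp = str1 + str2;
    result += tmp;
  }
  return result;
}
```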
- Spurious internal reference-count increments

Using the add_loc opcode avoids putting the string on the stack, so you would think that it would avoid incrementing the reference count. Except that the opcode's implementation internally increments the reference count anyway:

op1 = JS_ConcatString(ctx, JS_DupValue(ctx, *pv), op1);
if (JS_IsException(op1))
    goto exception;
set_value(ctx, pv, op1);

While the call to JS_ConcatString is active, there are two references to the string: one from the temporary (created by JS_DupValue), the other from the variable. JS_ConcatString drops the former, set_value drops the latter.

We can address this with a kludge:

op1 = JS_ConcatString(ctx, *pv, op1);
if (JS_IsException(op1))
    goto exception;
*pv = op1;

Kind of ugly, but it gets the job done. Now the reference count stays at 1 the whole time.
- No memory allocation amortisation

Now the punchline: even if the reference count is 1, this may still not be enough to reuse the allocation of one of the strings.

str = js_malloc_rt(rt, sizeof(JSString) + (max_len << is_wide_char) + 1 - is_wide_char);

When allocating memory for strings, QuickJS requests no more space than required to store the string. The allocator is allowed to provide more storage; when QuickJS detects this, the allocation is reused. But far too often, there simply isn't enough room, and the opportunity never arises. This triggers the slow path where a new string is allocated and the contents are copied manually: not even js_realloc is used to reuse the old object's data.

By adding js_realloc on top of the above patches, I managed to get execution time to drop on my machine from nearly 6 seconds to about 300 ms. (Your performance may vary: I compile QuickJS either with TinyCC or with ASan and UBSan enabled, either of which may slow things down.)
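The missing amortisation described above is the classic dynamic-array growth trade-off. Here is a toy JavaScript model of the difference (nothing here is QuickJS code; the functions just count how many characters get copied under the two growth policies):

```javascript
// Exact-fit growth: every append reallocates and copies the whole
// string, so total copying is quadratic in the final length.
function copiesExactFit(appends, chunk) {
  let copied = 0, len = 0;
  for (let i = 0; i < appends; ++i) {
    copied += len;   // old contents copied into a fresh allocation
    len += chunk;
  }
  return copied;
}

// Doubling growth: copy only when capacity runs out, so total
// copying is linear in the final length (amortised O(1) per append).
function copiesDoubling(appends, chunk) {
  let copied = 0, len = 0, cap = 0;
  for (let i = 0; i < appends; ++i) {
    if (len + chunk > cap) {
      copied += len;   // copy happens only on capacity exhaustion
      cap = Math.max(cap * 2, len + chunk);
    }
    len += chunk;
  }
  return copied;
}
```

For 1000 one-character appends, exact-fit growth copies 499500 characters in total, while doubling copies only about a thousand.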
Good analysis. Our conclusion at the time was that it was not worth continuing on this track because there will still be corner cases which give a large slowdown. Instead, a rope-based string representation may be a simpler solution.
On 8/11/21 1:40 PM, Felix S. wrote:
I've been in awe at the excellent performance of quickjs; I really appreciate when performance emerges from simplicity and lightness, as opposed to complex layers upon layers of JIT and runtime analysis.
The only issue I'm running into is the fact that a good number of JS programmers assume that string concatenation is optimized. An example of this is the official Pug parser. A 1,000-line Pug file takes over 55 seconds to parse in quickjs versus ~100 ms in Node/V8.
Consider this a "+1" for this issue, I suppose! I love quickjs and I'd love to be able to trust it will work on any codebase, but until someone has time to implement the rope-based string representation I'll have to restrict my usage to the programs that won't exhibit problematic slowdowns. Thanks again for quickjs!
Doing tests, I found that using += individually is faster; see the changes below and the output:
/usr/bin/time qjs joinStrings.js
Config:
- Running 30000 times
Results:
strPlusStr -> 3732ms
strPlusStr2 -> 1689ms
strToArrayPush -> 8ms
strPush -> 5ms
strConcatStr -> 8ms
2.41user 3.03system 0:05.44elapsed 99%CPU (0avgtext+0avgdata 6880maxresident)k
0inputs+0outputs (0major+2113473minor)pagefaults 0swaps
joinStrings.js:
const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";

const strPlusStr = times => {
  let str = "";
  for (let i = 0; i < times; ++i) {
    str += str1 + str2;
  }
  return str;
}

const strPlusStr2 = times => {
  let str = "";
  for (let i = 0; i < times; ++i) {
    str += str1;
    str += str2;
  }
  return str;
}

const strToArrayPush = times => {
  let str = [];
  for (let i = 0; i < times; ++i) {
    str.push(str1);
    str.push(str2);
  }
  return str.join("");
}

const strPush = times => {
  let str = [];
  for (let i = 0; i < times; ++i) {
    str.push(str1, str2);
  }
  return str.join("");
}

const strConcatStr = times => {
  let str = [];
  for (let i = 0; i < times; ++i) {
    str.push(str1.concat(str2));
  }
  return str.join("");
}
const _ = console.log;
//const _ = alert; // For Android Browser Testing

const test = times => {
  let time;
  let results = [];

  time = new Date();
  strPlusStr(times);
  results.push(new Date() - time);

  time = new Date();
  strPlusStr2(times);
  results.push(new Date() - time);

  time = new Date();
  strToArrayPush(times);
  results.push(new Date() - time);

  time = new Date();
  strPush(times);
  results.push(new Date() - time);

  time = new Date();
  strConcatStr(times);
  results.push(new Date() - time);

  _(`
Config:
- Running ${times} times
Results:
strPlusStr -> ${results[0]}ms
strPlusStr2 -> ${results[1]}ms
strToArrayPush -> ${results[2]}ms
strPush -> ${results[3]}ms
strConcatStr -> ${results[4]}ms
`);
}
/*
test(1)
test(10)
test(100)
test(1000)
test(10000)
test(100000) */
//test(1000000)
test(30000); // qjs bug on strPlusStr function
/* > qjs joinStrings.js
Config:
- Running 30000 times
Results:
strPlusStr -> 8100ms HERE
strToArrayPush -> 19ms
strConcatStr -> 24ms
> node joinStrings.js
Config:
- Running 30000 times
Results:
strPlusStr -> 13ms THIS IS ALL GOOD
strToArrayPush -> 14ms
strConcatStr -> 13ms
*/
Hello @fstirlitz, I didn't see your changes in your fork at https://github.com/fstirlitz/quickjs. Could you please attach or copy and paste your diff/patch here?
@bellard was it fixed in 2024.Jan release?
I tested with the Hermes engine. It does optimize += on strings. However, as @bellard said, there are corner cases where the += optimization won't work. For example, after changing str = str + str1 to str = str1 + str (a rarely seen case in a loop, I think), Hermes took 16778 ms to complete vs 2407 ms for qjs on my PC.
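For reference, a sketch of the prepend pattern described above (the function name is mine, and the exact timings will of course differ per machine and engine):

```javascript
// Prepending defeats the in-place += optimization: the result of
// "abc" + str cannot reuse str's allocation, so every iteration
// copies the whole accumulated string.
const prepend = times => {
  let str = "";
  for (let i = 0; i < times; ++i) {
    str = "abc" + str;
  }
  return str;
};
```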
I agree the rope based string approach may be a correct direction for general string optimization.
The optimization is undoubtedly required, for example:
let longText;
...
// To append a newline to the end of a long text, we have to duplicate the whole text.
// The time complexity is O(n) for appending one character, which is unacceptable!
longText += '\n';

// The example below takes 66531ms on my PC!
// Hermes finishes in 142ms
// And I think quickjs will behave better with a similar optimization
let str = '';
for (let i = 0; i < 1000_000; i++) {
  str += 'a';
}
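For this particular loop there are workarounds that avoid per-iteration concatenation entirely; a sketch (function names are mine):

```javascript
// Build the pieces in an array and join once, or use
// String.prototype.repeat for a single repeated character.
// Both do O(n) total work instead of O(n^2).
const viaJoin = n => new Array(n).fill("a").join("");
const viaRepeat = n => "a".repeat(n);
```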
But do we need a general string optimization, and does it perform better in practice?
Here are my thoughts:
- I think str = str + str1 or str += str1 is the more commonly seen case. Now that string concat is used almost everywhere, I think it would greatly improve performance even if we just optimize for this particular case.
- It's easier to implement an optimization for the string += operation than rope strings.
- Rope strings may perform well for frequent string editing, but they may not perform as well as a linear string structure for read operations because of the locality principle (memory caches).
So in my opinion:
WRITE: rope-based strings are a general optimization approach for write operations, but appending a value to another string's tail is more commonly seen.
READ: a linear memory layout performs better for read operations.
I just learned that V8 doesn't adopt rope strings either. Maybe we can implement a similar optimization in quickjs too.
Here are my brief thoughts:
- For tiny strings, store them inline to make use of the memory cache.
- For large strings, use the heap.
- For string concats, use a concat list to store references.
- For string slices, use a slice reference.
Four string formats in total.
Let's discuss the operations on these formats:

Type 1
Read: straightforward.
Concat:
- In place if still tiny.
- Reform to Type 3 otherwise.
Slice: copy the slice range to a new string.

Type 2
Read: straightforward.
Concat:
- In place if appending a Type 1 string.
- Reform to Type 3 otherwise.
Slice:
- Copy the slice range to a new string if tiny.
- New string with Type 4 otherwise.

Type 3
Read: estimate the cost to read; when it's cheaper to reform to Type 2, do so. For example:
- Reading the first or last block is cheap.
- Frequently reading across inner blocks is expensive.
Concat: remain Type 3.
Slice: new string with Type 4.

Type 4
Read: do the read through its reference (this may cause the reference to reform, but no change to itself).
Concat: reform to Type 3.
Slice: new string with Type 4.
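To make the four formats concrete, here is a toy sketch as plain tagged JS objects. This is illustrative only: the names and the tiny-size threshold are invented for this example, and it is not QuickJS code.

```javascript
const TINY_MAX = 16;  // invented threshold for "tiny"

// Type 1: tiny inline string; Type 2: heap string;
// Type 3: concat list of references; Type 4: slice reference.
const tiny = s => ({ type: 1, data: s });
const heap = s => ({ type: 2, data: s });
const concatNode = (a, b) => ({ type: 3, parts: [a, b] });
const sliceRef = (ref, start, end) => ({ type: 4, ref, start, end });

// Flatten any representation back into a linear string
// (the "reform" step in the tables above).
function flatten(v) {
  switch (v.type) {
    case 1:
    case 2: return v.data;
    case 3: return v.parts.map(flatten).join("");
    case 4: return flatten(v.ref).slice(v.start, v.end);
  }
}

// Concat stays in place while tiny; otherwise it defers the copy
// by reforming to Type 3 in O(1).
function concat(a, b) {
  if (a.type === 1 && b.type === 1 && a.data.length + b.data.length <= TINY_MAX)
    return tiny(a.data + b.data);
  return concatNode(a, b);
}
```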
Thank you for your summary. Do string slices really deserve optimizing? I am not sure real-world code would benefit from such an optimisation; adding more string types is somewhat painful and error-prone.
Thanks for the reply. I think slices do have some weight in string operations: for example, split and regexp, which I think are commonly used.
I think you are right that it may introduce more difficulty and complexity in the implementation. For example, we need to consider restricting the depth of references in both the concat list and slices.
But I think they can be done progressively. We can do it more simply at first. For example, before reading u.str8 and u.str16 of the JSString struct, we do a preread operation, which is responsible for turning a concat list or slice into a normal string, so that we can directly access u.str8 and u.str16. For an inline string or heap string, we can use the pointer directly. This way, we don't introduce too many changes to the read process. As for the write process, we can do it both the old way and the new way at first: if we do it the old way, no changes are needed; if we do it the new way, preread will make it right for reading. Either way, it is progressive.
If simple optimizations are not good enough, we can progressively introduce deeper optimizations.
Actually, I'm implementing them on my own when I have free time. It's a rough implementation, but I will post a benchmark comparison here after I finish. We will see if it is worth it.