
Tackling performance issues when searching on the same line multiple times

Open aeschli opened this issue 6 years ago • 10 comments

vscode-textmate uses oniguruma (and soon onigasm) to tokenize source code. It does that line by line and by calling 'findNextMatch' with increasing offset and a varied set of regexp-patterns until all tokens of a line have been evaluated. The set of regexp-patterns can be different for each find, based on the state of the tokenizer. The tokenizer is a state machine where the previously found tokens define the state the tokenizer is in.

Tokenizing a long line is quite expensive. The extreme case is uglified, optimized source code where all newlines are removed and the whole content consists of just one line.

To improve the performance of tokenizing a long line, the following characteristics of the tokenize algorithm can be used:

  • the same string is searched on multiple times
  • some regexp-pattern sets are used multiple times within a line, e.g. if they are associated with a 'base' state that the tokenizer state machine falls into often.
  • some regexp-patterns appear in multiple pattern-sets. An example is the regexp that finds a comment token, which is present in almost every pattern-set. The same goes for the regexp-patterns that find identifiers or keywords.

Based on these observations there's the potential for some performance improvements. Most of them are already implemented in node-oniguruma.

  • store the encoded UTF-8 string and the UTF-8 to UTF-16 offset mapping table along with the input string to avoid re-encoding and recalculating the offsets. The first part is nicely tackled in #5
  • remember the last search result of each regexp-pattern. The previous search result can be reused if:
    • the previous input string was the same
    • the current search position is the same as or larger than the previous search position
    • the result offset is the same as or larger than the current search position
    • often, the first find done on a string is done with index 0. In the best case, no match was found on the whole line, even better if that pattern was the 'comment' pattern. See here how it is done in node-oniguruma.
  • reuse regexp-pattern search results across pattern-sets (if I'm not mistaken, 'node-oniguruma' doesn't do that)
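The "remember the last search result" conditions above can be sketched as a small per-pattern cache. This is only an illustration of the idea, not node-oniguruma's or onigasm's actual code; `SearchFn`, `CachedPattern` and the demo names are invented:

```typescript
// One cache entry per regexp-pattern: the last (text, position, result) triple.
type Match = { start: number; end: number } | null;
type SearchFn = (text: string, from: number) => Match;

class CachedPattern {
  private lastText: string | null = null;
  private lastFrom = -1;
  private lastResult: Match = null;

  constructor(private search: SearchFn) {}

  findNextMatch(text: string, from: number): Match {
    // Reuse the previous result only when all three conditions hold:
    // 1. same input string,
    // 2. the new search starts at or after the old one,
    // 3. the cached match (if any) begins at or after the new start position --
    //    a null result is also reusable, since nothing matched further right.
    if (
      this.lastText === text &&
      from >= this.lastFrom &&
      (this.lastResult === null || this.lastResult.start >= from)
    ) {
      return this.lastResult;
    }
    this.lastText = text;
    this.lastFrom = from;
    this.lastResult = this.search(text, from);
    return this.lastResult;
  }
}

// Demo with a stand-in "comment pattern": count how often the real search runs.
let calls = 0;
const commentPattern = new CachedPattern((t, i) => {
  calls++;
  const idx = t.indexOf("//", i);
  return idx === -1 ? null : { start: idx, end: idx + 2 };
});
const line = "let x = 1; // comment";
const first = commentPattern.findNextMatch(line, 0); // real search
const second = commentPattern.findNextMatch(line, 4); // served from cache
```

After both calls, `calls` is still 1: the second lookup at position 4 reuses the match found at position 11, which is exactly the "comment pattern found far to the right" best case described above.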

aeschli avatar Jun 15 '18 07:06 aeschli

it stores the encoded UTF-8 string and the UTF-8 to UTF-16 offset mapping table along with the input string to avoid re-encoding and recalculating the offsets. The first part is nicely tackled in #5

What do you think about moving the utf8<->utf16 mapping system from the C side to the JS side? And likewise, copying it to HEAPU8 every time findNextMatch is called?

You see, we'll be calling onigasmH.HEAPU8.set(onigStr.utf8Bytes, strPtr) anyway, so we might as well copy the utf8<->utf16 mapping table along with it. Preferably in one call to HEAPU8.set?
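Computing both the UTF-8 bytes and the offset table in a single pass on the JS side could look roughly like this. This is a sketch under my own naming (`encodeWithMapping` is not onigasm's API), and it ignores invalid lone surrogates for brevity:

```typescript
// Encode a JS (UTF-16) string to UTF-8 and record, for every UTF-16 index,
// the UTF-8 byte offset where that character starts.
function encodeWithMapping(str: string) {
  const utf8 = new Uint8Array(str.length * 4); // worst case: 4 bytes per unit
  const utf16ToUtf8 = new Uint32Array(str.length + 1);
  let w = 0; // write position in utf8
  for (let i = 0; i < str.length; i++) {
    utf16ToUtf8[i] = w;
    const cp = str.codePointAt(i) ?? 0;
    if (cp <= 0x7f) {
      utf8[w++] = cp;
    } else if (cp <= 0x7ff) {
      utf8[w++] = 0xc0 | (cp >> 6);
      utf8[w++] = 0x80 | (cp & 0x3f);
    } else if (cp <= 0xffff) {
      utf8[w++] = 0xe0 | (cp >> 12);
      utf8[w++] = 0x80 | ((cp >> 6) & 0x3f);
      utf8[w++] = 0x80 | (cp & 0x3f);
    } else {
      utf8[w++] = 0xf0 | (cp >> 18);
      utf8[w++] = 0x80 | ((cp >> 12) & 0x3f);
      utf8[w++] = 0x80 | ((cp >> 6) & 0x3f);
      utf8[w++] = 0x80 | (cp & 0x3f);
      utf16ToUtf8[++i] = w; // low surrogate maps past the 4-byte sequence
    }
  }
  utf16ToUtf8[str.length] = w; // sentinel: total UTF-8 length
  return { utf8: utf8.subarray(0, w), utf16ToUtf8 };
}
```

Since both results are typed arrays, they could in principle be concatenated into one buffer and shipped to the wasm heap with a single HEAPU8.set call, as suggested above.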

zikaari avatar Jun 15 '18 08:06 zikaari

Moving the mapping system to the JS side sounds like a good idea. Encoding is already done on the JS side, so why not compute the mapping there as well and store it in the OnigString?

  • let the OnigString have a table that contains the UTF-8 offset of every JS string character (or of every 8th or 32nd character, if that takes too much memory; finding the location of a character in between then requires iterating through characters again)
  • pass an already-converted UTF-8 start offset to the native findBestMatch
  • let findBestMatch return UTF-8 offsets and do the conversion on demand in the OnigMatch on the JS side

I'm not sure if I'm missing something, but I think a single table is enough.

aeschli avatar Jun 15 '18 14:06 aeschli

PR #8 | Another attempt at improving performance

cc stakeholders Theia | @epatpol @marechal-p @svenefftinge CodeSandbox | @compuives (should you decide to use TextMate grammars also)

zikaari avatar Jun 17 '18 06:06 zikaari

Hi guys, do you still have the plan to work on the performance here? @NeekSandhu @aeschli

I noticed a huge performance problem when running the same regex on similar lines. While testing a file of around 11 MB, onigasm took around three times longer than Oniguruma.

pekunicki avatar Nov 03 '18 08:11 pekunicki

@pekunicki

A lot of improvements have been made since this issue was first opened. The primary performance culprits, encoding and offset mapping, were successfully tackled by "caching" those elements on the OnigString object itself instead of computing them every time.

To leverage those benefits, you must avoid passing "raw" literal strings to the regex functions, as JS has no way to do reference equality checks on literal strings and therefore no way to look up the "cached" data. Instead, create an OnigString object once per literal string, keep it referenced for as long as you need it, and pass that to all regex operations.
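A toy version of why the wrapper object matters (not onigasm's real API; `OnigStringLike` and `getUtf8` are made up for illustration): the expensive per-string work can be keyed on the wrapper's identity in a WeakMap, whereas a fresh raw string each call gives nothing stable to key on.

```typescript
// Wrapper object whose identity serves as the cache key.
class OnigStringLike {
  constructor(public readonly content: string) {}
}

const encodingCache = new WeakMap<OnigStringLike, Uint8Array>();
let encodeCount = 0; // counts how often the expensive encode actually runs

function getUtf8(s: OnigStringLike): Uint8Array {
  let bytes = encodingCache.get(s);
  if (!bytes) {
    encodeCount++;
    bytes = new TextEncoder().encode(s.content); // expensive: done once per object
    encodingCache.set(s, bytes);
  }
  return bytes;
}

// Create once, reuse for every regex operation on this line.
const line = new OnigStringLike("const x = 42;");
getUtf8(line);
getUtf8(line); // cache hit: no re-encode
```

A nice side effect of the WeakMap is that the cached bytes are garbage-collected together with the OnigStringLike once you drop the reference.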

This issue is still open because we haven't yet implemented the "reusing search results" part (memoization?). Would you be interested in sending in a PR?

zikaari avatar Nov 03 '18 08:11 zikaari

BTW did you encounter any performance issues even with the latest version of onigasm?

zikaari avatar Nov 03 '18 09:11 zikaari

@NeekSandhu

Thanks for the details! I'll try going with OnigString, and I'll get back to you with the results. Yep, I encountered a few performance issues, though they may be related to the browser, because earlier my program was running directly on V8.

pekunicki avatar Nov 03 '18 09:11 pekunicki

Hey! Firstly I want to say that I'm super grateful for all the work that you've done on onigasm, it has been super helpful for me while implementing vscode-textmate. I think I stumbled upon a computationally expensive string in CodeSandbox today. In this sandbox: https://codesandbox.io/s/o1q60l9lq9, test.js contains https://github.com/mathiasbynens/emoji-regex/blob/master/index.js as a snippet, and it seems to choke on the regexp line.

I'd like to help with looking into this, do you know where I could take a look?

CompuIves avatar Mar 25 '19 20:03 CompuIves

Really interesting case you've got there!

It appears to me that it's a regex ~~matching~~ fighting against a regex, i.e. a regex expression being tokenized by a pattern. Perhaps the intense granularity is causing the slowdown? But the same case works seamlessly in VSCode.

In Chrome, if I open DevTools before it chokes and then pause execution while it chokes, it always pauses inside a wasm file. So I think it's safe to say that something isn't right in the C binding code. We'll find out.

I've read that Firefox added support for WASM sourcemaps recently; that might be a great start. Besides that, if you'd like to get your hands dirty, I'd invite you to have a look at the scripts/build.sh file to see our build pipeline.

You'll need the usual suspects installed, like clang, emscripten and a few other things, as seen in the build.sh file.

I'd be happy to look at this, but unfortunately I'm not certain when I might get a chance.

I'll keep on passive investigation though :)

zikaari avatar Mar 26 '19 08:03 zikaari

The runtime concepts are quite simple, yet easy to mess up:

To pass data from JS to C, a malloc call is made on the JS side for the size of the data. The data is copied over to the returned pointer, and the pointer, which is just a number, is what gets passed on to the C runtime (along with the size).

Likewise, when going from C to JS, the same approach applies, just in the opposite direction.

Rule of thumb? Only integers can travel between JS and C. Complex objects must be encoded appropriately, converted to a Uint8Array, and copied over to HEAPU8; on the other side, the data is read from HEAPU8 starting at the pointer and ending at pointer + size.
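The round trip described above can be simulated entirely in JS. This toy sketch stands in a plain Uint8Array for Emscripten's HEAPU8 and a bump allocator for malloc; the function names are illustrative, not onigasm's:

```typescript
// Fake 1 KiB wasm heap and a trivial bump allocator in place of malloc.
const HEAPU8 = new Uint8Array(1024);
let heapTop = 0;

function malloc(size: number): number {
  const ptr = heapTop;
  heapTop += size;
  return ptr; // only this integer "pointer" crosses the JS/C boundary
}

// "JS side": encode the string, copy the bytes into the heap at ptr.
function passStringToC(str: string): { ptr: number; size: number } {
  const bytes = new TextEncoder().encode(str);
  const ptr = malloc(bytes.length);
  HEAPU8.set(bytes, ptr);
  return { ptr, size: bytes.length };
}

// "C side": read the bytes back out, from ptr to ptr + size.
function readStringInC(ptr: number, size: number): string {
  return new TextDecoder().decode(HEAPU8.subarray(ptr, ptr + size));
}

const { ptr, size } = passStringToC("hello wasm");
```

The real binding works the same way, except malloc/free live in the wasm module and the "C side" read happens in native code rather than via TextDecoder.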

zikaari avatar Mar 26 '19 09:03 zikaari