
Add benchmarks

Open lppedd opened this issue 5 months ago • 46 comments

I'd like to set up a couple of benchmarks using kotlinx-benchmark.
The benchmarks should measure something we can compare with other ANTLR implementations, e.g., antlr4ng.
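For context, a kotlinx-benchmark measurement could be sketched roughly like this (the class, method, and helper names here are hypothetical, and the Gradle wiring is omitted):

```kotlin
import kotlinx.benchmark.*

// Hypothetical benchmark class; `parseFile` and `readTestFile` stand in
// for the real file loading and lexer + parser invocation.
@State(Scope.Benchmark)
class MySqlParserBenchmark {
    private lateinit var input: String

    @Setup
    fun prepare() {
        input = readTestFile("statements.txt") // hypothetical loader
    }

    @Benchmark
    fun parseStatements() {
        parseFile(input) // hypothetical: lex + parse the whole input
    }
}
```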

@mike-lischke hey! Would you be OK with us reproducing the same benchmarks for antlr-kotlin?

lppedd avatar Jan 29 '24 10:01 lppedd

Absolutely! It's all open source 😄

mike-lischke avatar Jan 29 '24 10:01 mike-lischke

Thank you!

To recap, it seems you:

  1. read a sample file
  2. split it based on a delimiter (not a real split, though: you seem to use only the ranges to take substrings of the original file)
  3. for each range, create a new parser instance, ~so the single time measurement is for that single range~ Edit: nope, looks like the measurement is for the entire file

Is that correct?

lppedd avatar Jan 29 '24 10:01 lppedd

Top down, it goes more like this:

  1. The splitter test -> not relevant for you, as it only measures how long it takes to determine statement ranges in a file. This has nothing to do with parsing and is just an additional number I wanted.
  2. Run the parser test once (cold run) -> collect the execution times and print them.
  3. Run the parser 5 more times (warm runs) -> collect the execution times.
  4. Remove the 2 slowest runs from step 3.
  5. Compute the average of the remaining 3 and print them.

A parser run consists of n loops, determined by the testFiles list. For each file:

  1. Determine statement ranges (not taken into the overall time needed).
  2. Go over each range and check if it matches a certain version number. I copied the tests from my VS Code unit tests, where this matters. You could settle on a single version and remove all entries that don't match it from the statements file. If the version matches, take a copy of that string part and parse it using a "parsing service", which is created on app start and kept in a separate class. So the parser is reused (otherwise you couldn't measure warm run times).
  3. Print any error found during parse.
  4. Compute the needed time and push it to the result list. So the length of the list correlates to the number of test files.
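The timing scheme described above (drop the two slowest of five warm runs, then average the remaining three) can be sketched like this; `averageWarmMillis` is a hypothetical helper, not code from the actual benchmark:

```kotlin
// Average the warm-run times after discarding the `drop` slowest runs.
// With five warm runs and drop = 2, this averages the three fastest.
fun averageWarmMillis(warmRunsMillis: List<Double>, drop: Int = 2): Double {
    require(warmRunsMillis.size > drop) { "need more runs than we drop" }
    return warmRunsMillis.sorted().dropLast(drop).average()
}
```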

mike-lischke avatar Jan 29 '24 12:01 mike-lischke

Thanks!

I've managed to emulate the same workflow in Kotlin, and now I find myself dealing with a stack overflow on the JVM with this statement:

SELECT
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
1
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
java.lang.StackOverflowError
	at com.strumenta.antlrkotlin.benchmarks.generated.MySQLParser.simpleExpr(MySQLParser.kt:45124)
	at com.strumenta.antlrkotlin.benchmarks.generated.MySQLParser.bitExpr(MySQLParser.kt:44062)
	at com.strumenta.antlrkotlin.benchmarks.generated.MySQLParser.predicate(MySQLParser.kt:43685)
	at com.strumenta.antlrkotlin.benchmarks.generated.MySQLParser.boolPri(MySQLParser.kt:43457)
	at com.strumenta.antlrkotlin.benchmarks.generated.MySQLParser.expr(MySQLParser.kt:43137)
	at com.strumenta.antlrkotlin.benchmarks.generated.MySQLParser.exprList(MySQLParser.kt:50557)
	at com.strumenta.antlrkotlin.benchmarks.generated.MySQLParser.simpleExpr(MySQLParser.kt:45278)

Removing it lets the run complete successfully, though.

lppedd avatar Jan 29 '24 13:01 lppedd

Yes, that's a tough one. When parsing this in a thread in C++, it crashes that thread unless you are on a platform that allows setting the thread stack size. This is a big problem in ANTLR, as crashing a thread usually crashes the entire app (nice DoS attack)!

The problem is the unlimited lookahead and the way prediction is implemented (as recursion instead of iteration). Additionally, expressions in MySQL (and many other languages) are highly nested, so a single opening paren alone can result in a dozen or more subrule invocations. That's why I love this one as a stress test.

Can you increase the stack size during execution?

mike-lischke avatar Jan 29 '24 14:01 mike-lischke

Btw, please share numbers once you've got it all running! And while you're at it, can you also run it against plain Java? I'm really interested in comparing everything.

mike-lischke avatar Jan 29 '24 14:01 mike-lischke

This is a big problem in ANTLR, as crashing a thread usually crashes the entire app (nice DoS attack)!

Some runtimes have methods to check whether there is enough free space on the thread stack for one more recursive call, for example EnsureSufficientExecutionStack in C#. In theory, such a check can be integrated into targets that support this feature. For other runtimes, it's possible to use a manual recursive-call counter and determine its limit (the maximum number of recursive calls) empirically.

Another way is to emulate recursive calls using a separate stack, but that requires much more implementation effort.
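A manual recursion counter of the kind described above could look roughly like this (a sketch, not actual runtime code; the class name and limit are made up):

```kotlin
// Guards a recursive computation with an explicit depth limit, so that it
// fails with a recoverable exception instead of a StackOverflowError.
class RecursionGuard(private val maxDepth: Int) {
    private var depth = 0

    fun <T> enter(block: () -> T): T {
        if (depth >= maxDepth) {
            throw IllegalStateException("recursion limit of $maxDepth exceeded")
        }
        depth++
        try {
            return block()
        } finally {
            depth--
        }
    }
}
```

A prediction routine would then wrap each of its recursive calls in `guard.enter { ... }`, with the limit chosen empirically as described.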

Can you increase the stack size during execution?

Yes, it should be possible, but it's not a reliable fix for the problem. A StackOverflowError should not occur in any case.
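On the JVM specifically, one workaround (without restarting the process with a bigger `-Xss`) is to run the parse on a dedicated thread constructed with an explicit stack size. A minimal sketch, with illustrative names:

```kotlin
// Runs `block` on a fresh thread whose requested stack size is `stackBytes`.
// Note the size is only a hint: the JVM/OS may round or ignore it.
fun <T> withLargeStack(stackBytes: Long = 512L * 1024 * 1024, block: () -> T): T {
    var result: Result<T>? = null
    val thread = Thread(null, { result = runCatching(block) }, "parser-thread", stackBytes)
    thread.start()
    thread.join()
    return result!!.getOrThrow()
}
```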

KvanTTT avatar Jan 29 '24 14:01 KvanTTT

The root cause should be removed instead of all these workarounds. In my next generation ANTLR project I plan to convert the recursive prediction code to become iterative and add an iteration limit. This has been requested before, to limit lookahead to some max value.

mike-lischke avatar Jan 29 '24 14:01 mike-lischke

Preliminary benchmark results on an i7-7820HQ - 32 GB.
The JS results are taken with measureTime, as kotlinx-benchmark for JS is blocked by https://github.com/Kotlin/kotlinx-benchmark/issues/185

| File | JS | WASM | JVM | mingwX64 | ANTLR4NG |
| --- | --- | --- | --- | --- | --- |
| statements.txt | 523.2868 ms | 226.27428 ms | 102.962 ms | 1481.9423 ms | 668.6744333306 ms |
| bitrix_queries_cut.sql | 282.2195 ms | 123.57261 ms | 56.586 ms | 742.65691 ms | 388.6220333178 ms |
| sakila-data.sql | 21531.5333 ms | 8683.92242 ms | 3652.251 ms | 58584.12632 ms | 28444.441866676 ms |

lppedd avatar Jan 29 '24 15:01 lppedd

I think on K/JS the major performance issue is the use of a string for the generated ATN, which must first be converted to a char array and then decoded into an int array. The JS and TS runtimes use an array of integers directly.

@KvanTTT not sure if we can do something to differentiate based on the platform.

lppedd avatar Jan 29 '24 15:01 lppedd

The root cause should be removed instead of all these workarounds. In my next generation ANTLR project I plan to convert the recursive prediction code to become iterative and add an iteration limit. This has been requested before, to limit lookahead to some max value.

@mike-lischke About your NG project, do you think that some ideas could be contributed to the just started effort on ANTLR 5? I am just asking out of curiosity as both your project and ANTLR 5 seem to aim to build the next generation of ANTLR and there is surely ground to cross-pollinate ideas

ftomassetti avatar Jan 29 '24 15:01 ftomassetti

The root cause should be removed instead of all these workarounds. In my next generation ANTLR project I plan to convert the recursive prediction code to become iterative and add an iteration limit.

Yes, it's a good solution; I also mentioned that.

I think on K/JS the major performance issue is the use of a string for the generated ATN, which must then be converted to a char array, and decoded to an int array.

I'm not sure it's a hot path, since this String is deserialized only once at app startup. But anyway, I think using an array of Int instead of a String is a better option (it requires some effort on the JVM side because it crashes on big arrays, but it's possible to split one huge int array into several smaller ones).
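For reference, the JVM limit at play here is not the array allocation itself but the 64 KiB bytecode cap per method, which a single huge `intArrayOf(...)` literal in an initializer exceeds at compile time. The split-and-concatenate workaround mentioned above can be sketched like this (the segment contents are placeholders, not real ATN data):

```kotlin
// Each segment stays small enough for its initializer to compile;
// at load time the segments are concatenated into the full serialized ATN.
val SEGMENT_0 = intArrayOf(4, 1, 60, 285) // placeholder data
val SEGMENT_1 = intArrayOf(2, 0, 7, 5)    // placeholder data

fun concatSegments(vararg segments: IntArray): IntArray {
    val result = IntArray(segments.sumOf { it.size })
    var offset = 0
    for (segment in segments) {
        segment.copyInto(result, destinationOffset = offset)
        offset += segment.size
    }
    return result
}
```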

@mike-lischke About your NG project, do you think that some ideas could be contributed to the just started effort on ANTLR 5? I am just asking out of curiosity as both your project and ANTLR 5 seem to aim to build the next generation of ANTLR and there is surely ground to cross-pollinate ideas

It's a great idea but I'll email you about the current situation (not sure it's public info).

KvanTTT avatar Jan 29 '24 15:01 KvanTTT

@KvanTTT Terence once wrote

Java does not properly handle static initialized integer arrays

So what he meant was that the JVM simply crashes?

There is a big difference between the 150/200 ms of antlr4ng and the 300/600 ms of K/JS, so I wonder what's so heavy on our side.

lppedd avatar Jan 29 '24 15:01 lppedd

So what he meant was that the JVM simply crashes?

True; unfortunately the JVM can't handle huge arrays out of the box, unlike other runtimes. But it's probably possible to use direct memory or to split into several smaller arrays. Actually, this array is required only once, during ATN deserialization, and there is no need to keep it in memory.

There is a big difference between 150/200ms of antlr4ng and 300/600ms of K/JS, so I wonder what's so heavy on our side.

What about the other parsing stages? They probably consume much more time.

KvanTTT avatar Jan 29 '24 15:01 KvanTTT

Preliminary benchmark results on an i7-7820HQ - 32 GB. The JS results are taken with measureTime, as kotlinx-benchmark for JS is blocked by Kotlin/kotlinx-benchmark#185

| File | JS | JVM |
| --- | --- | --- |
| bitrix_queries_cut.sql | 300ms | 58ms |
| statements.txt | 600ms | 107ms |

The different machines might have an impact here. Have you tried running the pure JS benchmarks? My box is a Mac Studio M1 Max - 32 GB.

mike-lischke avatar Jan 29 '24 15:01 mike-lischke

I think on K/JS the major performance issue is the use of a string for the generated ATN, which must then be converted to a char array, and decoded to an int array. The JS and TS runtimes use an array of integers directly.

I don't think this is a real issue, given that this happens only once per parser/lexer class load. That would impact only the cold run time. Once the parser is warm the ATN is in memory.

mike-lischke avatar Jan 29 '24 15:01 mike-lischke

The root cause should be removed instead of all these workarounds. In my next generation ANTLR project I plan to convert the recursive prediction code to become iterative and add an iteration limit. This has been requested before, to limit lookahead to some max value.

@mike-lischke About your NG project, do you think that some ideas could be contributed to the just started effort on ANTLR 5? I am just asking out of curiosity as both your project and ANTLR 5 seem to aim to build the next generation of ANTLR and there is surely ground to cross-pollinate ideas

It was a little surprising to me that ANTLR5 got started. But that's what's so great about open source: everyone can do what they like. I prefer TypeScript and have (mostly) different goals than ANTLR5. I'm on the way to radically simplifying ANTLR (just today I brought the TS runtime tests to under 10 seconds, just by reorganizing things).

mike-lischke avatar Jan 29 '24 16:01 mike-lischke

True; unfortunately the JVM can't handle huge arrays out of the box, unlike other runtimes.

Really? You can just allocate an array with up to 2^31 - 1 entries (more than enough for the ATN). But I'm not a Java expert, so...

But it's probably possible to use direct memory or splitting into several smaller arrays.

ANTLR started out with split arrays :-D A while ago there were sections that got concatenated at load time. Using a string, however, is anything but optimal, as it involves charset-related functionality for conversion. Still, it's not a hot path and hence not worth much optimization work.

mike-lischke avatar Jan 29 '24 16:01 mike-lischke

@mike-lischke you were right about the machine differences. I've cloned your repo and run the benchmarks.
I've updated the results, and they're comparable now; see the table above.

lppedd avatar Jan 29 '24 16:01 lppedd

ANTLR started out with split arrays :-D A while ago there were sections that got concatenated at load time. Using a string, however, is anything but optimal, as it involves charset-related functionality for conversion. Still, it's not a hot path and hence not worth much optimization work.

I've also wondered why there was a switch to using a string, if we could just split the arrays. @KvanTTT, am I right that this is something you looked at a year or so ago (maybe in 2022)?

lppedd avatar Jan 29 '24 16:01 lppedd

I don't remember why it was implemented that way; maybe it was just easier. I've found a related PR: https://github.com/antlr/antlr4/pull/3591

KvanTTT avatar Jan 29 '24 18:01 KvanTTT

I'll open a PR as soon as the others are merged, just to avoid stacking too many changes.

I've removed all parse error catching, so if it fails it's unrecoverable, and I've not encountered any failures (apart from the stack overflow). Benchmark times are consistent, as shown in the updated table.

lppedd avatar Jan 29 '24 23:01 lppedd

I've removed all parse error catching, so if it fails it's unrecoverable, and I've not encountered any failures (apart from the stack overflow).

You mean all the catch clauses in the generated code? Interesting idea, but besides the fact that recovery is no longer possible (is this really needed in everyday applications?), you also cannot do the two-stage parse approach that uses SLL and LL prediction modes with the bail-out error strategy.
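The two-stage approach referenced here is the classic ANTLR pattern: parse with fast SLL prediction and a bail-out error strategy first, and reparse with full LL only if that fails. A sketch against a hypothetical generated MySQLParser (the entry rule and generated class names are illustrative, and package imports depend on the runtime in use):

```kotlin
import org.antlr.v4.runtime.*
import org.antlr.v4.runtime.atn.PredictionMode
import org.antlr.v4.runtime.misc.ParseCancellationException

fun parseTwoStage(sql: String): ParserRuleContext {
    val lexer = MySQLLexer(CharStreams.fromString(sql))
    val tokens = CommonTokenStream(lexer)
    val parser = MySQLParser(tokens)

    // Stage 1: SLL prediction; bail out on the first syntax error.
    parser.errorHandler = BailErrorStrategy()
    parser.interpreter.predictionMode = PredictionMode.SLL
    return try {
        parser.query()
    } catch (e: ParseCancellationException) {
        // Stage 2: rewind and reparse with full LL prediction and recovery.
        tokens.seek(0)
        parser.reset()
        parser.errorHandler = DefaultErrorStrategy()
        parser.interpreter.predictionMode = PredictionMode.LL
        parser.query()
    }
}
```

With the generated catch clauses removed, stage 1 could never hand control back cleanly, which is the point being made above.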

mike-lischke avatar Jan 30 '24 08:01 mike-lischke

You mean all catch clauses in generated code

Oh no, no. Initially I had copied the try-catch you have in the parsing service, which accumulates errors.
I did get it wrong though, and by ignoring the caught exception I didn't know I had statements that were failing. It was because of wrong charsets, e.g., I had big5 instead of _big5.

Now I don't catch any exception in the benchmarking code.

lppedd avatar Jan 30 '24 10:01 lppedd

I've just added a benchmark for the WASM (wasmJs) target. See updated table.

Overall it seems WebAssembly on Node.js cuts the times in half or more compared to JS.

lppedd avatar Jan 31 '24 22:01 lppedd

Interesting numbers! If you could add C++ and C# (and maybe Go) too this would be the long awaited performance comparison chart for the ANTLR runtimes :-)

Can you explain more about how the WASM target is structured? Is it just the runtime, used with a Java/Kotlin wrapper, or is everything compiled to a WASM module, including the generated parser/lexer? How is the WASM used then?

And another one: are the JS and WASM outputs both generated from the Kotlin code? Btw, I'm surprised you can generate useful WASM code from Kotlin, given that this is marked as being in alpha state.

mike-lischke avatar Feb 01 '24 07:02 mike-lischke

Can you explain more how the WASM target is structured?

The runtime and the benchmark code are both compiled down to optimized WebAssembly binaries.
The only place where I do interop with JavaScript is to access Node's fs module to read files, but that's outside of the time measurements.

Kotlin outputs a small JS wrapper as entry point, which will load the WASM binary:

if (isNodeJs) {
  // ...
  const wasmBuffer = fs.readFileSync(path.resolve(dirpath, wasmFilePath));
  const wasmModule = new WebAssembly.Module(wasmBuffer);
  wasmInstance = new WebAssembly.Instance(wasmModule, importObject);
}

And another one: the JS and WASM output are both generated from Kotlin code?

Yes. Even though the WebAssembly target is in alpha, the target itself has been around for quite some time, so I think the accumulated experience helped get it working very well immediately.

The benchmark results are in line with what has been shown here.

lppedd avatar Feb 01 '24 08:02 lppedd

OK, thanks, that explains the good performance. Since Kotlin itself compiles to WASM, it makes no difference for a developer working with Kotlin code. However, a C++ developer, say, who wants the parser code to be in C++ cannot use this WASM binary, right? Same for any language other than Kotlin. Is that the correct understanding of the setup here?

mike-lischke avatar Feb 01 '24 09:02 mike-lischke

However a, say, C++ developer, who wants the parser code to be in C++ cannot use this WASM binary, right?

You can use the binary if you have a runtime for WASM, which is language/platform dependent.
In a JavaScript setting, both Node.js and the browser are themselves WASM runtimes, so you don't need any extra dependency.

Now, I'm not sure if the exported functions can actually be called from all runtimes, but I think the answer is going to be yes in the future.

lppedd avatar Feb 01 '24 09:02 lppedd

IIUC, the generated parsers, lexers, visitors, and listeners are also part of the WASM binary, correct? If so, that would run contrary to the idea of having these files in the target language.

mike-lischke avatar Feb 01 '24 09:02 mike-lischke