
Normalize JVM options for all JVM-based tests

emilypi opened this issue on Mar 22 '18 · 9 comments

Hi all - I'm raising this issue regarding some of the JVM tests I've seen. A true benchmark comparison of the JVM frameworks is impossible if arbitrarily many JVM flags are allowed at setup. I would suggest a policy covering common memory settings, JVM flags, and garbage collectors, so we don't end up with discrepancies and unfair benchmarks simply because some tests run with more favorable options.

For instance, Scala + Play2 is running with the following flags:

-J-server -J-Xms1g -J-Xmx1g -J-XX:NewSize=512m -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=30 -J-XX:-UseBiasedLocking -J-XX:+AlwaysPreTouch

These are highly nonstandard, while Scala + http4s is running a relatively honest campaign with:

java -jar target/scala-2.12/http4s*one-jar.jar "${DBHOST}" &

Entries in other languages, such as Clojure + Aleph, are also running nonstandard options in setup.sh:

java -server -Xmx2g -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -jar target/*-standalone.jar

Before the next benchmark, would it be possible to enforce some form of normality, or is it just a free-for-all?

emilypi · Mar 22 '18

There have been some efforts and discussions about this in the past (e.g., #2652, and offline/other media).

We do have a force pushing against configuration normalization. It's not a strong force, but it's worth discussing: we have been, and continue to be, permissive of frameworks and platforms being opinionated about production deployment configuration.

For example, our historical stance has been that using fine-tuning options such as -XX:MaxGCPauseMillis=10 is acceptable for individual frameworks at their discretion.

I see three options:

  1. Leave things as is, which is highly permissive, but also chaotic in that each framework contribution is expected to tune for our test environment(s), which is itself not a simple matter. I agree with @emilypi: this causes counterproductive variance.

  2. Specify and require a specific set of JVM arguments and disallow variation. This potentially maximizes the utility of direct comparisons, but marginalizes any frameworks that are opinionated about fine-tuning options.

  3. Specify and require some arguments of significant impact such as heap size and garbage collector (e.g., G1), but allow custom fine-tuning such as -XX:MaxGCPauseMillis=10.

Presently, I am leaning toward option 3, but would like to hear further opinions.
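
To make option 3 concrete, a required baseline could pin the highest-impact settings while leaving fine-tuning open. The values below are only a hypothetical sketch for discussion, not a proposal of specific numbers:

    # required for every JVM test (hypothetical baseline)
    -server -Xms2g -Xmx2g -XX:+UseG1GC
    # still permitted per framework, at the contributor's discretion
    -XX:MaxGCPauseMillis=10 -XX:+AlwaysPreTouch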

Note that enforcement of either option 2 or 3 would need to be an ongoing community effort. We might want to provide platform-specific instructions to contributors at some point, which could help communicate specific guidance such as this. But to set expectations: even with such documentation, it would not surprise me if future JVM tests slipped through without complying.

Another matter, which has been raised before, is allowing arguments to vary between environments. That is, provide a different set of arguments for Citrine (physical hardware) versus Azure (cloud). This should probably be a separate issue.

bhauer · Mar 22 '18

Thanks for the rundown @bhauer. I'm glad there's been discussion around this, and I think option 3 is certainly a good compromise. As you said, some frameworks will necessarily require slight variance because they are opinionated, and defining a set of high-impact settings that should be standard across tests would certainly solve the problem.

I recognize to some degree that people maintaining these projects will need to police themselves a bit, and I'd be happy to do my part in pointing out problems as I go through these tests. We'll see what others have to say 😄

emilypi · Mar 22 '18

Where do y'all currently stand on this issue and is it something that the community could help with?

JamesMcMahon · Oct 04 '18

One thing I would add is that the JVM options are not the only noise affecting the signal. For instance, the Vertx tests use a different HTML templating engine than the Spring tests (Rocker vs Mustache).

Not sure how much that affects rendering the HTML in the Fortune test, but I would guess there is a difference, and when comparing tests it muddies the waters around framework performance differences.

JamesMcMahon · Oct 04 '18

I'll approach the matter from another angle: JVM parameters are only one side of the story. The libraries used (including template engines) and database drivers are also very significant. Take, for example, vertx, vertx-web, and wizzardo-http. Currently vertx and vertx-web are using a relatively old version of reactive-pg-client. In the recent results from the Citrine environment, wizzardo-http has taken the lead in Single query, Multiple queries, and Data updates. I'm betting that upgrading the driver to a later version would put vertx, vertx-web, and wizzardo-http neck and neck.

The details:
https://github.com/TechEmpower/FrameworkBenchmarks/blob/c06a3b02f5097a87461781c7bcfdecab37b0459c/frameworks/Java/vertx-web/pom.xml#L36-L40
https://github.com/TechEmpower/FrameworkBenchmarks/blob/c06a3b02f5097a87461781c7bcfdecab37b0459c/frameworks/Java/vertx/pom.xml#L20-L24
https://github.com/TechEmpower/FrameworkBenchmarks/blob/c06a3b02f5097a87461781c7bcfdecab37b0459c/frameworks/Java/wizzardo-http/build.gradle#L19-L22

Next, let's check the recent entry es4x, which is a JavaScript API for the vertx framework. https://twitter.com/TFBenchmarks retweeted this: https://twitter.com/pml0pes/status/1044555559670861824. But check out the dependencies: https://github.com/TechEmpower/FrameworkBenchmarks/blob/c06a3b02f5097a87461781c7bcfdecab37b0459c/frameworks/JavaScript/es4x/package.json#L14-L18 The code is using more recent versions than the Java vertx dependencies - both the vertx library version and the DB driver. Yes, the GraalVM result is very respectable, but the comparison is flawed...

I also recall the update of Jackson in the servlet project that helped servlet come out on top in JSON serialization. Note that these results are on the same hardware:

Round 13: 375,472 (69.0% of the best)
Round 14: 560,548 (100% of the best)
Round 15: 657,791 (96.4% of the best)

And one last historical reference: https://github.com/TechEmpower/FrameworkBenchmarks/pull/2626

Another similar aspect: spring-webflux and netty (maybe other implementations use this technique too, I haven't checked) update the Date response header only once per second instead of formatting it for every request. It's a nice trick - you just have to be aware of it. The same goes for the use of a separate JSON writer instance per thread in revenj.jvm.
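
In rough terms, the two tricks look something like the following minimal Java sketch (for illustration only; this is not the actual netty, spring-webflux, or revenj.jvm code):

    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    final class HotPathTricks {
        // Trick 1: cache the formatted Date header and refresh it once per
        // second, instead of re-formatting it for every single response.
        private static volatile String dateHeader = httpDate();

        static {
            Executors.newSingleThreadScheduledExecutor()
                     .scheduleAtFixedRate(() -> dateHeader = httpDate(), 1, 1, TimeUnit.SECONDS);
        }

        private static String httpDate() {
            return DateTimeFormatter.RFC_1123_DATE_TIME.format(ZonedDateTime.now(ZoneOffset.UTC));
        }

        static String currentDateHeader() {
            return dateHeader;
        }

        // Trick 2: keep one JSON writer per thread (a plain StringBuilder here
        // stands in for a real serializer), avoiding allocation and contention.
        private static final ThreadLocal<StringBuilder> JSON_WRITER =
                ThreadLocal.withInitial(() -> new StringBuilder(512));
    }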

A third, similar problem: the configuration of the application server's thread pools, process instances, and DB connection pools.
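
Even something as small as the DB connection pool size moves the numbers. A hypothetical sketch using HikariCP (just one common pool implementation, not necessarily what any given test uses):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    final class PoolSetup {
        static HikariDataSource dataSource(String jdbcUrl) {
            HikariConfig cfg = new HikariConfig();
            cfg.setJdbcUrl(jdbcUrl);
            // Sizing relative to the machine's core count: yet another knob
            // that varies from test to test and affects the results.
            cfg.setMaximumPoolSize(Runtime.getRuntime().availableProcessors() * 2);
            return new HikariDataSource(cfg);
        }
    }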

All of the above prevents direct comparison of the frameworks, so it's more than tricky. It always requires some digging in the source code before drawing conclusions. Running the tests on your own hardware is also very important. Maybe adding another "implementation approach" where most of the stuff is normalized and nothing "exotic" is allowed would help, but the effort of reviewing and contributing would become quite high. And the bar is already quite high - note how many participants don't have all the tests implemented.

So I'm for option 1 - free for all. If the authors have explored the JVM options and found benefits, let them. It's some sort of reward. Yes, it causes some inconvenience for the readers, but it also encourages them to explore and learn.

zloster · Oct 09 '18

So I'm for option 1 - free for all. If the authors have explored the JVM options and found benefits, let them. It's some sort of reward. Yes, it causes some inconvenience for the readers, but it also encourages them to explore and learn.

I was on board until this conclusion. I think you lost the plot a bit with all of this. This is supposed to be a benchmark. If we allow a free-for-all, then what exactly are we benchmarking? People's JVM lore skills? If so, one might as well not consider any of these benchmarks valid and change the name of the repo to TechEmpower Perf Games. The point of the benchmarks is to provide some semblance of understanding of the performance of a framework under standardized hardware and load conditions. This is useful to people when weighing the merits of a framework. If it comes down to playing a game of "how many JVM flags can we switch on to achieve better performance", then we fail to achieve the standard conditions that would say anything meaningful about a given test.

That is, unless TechEmpower would rather play the perf game, in which case, go for it. It's your repo 😄

emilypi · Oct 11 '18

In production you also get to tune VM parameters and drivers. So let it be up to each framework to decide its best params/drivers to get the best results for that framework.

flip111 · Mar 31 '19

I wonder if there will be support for GraalVM. More info about performance on GraalVM: https://blog.codecentric.de/en/2020/05/spring-boot-graalvm/

amorenew · Jun 05 '20

To be frank, perf games are not very far from this repo's actual purpose, which is in itself not such a bad thing. I've been pondering this for some time, and realized that the perf games happen not only at the framework level but also at the language/runtime level (see the dotnet core improvements driven by analysis of these framework benchmarks).

If we acknowledge this, then the way forward could be another set of maintainers interested in seeing their favorite language at the top positions, rather than looking at specific frameworks. The benefit would be finding performance bottleneck patterns and reporting them back upstream.

As a result, we should see convergence toward best-of-breed implementations and generally higher-quality optimizations.

Of course, we need to discuss opt-in vs. opt-out; I'm currently favoring the former.

otrosien · Sep 24 '23