How to get stable scores?
I ran the suite twice with `-r 100`, but the score of the first run is very different from the second, with about 20% variation. Is there a way to keep the variation under 5%?
Which benchmark did you run, on what JVM, on what OS, and on what hardware?
Some benchmarks in the suite have relatively high variance. This is a sample of results I have collected recently with OpenJDK; the numbers give the coefficient of variation (CV) between individual samples for each benchmark (a small sketch of how such a CV is computed follows the table):
| benchmark | CV between samples |
|---|---|
| chi-square | 6% |
| dec-tree | 7% |
| finagle-chirper | 2% |
| fj-kmeans | 1% |
| future-genetic | 3% |
| log-regression | 11% |
| mnemonics | 0% |
| naive-bayes | 3% |
| neo4j-analytics | 2% |
| par-mnemonics | 8% |
| philosophers | 1% |
| reactors | 3% |
| rx-scrabble | 1% |
| scala-kmeans | 0% |
| scala-stm-bench7 | 7% |
| scrabble | 10% |
| akka-uct | 4% |
| als | 1% |
| db-shootout | 26% |
| dotty | 9% |
| finagle-http | 4% |
| gauss-mix | 2% |
| movie-lens | 1% |
| page-rank | 2% |
| scala-doku | 1% |
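In case it helps to see what these percentages mean concretely, here is a minimal sketch of how a CV like the ones above can be computed from per-repetition times. The numbers in it are made-up placeholders, not actual Renaissance results:

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical per-repetition running times (ms) after discarding warm-up.
times_ms = [1520, 1488, 1503, 1612, 1495, 1534]
print(f"CV between samples: {coefficient_of_variation(times_ms):.1f}%")
```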
Also, on some virtual machines it is quite common for the JIT compiler to produce slightly different code every time the benchmark is run, leading to different performance. To get a complete picture of performance, more runs are typically needed in such environments. Similarly to the table above, here is a table that gives the coefficient of variation between separate runs of the same benchmark:
| benchmark | CV between runs |
|---|---|
| chi-square | 12% |
| dec-tree | 2% |
| finagle-chirper | 6% |
| fj-kmeans | 1% |
| future-genetic | 2% |
| log-regression | 3% |
| mnemonics | 3% |
| naive-bayes | 6% |
| neo4j-analytics | 6% |
| par-mnemonics | 3% |
| philosophers | 2% |
| reactors | 1% |
| rx-scrabble | 1% |
| scala-kmeans | 19% |
| scala-stm-bench7 | 1% |
| scrabble | 2% |
| akka-uct | 1% |
| als | 1% |
| db-shootout | 4% |
| dotty | 1% |
| finagle-http | 2% |
| gauss-mix | 34% |
| movie-lens | 2% |
| page-rank | 3% |
| scala-doku | 1% |
As always, YMMV, but if you see results wildly different from the above, please provide details.
@ceresek The benchmarks in my test results are log-regression and scala-kmeans, with variation similar to yours. How should I test such benchmarks and determine whether there is a performance regression? Thanks.
Any info about the platform? The usual recommendation is to start with platform factors that can affect performance variability. On the JVM side, that might mean setting the heap size with the `-Xms` and `-Xmx` switches to reduce heap resizing during execution. It is also useful to check the frequency scaling and power management settings of your platform, to avoid hyperthreading and turbo boost, and, obviously, to reduce any interfering load as much as possible (that means not running other applications in parallel, not running inside a virtual machine on shared hardware, and so on).
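For illustration, here is a rough sketch of a driver that performs repeated benchmark runs with a pinned heap. The jar name, benchmark, heap size, and run count are assumptions to adapt to your setup, not something prescribed by the suite:

```python
import subprocess
from pathlib import Path

JAR = "renaissance-gpl-0.14.1.jar"   # assumed bundle name; use your actual jar
BENCHMARK = "log-regression"         # assumed benchmark of interest
REPETITIONS = "100"
RUNS = 10
HEAP = "4G"                          # -Xms == -Xmx to avoid heap resizing

out_dir = Path("results")
out_dir.mkdir(exist_ok=True)

for run in range(RUNS):
    log = out_dir / f"{BENCHMARK}-run{run:02d}.log"
    cmd = [
        "java", f"-Xms{HEAP}", f"-Xmx{HEAP}",
        "-jar", JAR, BENCHMARK, "-r", REPETITIONS,
    ]
    # Capture the harness output of each run in its own log file for later parsing.
    with log.open("w") as f:
        subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT, check=True)
```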
Once you have taken the precautions above, to the degree that your situation permits, the only other way to get a less variable performance estimate is to perform more measurements and work with averages or confidence intervals. The usual approach is to:

1. Visually inspect the data to determine the benchmark warm-up duration, and discard any measurements from the warm-up phase of the execution.
2. Perform multiple runs, typically on the order of 10-30, and collect a few minutes of samples per run (assuming the warm-up is also on the order of a few minutes).
3. Compute a confidence interval for the mean using a hierarchical bootstrap procedure.

The width of the confidence interval (in effect, the certainty of your result) improves roughly with the square root of the number of samples collected (that is, if you want an estimate twice as accurate, you will need roughly four times as many samples).
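For illustration, here is a rough sketch of the hierarchical bootstrap step. This is not the actual analysis script, just the general idea of resampling runs first and samples within each selected run second; the input data and iteration count are made up:

```python
import random
import statistics

def hierarchical_bootstrap_ci(runs, iterations=10000, confidence=0.95):
    """runs: list of lists, one list of post-warm-up times per run."""
    means = []
    for _ in range(iterations):
        resampled = []
        # Level 1: resample whole runs with replacement.
        for run in random.choices(runs, k=len(runs)):
            # Level 2: resample repetitions within the chosen run.
            resampled.extend(random.choices(run, k=len(run)))
        means.append(statistics.mean(resampled))
    means.sort()
    lo = means[int((1 - confidence) / 2 * iterations)]
    hi = means[int((1 + confidence) / 2 * iterations) - 1]
    return lo, hi

# Toy data: 3 runs with a handful of repetition times (ms) each.
runs = [[1510, 1498, 1525], [1604, 1588, 1612], [1502, 1515, 1491]]
print(hierarchical_bootstrap_ci(runs, iterations=1000))
```

Any statistics package will do; the point is that the resampling respects the run/sample hierarchy, so run-to-run variation is not underestimated.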
I am not sure how deep into the statistical side of things you are; if you want, I can provide references to some scripts that do the required computations.
Thanks, it would be great if you could provide references to the scripts. I will try this on Debian 10 for x86_64, aarch64, and the new loongarch64 architecture.
This took a bit longer than expected, sorry - you can now see the scripts in the Renaissance R Utilities repo. The scripts are written in R but should be reasonably easy to run or even port if you need the computation in some other language.
Closing, please reopen or open an issue in the other repo if you would benefit from other output.