
JMH For Automated Runtime Regression Check

Open lessthanoptimal opened this issue 4 years ago • 8 comments

  • Automatically run all JMH in the project and save the results
  • Then create a script that checks to see if there is a change greater than 10% and flags those tests as potential regressions
  • Will need to handle idiots (i.e. me) accidentally committing modifications to a benchmark, like commenting out functions or changing the matrix size

Right now only a few dense real functions have any runtime regression checks as part of Java Matrix Benchmark. Running 10 functions takes about 1.5 days since it checks matrices of size 2 to 40,000. This would maybe check one small and one "large" matrix.

This would run twice a week and send an e-mail report. I've got Python code for all of that.
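
As a rough illustration of the flagging step (the actual tooling is Python, per above; this Java sketch only shows the comparison logic, and the map-based input format is an assumption):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RegressionCheck {
    /**
     * Flags benchmarks whose average runtime moved by more than the given
     * fraction (e.g. 0.10 for the 10% threshold mentioned above) relative
     * to the saved baseline. Times are keyed by benchmark name.
     */
    public static List<String> flag( Map<String, Double> baseline,
                                     Map<String, Double> current,
                                     double tolerance ) {
        List<String> flagged = new ArrayList<>();
        for (var e : baseline.entrySet()) {
            Double now = current.get(e.getKey());
            if (now == null)
                continue; // benchmark removed or renamed; handle separately
            double change = Math.abs(now - e.getValue())/e.getValue();
            if (change > tolerance)
                flagged.add(e.getKey());
        }
        return flagged;
    }
}
```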

lessthanoptimal avatar Oct 04 '20 03:10 lessthanoptimal

Finished and is up and running. See example output below.

Repeatability is going to be a significant issue: everything flagged is a false positive, since the code hasn't changed. This could be caused by the CPU entering boost mode, where it runs at full throttle until it gets too hot. It actually couldn't complete this benchmark until the NUC it's running on had been dusted; it had been running for about 2 years without issue prior to this.

EJML Runtime Regression

  files    benchmarks   flagged  exceptions
    46        523          7          0

Duration: 1.55 hrs
Date:     2020-12-30 03:22:31 UTC
Version:  0.41-SNAPSHOT
SHA:      37353c915cb053265e732a34fd109fb231e234a8
GIT_DATE: 2020-12-22T17:18:47Z

java.runtime.version:  14+36
java.vendor:           Azul Systems, Inc.
os.name+arch:          Linux amd64
os.version:            4.15.0-128-generic

Flagged:
   43.7% org.ejml.dense.row.BenchmarkCommonOps_ZDRM.csv:multAddTransB_alpha:1000
  174.4% org.ejml.dense.row.BenchmarkCommonOps_ZDRM.csv:multAddTransA_alpha:1000
  143.2% org.ejml.dense.row.BenchmarkCommonOps_DDRM.csv:scale_sA:5
   70.1% org.ejml.dense.row.BenchmarkCommonOps_DDRM.csv:elementDiv_AB:5
   56.2% org.ejml.dense.row.BenchmarkCommonOps_MT_DDRM.csv:transpose:5
   52.8% org.ejml.dense.row.BenchmarkCommonOps_MT_DDRM.csv:multTransAB_sAAA:5
   69.5% org.ejml.dense.row.decomposition.decompose.BenchmarkDecompositionCholesky_MT_DDRM.csv:block:100

lessthanoptimal avatar Dec 30 '20 03:12 lessthanoptimal

@FlorentinD @szarnyasg @breandan Any thoughts on improving repeatability? My current theory is posted in the comment above.

lessthanoptimal avatar Dec 30 '20 03:12 lessthanoptimal

I think it's easy to read. The numbers after the benchmark names are parameters, right? Maybe you could add the error rate for the flagged benchmarks to make it easier to spot flaky ones.

FlorentinD avatar Dec 30 '20 08:12 FlorentinD

Unfortunately it can be very consistent within a single batch. It's currently configured to do 2 warm-up runs, then 3 test runs. All 3 test runs might be within 5% of each other, then the next time the benchmark runs the test runs might be 2x slower, but still within 5% of each other. Adding the error will be interesting information, but it's not going to solve this problem.
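
For context, a JMH benchmark configured that way looks roughly like the sketch below; the benchmark body and array size are placeholders, not EJML's actual benchmark code:

```java
import org.openjdk.jmh.annotations.*;
import java.util.Random;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 2)      // the 2 warm-up runs, results discarded
@Measurement(iterations = 3) // the 3 test runs that get compared
@Fork(1)
@State(Scope.Benchmark)
public class BenchmarkSketch {
    double[] data = new double[1_000_000];

    @Setup
    public void setup() {
        Random rand = new Random(234);
        for (int i = 0; i < data.length; i++)
            data[i] = rand.nextDouble();
    }

    @Benchmark
    public double sum() {
        double total = 0;
        for (double v : data)
            total += v;
        return total;
    }
}
```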

lessthanoptimal avatar Dec 30 '20 14:12 lessthanoptimal

Good question. I don't think it's possible to run meaningful microbenchmarks on laptop hardware (including NUCs) when the performance differences between commits are small, as it's basically impossible to observe changes of a few percent with so much noise. Cloud virtual machines are not much better. To get reasonably stable results, one would need to use workstation hardware or bare-metal servers.

This seems to be a problem that others have stumbled upon, but I could not yet find a post on it. Loosely related posts:

szarnyasg avatar Dec 30 '20 14:12 szarnyasg

Thanks for the links. I should mention (and add to the output file) that the current threshold is a 40% change, using either the current or baseline time as the divisor. The flagged operations do seem to be faster operations. I think I've got JMH configured to run for a minimum of 1 second as it computes the average.
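
Read that way, the check amounts to something like the hypothetical helper below (taking whichever divisor produces the larger ratio; my interpretation of the description, not the actual script):

```java
public class Threshold {
    /** Flags when the slower of the two times exceeds the faster by more
     *  than 40%, whichever direction the change went. */
    public static boolean isFlagged( double baseline, double current ) {
        double ratio = Math.max(baseline/current, current/baseline);
        return ratio - 1.0 > 0.40;
    }
}
```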

lessthanoptimal avatar Dec 30 '20 15:12 lessthanoptimal

Starting to look like this is an interesting problem, which is unfortunate since I was hoping it would be boring. One of those articles suggested using the minimum instead of the mean and pointed out that the distribution is likely to be log-normal rather than normal. I added a summary statistic that samples the distribution, and it looks like around 85% of the tests are within 5% tolerance and 50% are within 0.7%.

I'm going to try switching to using the min value and re-running tests that fail, several times if needed, until they either pass or there is sufficient evidence that something has changed.
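
Sketched out, that retry logic might look like the following; `runBenchmark` is a stand-in for invoking JMH, and everything here is an assumption about the eventual implementation:

```java
import java.util.concurrent.Callable;

public class RetryCheck {
    /**
     * Re-runs a flagged benchmark up to maxAttempts times, tracking the
     * minimum observed time. The minimum is far less sensitive to the long
     * right tail of a roughly log-normal timing distribution than the mean.
     */
    public static boolean confirmRegression( Callable<Double> runBenchmark,
                                             double baselineMin,
                                             double tolerance,
                                             int maxAttempts ) throws Exception {
        double best = Double.MAX_VALUE;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            best = Math.min(best, runBenchmark.call());
            if (best/baselineMin - 1.0 <= tolerance)
                return false; // passed at least once; treat as noise
        }
        return true; // consistently slower; likely a real change
    }
}
```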

lessthanoptimal avatar Jan 04 '21 19:01 lessthanoptimal

I've made a bit of progress on this front. See https://github.com/lessthanoptimal/ejml/pull/125 for details. I've ordered a computer with a Xeon, and maybe that will help. Until then, I'll continue abusing this NUC.

lessthanoptimal avatar Jan 18 '21 19:01 lessthanoptimal

Finished this a while ago. Been working nicely ever since.

lessthanoptimal avatar Jan 14 '23 15:01 lessthanoptimal