JMH For Automated Runtime Regression Check
- Automatically run all JMH benchmarks in the project and save the results
- Then create a script that checks whether there is a change greater than 10% and flags those tests as potential regressions (see the sketch below)
- Will need to handle idiots (i.e. me) accidentally committing modifications to a benchmark, like commenting out functions or changing the matrix size
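For illustration, a minimal sketch of that flagging step. The map-based `baseline`/`current` inputs and the helper name are assumptions for this example, not the actual script:

```java
import java.util.Map;

public class RegressionCheck {
    /** Flags benchmarks whose runtime changed by more than 10% vs the baseline. */
    public static void flag(Map<String, Double> baseline, Map<String, Double> current) {
        for (Map.Entry<String, Double> e : current.entrySet()) {
            Double base = baseline.get(e.getKey());
            if (base == null)
                continue; // benchmark was added, renamed, or modified; skip it
            double change = Math.abs(e.getValue() - base) / base;
            if (change > 0.10)
                System.out.printf("%5.1f%% %s%n", 100.0 * change, e.getKey());
        }
    }
}
```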
Right now only a few dense real functions have any runtime regression checks, as part of Java Matrix Benchmark. 10 functions take about 1.5 days to run since it checks matrices from size 2 to 40,000. This would instead check one small and one "large" matrix.
This would run twice a week and send an e-mail report. I've got Python code for all of that.
Finished and is up and running. See example output below.
Repeatability is going to be a significant issue: everything flagged below is a false positive, since the code hasn't changed. This could be caused by the CPU entering boost mode, where it runs at full throttle until it gets too hot. It actually couldn't complete this benchmark until the NUC it's running on had been dusted; it had been running for about 2 years without issue before this.
EJML Runtime Regression
files  benchmarks  flagged  exceptions
46     523         7        0
Duration: 1.55 hrs
Date: 2020-12-30 03:22:31 UTC
Version: 0.41-SNAPSHOT
SHA: 37353c915cb053265e732a34fd109fb231e234a8
GIT_DATE: 2020-12-22T17:18:47Z
java.runtime.version: 14+36
java.vendor: Azul Systems, Inc.
os.name+arch: Linux amd64
os.version: 4.15.0-128-generic
Flagged:
43.7% org.ejml.dense.row.BenchmarkCommonOps_ZDRM.csv:multAddTransB_alpha:1000
174.4% org.ejml.dense.row.BenchmarkCommonOps_ZDRM.csv:multAddTransA_alpha:1000
143.2% org.ejml.dense.row.BenchmarkCommonOps_DDRM.csv:scale_sA:5
70.1% org.ejml.dense.row.BenchmarkCommonOps_DDRM.csv:elementDiv_AB:5
56.2% org.ejml.dense.row.BenchmarkCommonOps_MT_DDRM.csv:transpose:5
52.8% org.ejml.dense.row.BenchmarkCommonOps_MT_DDRM.csv:multTransAB_sAAA:5
69.5% org.ejml.dense.row.decomposition.decompose.BenchmarkDecompositionCholesky_MT_DDRM.csv:block:100
@FlorentinD @szarnyasg @breandan Any thoughts on improving repeatability? My current theory is posted in the comment above.
I think it's easy to read. The numbers after the benchmark names are parameters, right? Maybe you could add the error rate for the flagged benchmarks to make it easier to spot flaky ones.
Unfortunately it can be very consistent within a single batch. It's currently configured to have 2 warm up runs, then 3 test runs. All 3 test runs might be within 5% of each other, then the next time the benchmark runs the test runs might be 2x slower, but still within 5% of each other. Adding error will be interesting information, but it's not going to solve this problem.
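For reference, that run configuration corresponds roughly to the following JMH annotations (the benchmark body here is just a placeholder, not one of the EJML benchmarks):

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 2)      // 2 warm up runs
@Measurement(iterations = 3) // 3 test runs, averaged
@Fork(1)
@State(Scope.Benchmark)
public class BenchmarkExample {
    @Benchmark
    public double work() {
        double sum = 0;
        for (int i = 0; i < 1_000; i++)
            sum += Math.sqrt(i);
        return sum; // return the result so dead-code elimination can't remove the work
    }
}
```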
Good question. I don't think it's possible to run meaningful microbenchmarks on laptop hardware (including NUCs) when the performance differences between commits are small, as it's basically impossible to observe changes of a few percent with so much noise. Cloud virtual machines are not much better. To get reasonably stable results, one would need to use workstation hardware or bare-metal servers.
This seems to be a problem that others have stumbled upon, but I could not yet find a post on this. Loosely related posts:
Thanks for the links. I should mention (and add to the output file) that the current threshold is a 40% change, using either the current or baseline time as the divisor. The flagged operations do seem to be the faster operations. I think I've got JMH configured to run for a minimum of 1 second as it computes the average.
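In other words, a benchmark is flagged when the relative change exceeds 40% with either time as the divisor, which keeps the check symmetric for speedups and slowdowns. A sketch of that rule (not the actual script):

```java
/** Flag when current and baseline differ by more than 40% of either value. */
static boolean flagged(double baseline, double current) {
    double diff = Math.abs(current - baseline);
    return diff / baseline > 0.40 || diff / current > 0.40;
}
```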
Starting to look like this is an interesting problem, which is unfortunate since I was hoping it would be boring. One of those articles suggested using the minimum instead of the mean and pointed out that the distribution is likely to be log-normal, not normal. I added a summary statistic that samples the distribution, and it looks like around 85% of the tests are within 5% tolerance and 50% are within 0.7%.
I'm going to try switching to the min value and re-running flagged tests several times, until they either pass or there is sufficient evidence that something has changed (see the sketch below).
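A sketch of what that retry logic could look like, reusing the `flagged()` rule above; the `DoubleSupplier` that re-runs a benchmark and returns its minimum sample time is a hypothetical stand-in:

```java
import java.util.function.DoubleSupplier;

public class RetryCheck {
    /** Flag when the change exceeds 40% using either time as the divisor. */
    static boolean flagged(double baseline, double current) {
        double diff = Math.abs(current - baseline);
        return diff / baseline > 0.40 || diff / current > 0.40;
    }

    /**
     * Re-runs a flagged benchmark up to maxTrials times. "run" is a hypothetical
     * stand-in that runs one batch and returns the minimum sample time. The test
     * is only reported as a regression if every re-run stays above the threshold.
     */
    static boolean confirmRegression(DoubleSupplier run, double baseline, int maxTrials) {
        for (int trial = 0; trial < maxTrials; trial++) {
            if (!flagged(baseline, run.getAsDouble()))
                return false; // passed once; treat the earlier failure as noise
        }
        return true; // failed every re-run: sufficient evidence something changed
    }
}
```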
I've made a bit of progress on this front. See https://github.com/lessthanoptimal/ejml/pull/125 for details. I've ordered a computer with a Xeon and maybe that will help. Until then I'll continue abusing this NUC.
Finished this a while ago. Been working nicely ever since.