Negative time value in analysis
It shouldn't really happen, but some people are seeing negative time values in reports, e.g.:
```
benchmarked xxx
time                 136.9 ms   (-340.9 ms .. 511.1 ms)
                     0.063 R²   (0.000 R² .. 0.999 R²)
mean                 1.010 s    (425.0 ms .. 3.335 s)
std dev              1.833 s    (12.24 ms .. 3.048 s)
```
It looks like this is an analysis problem and not a measurement problem. I saw this:
```
analysing with 1000 resamples
bootstrapping with 3 samples
benchmarked compose/mapM/list-transformer
time                 857.6 ms   (-7.8041e15 s .. 3.9813e15 s)
                     1.000 R²   (-Infinity R² .. 1.000 R²)
mean                 874.2 ms   (866.4 ms .. 880.1 ms)
std dev              8.968 ms   (0.0 s .. 10.16 ms)
variance introduced by outliers: 22% (moderately inflated)
iters                2          (1 .. 3)
time                 874.2 ms   (866.4 ms .. 884.0 ms)
cycles               1918853623 (1901739749 .. 1940363236)
cpuTime              873.9 ms   (866.3 ms .. 884.1 ms)
maxrss               3548501    (3543040 .. 3551232)
nivcsw               163        (150 .. 179)
allocated            5551622661 (2775661592 .. 8327940760)
numGcs               2663       (2663 .. 2663)
bytesCopied          1016694    (1016411 .. 1017208)
mutatorWallSeconds   862.5 ms   (855.3 ms .. 872.1 ms)
mutatorCpuSeconds    862.5 ms   (855.3 ms .. 872.1 ms)
gcWallSeconds        5.659 ms   (5.539 ms .. 5.724 ms)
gcCpuSeconds         7.067 ms   (6.926 ms .. 7.143 ms)
```
The measured values of time look fine, and the mean is in line with them. However, the time and R² reported by the analysis are anomalous: look at that range, both the min and the max are totally out of whack (-7.8041e15 s .. 3.9813e15 s). I am tempted to take a look at this but am holding off because of other pressing work.
My guess is that it is due to an insufficient number of samples. To confirm the hypothesis, I looked at the benchmarks where this problem was seen: all of them are ones where the benchmark took so long that the total number of samples was reduced to 2-3.
Confirmed: the problem is due to the number of samples. You can reproduce it by passing `--time-limit 0 --min-samples 3`. With 3 or fewer samples you see strange numbers; with 4 samples a NaN appears; with 5 samples there are no huge numbers, but the resampled range still falls outside the normally measured range. Only when the number of samples is sufficiently large, e.g. 10, do the numbers all look close to normal. We need to understand how the statistical analysis works.
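For intuition, here is a toy sketch of the kind of bootstrap the analysis performs (this is not gauge's actual code, and the sample values are made up): resample the (iterations, time) points with replacement and refit an ordinary-least-squares slope each time. With only 3 samples, some resamples collapse to a single repeated point, the slope becomes 0/0 = NaN, and percentiles taken over such resamples produce garbage bounds like the ones above.

```haskell
-- Toy illustration, not gauge's implementation; data values are made up.

-- Ordinary least-squares slope of time against iteration count.
olsSlope :: [(Double, Double)] -> Double
olsSlope pts = sxy / sxx
  where
    n   = fromIntegral (length pts)
    mx  = sum (map fst pts) / n
    my  = sum (map snd pts) / n
    sxx = sum [ (x - mx) * (x - mx) | (x, _) <- pts ]
    sxy = sum [ (x - mx) * (y - my) | (x, y) <- pts ]

main :: IO ()
main = do
  -- Three (iterations, seconds) measurements, roughly 0.87 s/iter.
  let samples = [(1, 0.87), (2, 1.76), (3, 2.61)]
      -- All 27 possible bootstrap resamples of size 3.
      resamples = [ map (samples !!) [i, j, k]
                  | i <- [0..2], j <- [0..2], k <- [0..2] ]
  -- The 3 resamples that repeat a single point have zero x-variance,
  -- so the slope is 0/0 = NaN and the percentile bounds go haywire.
  mapM_ (print . olsSlope) resamples
```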
Without understanding the statistical analysis deeply or fixing it, a cheap and quick way to mitigate this is to warn, when the number of samples is less than 8, that the sample count is insufficient and the analysis may not be correct.
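A minimal sketch of such a check; the threshold, names, and message below are placeholders, not gauge's actual code:

```haskell
-- Hypothetical guard, not gauge's API; the threshold and wording are
-- placeholders for whatever constant we settle on.
minSamplesForAnalysis :: Int
minSamplesForAnalysis = 8

warnIfTooFewSamples :: Int -> IO ()
warnIfTooFewSamples n
  | n < minSamplesForAnalysis =
      putStrLn $ "Warning: only " ++ show n ++ " samples collected; "
              ++ "the regression estimates (time, R²) may be unreliable."
  | otherwise = pure ()
```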
The R² coefficient measures the goodness of fit of the linear regression, and in the limit of few samples it is basically meaningless: any model yields high-variance predictions when there are too few samples relative to the model's complexity. There are two ways forward as I see it: warn the user about the limitations of the model, or implement a more realistic model that takes the actual data distribution into account.
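To make that concrete, here is a toy example (made-up numbers, nothing from gauge): with only two points an OLS line interpolates exactly, so R² is 1 no matter how noisy the data is; a third noisy point makes it collapse.

```haskell
-- R² of a least-squares line: 1 - residual SS / total SS.
-- Toy data only, to show R² is uninformative with too few points.
rSquared :: [(Double, Double)] -> Double
rSquared pts = 1 - ssRes / ssTot
  where
    n   = fromIntegral (length pts)
    mx  = sum (map fst pts) / n
    my  = sum (map snd pts) / n
    sxx = sum [ (x - mx) * (x - mx) | (x, _) <- pts ]
    slope     = sum [ (x - mx) * (y - my) | (x, y) <- pts ] / sxx
    intercept = my - slope * mx
    ssRes = sum [ (y - (slope * x + intercept)) ^ 2 | (x, y) <- pts ]
    ssTot = sum [ (y - my) ^ 2 | (_, y) <- pts ]

main :: IO ()
main = do
  print (rSquared [(1, 3.0), (2, 17.0)])            -- 1.0: a line always fits 2 points
  print (rSquared [(1, 3.0), (2, 17.0), (3, 4.0)])  -- ~0.004: collapses once n > 2
```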
Thanks for the insight @ocramz! A warning looks like the way to go for now. Tweaking the model, and especially getting it right and stable, may be a bigger effort, and someone has to take that up.
We need to decide on the constant for the warning: how many samples are sufficient? Currently the default limit we have set in the measurements is 10 samples; that is the maximum we get by default. I wonder whether that itself is low.
The R² coefficient improves quite a bit if I increase the sample duration from the default 5 seconds to 10 seconds:
```
cutlass:/vol/hosts/cueball/workspace/github/hs-gauge (help)$ stack bench --benchmark-arguments "--min-bench-duration 1"
gauge-0.2.1: benchmarks
Running 1 benchmarks...
Benchmark self: RUNNING...
benchmarked identity
time                 56.14 ns   (46.32 ns .. 72.36 ns)
                     0.907 R²   (0.737 R² .. 0.995 R²)
mean                 53.15 ns   (51.93 ns .. 55.03 ns)
std dev              2.619 ns   (788.9 ps .. 3.458 ns)
benchmarked slow
time                 160.4 ns   (116.2 ns .. 219.6 ns)
                     0.852 R²   (0.710 R² .. 0.963 R²)
mean                 192.2 ns   (186.2 ns .. 200.6 ns)
std dev              11.87 ns   (8.599 ns .. 16.05 ns)
variance introduced by outliers: 18% (moderately inflated)
Benchmark self: FINISH
cutlass:/vol/hosts/cueball/workspace/github/hs-gauge (help)$ stack bench --benchmark-arguments "--min-bench-duration 10"
gauge-0.2.1: benchmarks
Running 1 benchmarks...
Benchmark self: RUNNING...
benchmarked identity
time                 54.46 ns   (53.31 ns .. 55.73 ns)
                     0.997 R²   (0.995 R² .. 0.999 R²)
mean                 54.28 ns   (53.67 ns .. 54.95 ns)
std dev              2.562 ns   (2.166 ns .. 3.036 ns)
variance introduced by outliers: 33% (moderately inflated)
benchmarked slow
time                 184.5 ns   (182.0 ns .. 187.6 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 184.8 ns   (183.1 ns .. 187.7 ns)
std dev              8.629 ns   (5.787 ns .. 15.28 ns)
variance introduced by outliers: 33% (moderately inflated)
Benchmark self: FINISH
```
Should we consider increasing the default duration a bit more? It will make measurement slower, though.
Sorry, it was 1 sec vs 10 sec and not the default (5 sec) vs 10 sec. I did a few more experiments, and it seems that down to a sample duration of 2 seconds the R² is OK; below that it degrades rapidly in this particular benchmark.
This seems to happen also with larger sample sizes when variance is high:
```
benchmarking difference-disj_tn_swap ... took 22.83 s, total 420091 iterations
benchmarked difference-disj_tn_swap
time                 -297.0 ns  (-2.343 μs .. 558.6 ns)
                     0.019 R²   (0.000 R² .. 0.700 R²)
mean                 2.823 μs   (2.307 μs .. 3.623 μs)
std dev              1.184 μs   (715.8 ns .. 1.452 μs)
variance introduced by outliers: 89% (severely inflated)
```
So just increasing the sample time will not guarantee that this doesn't happen.
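A toy sketch of why (again, made-up data and seeds, not gauge's code): when the per-sample noise dwarfs the per-iteration cost, the bootstrap distribution of the OLS slope typically straddles zero no matter how many samples there are, mirroring the negative time bounds above. A tiny deterministic LCG stands in for real noise so the example needs no extra packages.

```haskell
-- Toy sketch, not gauge's code; data, seeds, and constants are made up.

-- Ordinary least-squares slope of y against x.
olsSlope :: [(Double, Double)] -> Double
olsSlope pts = sxy / sxx
  where
    n   = fromIntegral (length pts)
    mx  = sum (map fst pts) / n
    my  = sum (map snd pts) / n
    sxx = sum [ (x - mx) * (x - mx) | (x, _) <- pts ]
    sxy = sum [ (x - mx) * (y - my) | (x, y) <- pts ]

-- Deterministic pseudo-random stream in [0,1), to avoid extra deps.
lcg :: Int -> [Double]
lcg seed = map toUnit (tail (iterate step seed))
  where
    step s   = 6364136223846793005 * s + 1442695040888963407
    toUnit s = fromIntegral (s `mod` 1000000) / 1000000

main :: IO ()
main = do
  let xs = map fromIntegral [1 .. 20 :: Int]
      -- True cost: 1 unit per iteration, buried under +/-50 of noise.
      ys = zipWith (\x u -> x + 100 * (u - 0.5)) xs (lcg 42)
      samples = zip xs ys
      n = length samples
      -- 1000 bootstrap resamples of size n, indices driven by the LCG.
      idxs = map (\u -> floor (u * fromIntegral n)) (lcg 7)
      resample k = map (samples !!) (take n (drop (k * n) idxs))
      slopes = [ olsSlope (resample k) | k <- [0 .. 999 :: Int] ]
  -- The spread of slopes typically includes negative values.
  putStrLn $ "min slope: " ++ show (minimum slopes)
  putStrLn $ "max slope: " ++ show (maximum slopes)
```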
Just noticed the same thing! Also with a function taking only a very small amount of time. I just switched reflex to use gauge, and one of the benchmarks gave this:
```
benchmarked micro/subscribeMerge(10000)
time                 -322.9 ns  (-1.290 μs .. 308.3 ns)
                     0.053 R²   (0.000 R² .. 0.519 R²)
mean                 12.38 μs   (5.727 μs .. 22.08 μs)
std dev              16.73 μs   (10.19 μs .. 21.62 μs)
variance introduced by outliers: 94% (severely inflated)
```