Negative time value in analysis
It shouldn't really happen, but some people are seeing negative time values in reports, e.g.:
```
benchmarked xxx
time                 136.9 ms   (-340.9 ms .. 511.1 ms)
                     0.063 R²   (0.000 R² .. 0.999 R²)
mean                 1.010 s    (425.0 ms .. 3.335 s)
std dev              1.833 s    (12.24 ms .. 3.048 s)
```
It looks like this is an analysis problem and not a measurement problem. I saw this:
```
analysing with 1000 resamples
bootstrapping with 3 samples
benchmarked compose/mapM/list-transformer
time                 857.6 ms   (-7.8041e15 s .. 3.9813e15 s)
                     1.000 R²   (-Infinity R² .. 1.000 R²)
mean                 874.2 ms   (866.4 ms .. 880.1 ms)
std dev              8.968 ms   (0.0 s .. 10.16 ms)
variance introduced by outliers: 22% (moderately inflated)
iters                2          (1 .. 3)
time                 874.2 ms   (866.4 ms .. 884.0 ms)
cycles               1918853623 (1901739749 .. 1940363236)
cpuTime              873.9 ms   (866.3 ms .. 884.1 ms)
maxrss               3548501    (3543040 .. 3551232)
nivcsw               163        (150 .. 179)
allocated            5551622661 (2775661592 .. 8327940760)
numGcs               2663       (2663 .. 2663)
bytesCopied          1016694    (1016411 .. 1017208)
mutatorWallSeconds   862.5 ms   (855.3 ms .. 872.1 ms)
mutatorCpuSeconds    862.5 ms   (855.3 ms .. 872.1 ms)
gcWallSeconds        5.659 ms   (5.539 ms .. 5.724 ms)
gcCpuSeconds         7.067 ms   (6.926 ms .. 7.143 ms)
```
The measured values of time look fine, and the mean is in line with them. However, the time and R² reported by the analysis are anomalous: look at that range, both the min and the max are totally out of whack (-7.8041e15 s .. 3.9813e15 s). I am tempted to take a look at this but am holding off because of other pressing work.
My guess is that it is due to an insufficient number of samples. To confirm the hypothesis, I looked at the benchmarks where this problem was seen: all of them are ones where the benchmark took so long that the total number of samples was reduced to 2-3.
Confirmed: the problem is due to the number of samples. You can reproduce it by passing `--time-limit 0 --min-samples 3`. With 3 or fewer samples you see strange numbers; with 4 samples a NaN appears; with 5 samples there are no huge numbers, but the resampled range still falls outside the normally measured range. Only when the number of samples is sufficiently large, e.g. 10, do the numbers all look close to normal. We need to understand how the statistical analysis works.
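For intuition, here is a toy sketch of the kind of bootstrap the analysis performs (this is not gauge's actual code, and the sample values are made up): resample the (iterations, time) points with replacement and refit an ordinary-least-squares slope each time. With only 3 samples, some resamples collapse to a single repeated point, the slope becomes 0/0 = NaN, and percentiles taken over such resamples produce garbage bounds like the ones above.

```haskell
-- Toy illustration, not gauge's implementation; data values are made up.

-- Ordinary least-squares slope of time against iteration count.
olsSlope :: [(Double, Double)] -> Double
olsSlope pts = sxy / sxx
  where
    n   = fromIntegral (length pts)
    mx  = sum (map fst pts) / n
    my  = sum (map snd pts) / n
    sxx = sum [ (x - mx) * (x - mx) | (x, _) <- pts ]
    sxy = sum [ (x - mx) * (y - my) | (x, y) <- pts ]

main :: IO ()
main = do
  -- Three (iterations, seconds) measurements, roughly 0.87 s/iter.
  let samples = [(1, 0.87), (2, 1.76), (3, 2.61)]
      -- All 27 possible bootstrap resamples of size 3.
      resamples = [ map (samples !!) [i, j, k]
                  | i <- [0..2], j <- [0..2], k <- [0..2] ]
  -- The 3 resamples that repeat a single point have zero x-variance,
  -- so the slope is 0/0 = NaN and the percentile bounds go haywire.
  mapM_ (print . olsSlope) resamples
```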
Without understanding the statistical analysis deeply or fixing it, a cheap and quick way to mitigate this is to warn, when the number of samples is less than 8, that the sample count is insufficient and the analysis may not be correct.
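A minimal sketch of such a check; the threshold, names, and message below are placeholders, not gauge's actual code:

```haskell
-- Hypothetical guard, not gauge's API; the threshold and wording are
-- placeholders for whatever constant we settle on.
minSamplesForAnalysis :: Int
minSamplesForAnalysis = 8

warnIfTooFewSamples :: Int -> IO ()
warnIfTooFewSamples n
  | n < minSamplesForAnalysis =
      putStrLn $ "Warning: only " ++ show n ++ " samples collected; "
              ++ "the regression estimates (time, R²) may be unreliable."
  | otherwise = pure ()
```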
The R² coefficient measures the goodness of fit of the linear regression, and in the limit of few samples it is basically meaningless: any model yields high-variance predictions when there are too few samples relative to the model's complexity. There are two ways forward as I see it: warn the user about the limitations of the model, or implement a more realistic model that takes the actual data distribution into account.
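To make that concrete, here is a toy example (made-up numbers, nothing from gauge): with only two points an OLS line interpolates exactly, so R² is 1 no matter how noisy the data is; a third noisy point makes it collapse.

```haskell
-- R² of a least-squares line: 1 - residual SS / total SS.
-- Toy data only, to show R² is uninformative with too few points.
rSquared :: [(Double, Double)] -> Double
rSquared pts = 1 - ssRes / ssTot
  where
    n   = fromIntegral (length pts)
    mx  = sum (map fst pts) / n
    my  = sum (map snd pts) / n
    sxx = sum [ (x - mx) * (x - mx) | (x, _) <- pts ]
    slope     = sum [ (x - mx) * (y - my) | (x, y) <- pts ] / sxx
    intercept = my - slope * mx
    ssRes = sum [ (y - (slope * x + intercept)) ^ 2 | (x, y) <- pts ]
    ssTot = sum [ (y - my) ^ 2 | (_, y) <- pts ]

main :: IO ()
main = do
  print (rSquared [(1, 3.0), (2, 17.0)])            -- 1.0: a line always fits 2 points
  print (rSquared [(1, 3.0), (2, 17.0), (3, 4.0)])  -- ~0.004: collapses once n > 2
```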
Thanks for the insight @ocramz! A warning looks like the way to go for now. Tweaking the model, and especially getting it right and stable, may be a bigger effort, and someone has to take that up.
We need to decide on the constant for the warning: how many samples are sufficient? Currently the default limit we have set in the measurements is 10 samples; that is the maximum we get by default. I wonder whether that itself is low.
The R² coefficient improves quite a bit if I increase the sample duration from the default 5 seconds to 10 seconds:
```
cutlass:/vol/hosts/cueball/workspace/github/hs-gauge (help)$ stack bench --benchmark-arguments "--min-bench-duration 1"
gauge-0.2.1: benchmarks
Running 1 benchmarks...
Benchmark self: RUNNING...
benchmarked identity
time                 56.14 ns   (46.32 ns .. 72.36 ns)
                     0.907 R²   (0.737 R² .. 0.995 R²)
mean                 53.15 ns   (51.93 ns .. 55.03 ns)
std dev              2.619 ns   (788.9 ps .. 3.458 ns)
benchmarked slow
time                 160.4 ns   (116.2 ns .. 219.6 ns)
                     0.852 R²   (0.710 R² .. 0.963 R²)
mean                 192.2 ns   (186.2 ns .. 200.6 ns)
std dev              11.87 ns   (8.599 ns .. 16.05 ns)
variance introduced by outliers: 18% (moderately inflated)
Benchmark self: FINISH
cutlass:/vol/hosts/cueball/workspace/github/hs-gauge (help)$ stack bench --benchmark-arguments "--min-bench-duration 10"
gauge-0.2.1: benchmarks
Running 1 benchmarks...
Benchmark self: RUNNING...
benchmarked identity
time                 54.46 ns   (53.31 ns .. 55.73 ns)
                     0.997 R²   (0.995 R² .. 0.999 R²)
mean                 54.28 ns   (53.67 ns .. 54.95 ns)
std dev              2.562 ns   (2.166 ns .. 3.036 ns)
variance introduced by outliers: 33% (moderately inflated)
benchmarked slow
time                 184.5 ns   (182.0 ns .. 187.6 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 184.8 ns   (183.1 ns .. 187.7 ns)
std dev              8.629 ns   (5.787 ns .. 15.28 ns)
variance introduced by outliers: 33% (moderately inflated)
Benchmark self: FINISH
```
Should we consider increasing the default duration a bit more? It will make measurement slower, though.
Sorry, it was 1 sec vs 10 sec and not the default (5 sec) vs 10 sec. I did a few more experiments, and it seems that down to a sample duration of 2 seconds the R² is OK; below that it degrades rapidly in this particular benchmark.
This seems to happen also with larger sample sizes when variance is high:
```
benchmarking difference-disj_tn_swap ... took 22.83 s, total 420091 iterations
benchmarked difference-disj_tn_swap
time                 -297.0 ns  (-2.343 μs .. 558.6 ns)
                     0.019 R²   (0.000 R² .. 0.700 R²)
mean                 2.823 μs   (2.307 μs .. 3.623 μs)
std dev              1.184 μs   (715.8 ns .. 1.452 μs)
variance introduced by outliers: 89% (severely inflated)
```
So just increasing the sample time will not guarantee that this doesn't happen.
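A toy sketch of why (again, made-up data and seeds, not gauge's code): when the per-sample noise dwarfs the per-iteration cost, the bootstrap distribution of the OLS slope typically straddles zero no matter how many samples there are, mirroring the negative time bounds above. A tiny deterministic LCG stands in for real noise so the example needs no extra packages.

```haskell
-- Toy sketch, not gauge's code; data, seeds, and constants are made up.

-- Ordinary least-squares slope of y against x.
olsSlope :: [(Double, Double)] -> Double
olsSlope pts = sxy / sxx
  where
    n   = fromIntegral (length pts)
    mx  = sum (map fst pts) / n
    my  = sum (map snd pts) / n
    sxx = sum [ (x - mx) * (x - mx) | (x, _) <- pts ]
    sxy = sum [ (x - mx) * (y - my) | (x, y) <- pts ]

-- Deterministic pseudo-random stream in [0,1), to avoid extra deps.
lcg :: Int -> [Double]
lcg seed = map toUnit (tail (iterate step seed))
  where
    step s   = 6364136223846793005 * s + 1442695040888963407
    toUnit s = fromIntegral (s `mod` 1000000) / 1000000

main :: IO ()
main = do
  let xs = map fromIntegral [1 .. 20 :: Int]
      -- True cost: 1 unit per iteration, buried under +/-50 of noise.
      ys = zipWith (\x u -> x + 100 * (u - 0.5)) xs (lcg 42)
      samples = zip xs ys
      n = length samples
      -- 1000 bootstrap resamples of size n, indices driven by the LCG.
      idxs = map (\u -> floor (u * fromIntegral n)) (lcg 7)
      resample k = map (samples !!) (take n (drop (k * n) idxs))
      slopes = [ olsSlope (resample k) | k <- [0 .. 999 :: Int] ]
  -- The spread of slopes typically includes negative values.
  putStrLn $ "min slope: " ++ show (minimum slopes)
  putStrLn $ "max slope: " ++ show (maximum slopes)
```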
Just noticed the same thing! Also with a function taking only a very small amount of time. I just switched reflex to use gauge, and one of the benchmarks gave this:
```
benchmarked micro/subscribeMerge(10000)
time                 -322.9 ns  (-1.290 μs .. 308.3 ns)
                     0.053 R²   (0.000 R² .. 0.519 R²)
mean                 12.38 μs   (5.727 μs .. 22.08 μs)
std dev              16.73 μs   (10.19 μs .. 21.62 μs)
variance introduced by outliers: 94% (severely inflated)
```