BenchmarkTools.jl

More copy-able result for `@benchmark`

Open Moelf opened this issue 4 years ago • 19 comments

Sometimes users want to copy a result for comparison: @btime gives too little information, but the current @benchmark output is unfriendly for that.

One proposal is to change now:

julia> @benchmark sin(3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.301 ns … 15.216 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.384 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.456 ns ±  0.294 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        █     ▅                                           ▂   
  ▅▃▂▁▁▃█▆▃▂▂▄█▄▃▂▁▂▄▃▂▂▁▁▃▆▃▂▁▂▁▂▅▃▂▁▁▁▂▄▄▂▂▂▁▁▂▄▃▃▂▁▁▁▁▅█▄ ▃
  1.3 ns         Histogram: frequency by time        1.65 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

to

julia> @benchmark sin(3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.301 ns … 15.216 ns 
 Time  (median):     1.384 ns              
 Time  (mean ± σ):   1.456 ns ±  0.294 ns
 -----------------------------------------
 GC (min … max): 0.00% … 0.00%
 GC (median):    0.00%
 GC (mean ± σ):  0.00% ± 0.00%

        █     ▅                                           ▂   
  ▅▃▂▁▁▃█▆▃▂▂▄█▄▃▂▁▂▄▃▂▂▁▁▃▆▃▂▁▂▁▂▅▃▂▁▁▁▂▄▄▂▂▂▁▁▂▄▃▃▂▁▁▁▁▅█▄ ▃
  1.3 ns         Histogram: frequency by time        1.65 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

so that people can copy the first 3-4 lines without sending really long, wrapped lines.

Besides, I think it's much easier to navigate vertically anyway.

Moelf avatar Jul 15 '21 13:07 Moelf

@tecosaur any easy improvements we could do?

Maybe add a compact variant of show?
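A minimal sketch of what such a variant could print, assuming the `times`, `memory`, and `allocs` fields of `BenchmarkTools.Trial` (illustrative only, not the package's actual `show` method):

```julia
using Statistics: mean, median
using Printf: @sprintf

# Illustrative sketch: a compact, copy-friendly one-block summary.
# `times` is a vector of per-sample times in nanoseconds, as in
# BenchmarkTools.Trial; memory/allocs are the usual estimates.
function show_compact(io::IO, times::Vector{Float64}, memory::Int, allocs::Int)
    println(io, "BenchmarkTools.Trial:")
    println(io, " Time: min ", @sprintf("%.3f", minimum(times)),
            " ns, median ", @sprintf("%.3f", median(times)),
            " ns, mean ", @sprintf("%.3f", mean(times)), " ns")
    println(io, " Memory: ", allocs, " allocations, ", memory, " bytes")
end

show_compact(stdout, [1.3, 1.4, 1.5], 0, 0)
```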

vchuravy avatar Aug 13 '21 08:08 vchuravy

I can see the value in this; it's just a matter of how best to do it. Unfortunately, I've been rather busy the past few weeks, and will be for the immediate future, so you'll see a lot more of me commenting than me doing :stuck_out_tongue:.

tecosaur avatar Aug 25 '21 05:08 tecosaur

Let me know what the desired syntax/behavior should be and I can give it a try.

Moelf avatar Aug 25 '21 10:08 Moelf

Addition: I think Memory estimate: 0 bytes, allocs estimate: 0. should also go to the top, where it can be copied together.

Moelf avatar Sep 10 '21 12:09 Moelf

I suspect we could remove a lot of details without losing much. Has anyone ever got any value out of median GC percentage (whatever that is)? Putting the most important information first, you could do for instance:

julia> @benchmark sin(3)
BenchmarkTools.Trial:
 Time: min 1.301 ns, mean 1.456 ns, max 15.216 ns 
 Memory: 0 allocations, 0 bytes
 GC: min 0.0 ns (0.0%), max 0.0 ns (0.0%)
 10_000 samples with 1_000 evaluations:

        █     ▅                                           ▂   
  ▅▃▂▁▁▃█▆▃▂▂▄█▄▃▂▁▂▄▃▂▂▁▁▃▆▃▂▁▂▁▂▅▃▂▁▁▁▂▄▄▂▂▂▁▁▂▄▃▃▂▁▁▁▁▅█▄ ▃
  1.3 ns                                             1.65 ns <

And perhaps "0 bytes" / "GC: min 0.0 ns ..." can be omitted entirely when these are all zero, without being a confusing change of layout.

Since nobody runs just one benchmark, the case we should design for is having several on screen -- so being compact, and having all the information in one block, seem desirable.

The fact that the histogram auto-scales to look nice for one run works against this, e.g. in #237. While it's a bigger change than moving text around, a smart suggestion (Edit -- from https://github.com/JuliaCI/BenchmarkTools.jl/issues/253) was that perhaps the histogram should always have 0 at the left and a power of 10 on the right. Then (very often) they would be comparable between different functions.

mcabbott avatar Sep 10 '21 13:09 mcabbott

I don't disagree with the concerns, but for the proposed solutions I have a couple of counterpoints:

  • yes, I do like to paste benchmark results, and yes the oddities (esp. the vertical break) in the histogram are a bit weird. But even more, I want it to be useful when it's just me and my own laptop. Vertical space is more important now that we can align the horizontal axis across runs (#255 )
  • If we're slimming down what we print, we should still include some measure of variability, because interpreting differences (e.g., between two implementations) is always dicey if you have no measure of variability. Unfortunately, the total range (min..max) is almost useless. While standard deviation is much better than total range, it's still quite heavily influenced by outliers because each deviation is squared before averaging. Mean absolute deviation is more robust, and naturally pairs with the median (https://stats.stackexchange.com/a/7312). Hence I'd propose that the one required element of our textual summary should be median ± mean absolute deviation.

timholy avatar Sep 18 '21 09:09 timholy

Here's a real-world example of just how bad the standard deviation is for these purposes:

julia> show(ioctxfalse, MIME("text/plain"), b)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.860 μs … 764.893 μs  ┊ GC (min … max): 0.00% … 93.99%
 Time  (median):     19.471 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.375 μs ±  12.209 μs  ┊ GC (mean ± σ):  0.99% ±  1.65%

   █▃ ▃▁                                                        
  ▇██▆██▄▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  18.9 μs         Histogram: frequency by time         28.2 μs <

 Memory estimate: 15.84 KiB, allocs estimate: 4.
julia> meanad(x, med=median(x)) = mean(abs, x .- med)
meanad (generic function with 2 methods)

julia> meanad(b.times)    # in nanoseconds
1224.1091

julia> std(b.times)
12209.450625612251

I think anyone looking at that histogram would agree that 1 μs is much closer to the typical variability than 12 μs:

julia> sum(b.times .< median(b.times) + meanad(b.times)) / length(b.times)
0.8379

since 84% of the samples are less than the median + 1 mean absolute deviation (20.7 μs).

timholy avatar Sep 18 '21 11:09 timholy

In defence of brevity, min 18.860 μs, mean 20.375 μs seems to do a fair job of capturing the "about 1 μs wide" sentiment here. And since the left tail is usually truncated like this, it's hard to get extreme outliers there.

Unlike the right tail, of course. The max might have separately nice-to-know information, but (IMO) this right tail means the standard deviation is often useless.

mcabbott avatar Sep 18 '21 11:09 mcabbott

The mean can be way, way off too. min and median might be a fine choice, though.

timholy avatar Sep 18 '21 14:09 timholy

I don't mind showing more numbers, it's just trying editing this:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.860 μs … 764.893 μs  ┊ GC (min … max): 0.00% … 93.99%
 Time  (median):     19.471 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.375 μs ±  12.209 μs  ┊ GC (mean ± σ):  0.99% ±  1.65%

   █▃ ▃▁                                                        
  ▇██▆██▄▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  18.9 μs         Histogram: frequency by time         28.2 μs <

 Memory estimate: 15.84 KiB, allocs estimate: 4.

into

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.860 μs … 764.893 μs
 Time  (median):     19.471 μs              
 Time  (mean ± σ):   20.375 μs ±  12.209 μs
 Memory estimate: 15.84 KiB, allocs estimate: 4.

takes way too many mouse clicks right now. Try pasting the above example three times into a 55-character-wide Slack thread panel.

When there's no color encoding, 15 numbers plus histogram blocks wrapped across lines make it just way too hard to spot the key information.

Moelf avatar Sep 18 '21 14:09 Moelf

In a chat now lost, I was persuaded by someone that a fairly common situation is to have (say) 90% of samples almost at the minimum, and 10% a bit higher when GC kicks in (plus perhaps a high maximum). In this kind of distribution the median is less useful than the mean. (Edit: https://github.com/JuliaCI/BenchmarkTools.jl/issues/237#issuecomment-922257927 is a nice example.)

The reason to print the histogram is that there's no one or two perfect numbers that are always what you should watch. But 5 (or in fact 7) times seems like a lot; it's hard to scan the page and remember where to look. Something like min/mean/max or min/median/max, which has an obvious ordering, would I think make it much easier to find the one you want; I guess min/median/mean/max isn't too bad either.

5 different GC numbers (of uncertain meaning) is way too many. Not printing them at all when they are all zero (which is common) would be great.
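Skipping them could be as simple as checking whether any sample triggered a collection at all; a sketch (the argument names mirror the `times` and `gctimes` fields of `BenchmarkTools.Trial`, but this is not the package's code):

```julia
using Statistics: mean

# Sketch: print a GC summary line only when at least one sample
# actually spent time in garbage collection; otherwise stay silent.
function maybe_print_gc(io::IO, times::Vector{Float64}, gctimes::Vector{Float64})
    any(!iszero, gctimes) || return
    pct = round(100 * mean(gctimes) / mean(times); digits = 2)
    println(io, " GC time: mean ", round(mean(gctimes); digits = 1), " ns (", pct, "%)")
end

maybe_print_gc(stdout, [100.0, 100.0], [0.0, 0.0])  # prints nothing
```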

mcabbott avatar Sep 18 '21 15:09 mcabbott

Fair enough, if we're showing min and one other number, then mean is a better choice than median. I'd still guess that median and mean absolute deviation would be even more informative, but I'm too lazy to really check by e.g., computing the mutual information with the histogram itself. I'm fine with min and mean.

timholy avatar Sep 19 '21 08:09 timholy

Here's an attempt... with https://github.com/JuliaCI/BenchmarkTools.jl/issues/237#issuecomment-922498262 's suggestion of ~~95th~~ 99th percentile instead of max.

  • I think putting the times first, and in order, makes them easier to find.
  • The 99th percentile was already the right hand limit of the histogram, now less of a mystery? I've rounded these more aggressively too.
  • I added markers to the highlighted bars of the histogram.
  • It prints a few GC times, when there are allocations. Maybe the right ones? Hopefully clear what the percentages are?
[Screenshot of the proposed @benchmark display]

When copied:

julia> @benchmark sum(log.(x))  setup=(x=rand(1000))
BenchmarkTools.Trial:
│  minimum 3.994 μs, median 4.607 μs (½), mean 5.036 μs (*), 99ᵗʰ 6.423 μs
│  1 allocations, 7.94 KiB
│  GC time: mean 379.234 ns (7.53%), maximum 345.470 μs (98.25%)
│  10_000 samples, each 7 evaluations:
│                      ½           *                                           
│                  ▂▄▇█▇▆▂▂                                                    
│  ▁▁▂▂▂▂▃▄▄▅▄▄▄▄▅▇█████████▇▅▅▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁ ▃
└  3.9 μs                                                              6.5 μs +

mcabbott avatar Sep 19 '21 22:09 mcabbott

It's hard to tell because the actual data collected are different, but it seems to reprint the median, not the mean? (Trick: use b = @benchmark ... and then toggle back and forth between the two displays with ;git stash and ;git stash pop assuming you're using Revise.)

But in general terms, the new display looks more easily interpretable for most users, and I like the decoration with markers for median and mean in the histogram!

timholy avatar Sep 20 '21 10:09 timholy

It was indeed printing things twice, sorry, now fixed. (~~Weirdly Revise isn't working for me, for the first time ever.~~ my fault!)

mcabbott avatar Sep 20 '21 12:09 mcabbott

LGTM!

timholy avatar Sep 21 '21 09:09 timholy

I really like the ½ and the *. However, personally I'd lay it out slightly differently.

julia> @benchmark sum(log.(x))  setup=(x=rand(1000))
BenchmarkTools.Trial: 10,000 samples, each 7 evaluations:
│  minimum 3.994 μs, median 4.607 μs (½), mean 5.036 μs (*), 99ᵗʰ 6.423 μs
│  GC time: mean 379.234 ns (7.53%), maximum 345.470 μs (98.25%)
│  1 allocations, 7.94 KiB
│                      ½           *                                           
│                  ▂▄▇█▇▆▂▂                                                    
│  ▁▁▂▂▂▂▃▄▄▅▄▄▄▄▅▇█████████▇▅▅▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁ ▃
└  3.9 μs                                                         6.5 μs +
  • Put the two time lines (min/med/mean/99 and GC) together before the memory (allocs) line
  • Move the number of samples back up with the trials
  • Reduce the default width of the histogram to be no wider than the text
  • Use a comma instead of an underscore when printing 10000 (minor stylistic choice)

I haven't worked this into the above, but I think it would be nice to have the stdev of the GC at least. For example, consider these two sets of example GC times

  • 9, 10, 12, 13, 14, 15
  • 0, 0.1, 0.1, 0.2, 15, 60

Both have a mean just over 12, but the first has a stdev of 2.3 while the second has a stdev of 24. So, I consider this information helpful in gauging how variable the GC time is.
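Those figures are easy to check:

```julia
using Statistics: mean, std

a = [9, 10, 12, 13, 14, 15]        # tightly clustered GC times
b = [0, 0.1, 0.1, 0.2, 15, 60]     # mostly near zero, two big outliers

round(mean(a); digits = 1)   # 12.2
round(mean(b); digits = 1)   # 12.6
round(std(a); digits = 1)    # 2.3
round(std(b); digits = 1)    # 24.0
```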

tecosaur avatar Sep 21 '21 10:09 tecosaur

I think GC should go below allocation. We always care about time, and we always want to know if the allocation is larger than we expected (sometimes we expect 0, or N * bytes, etc.). But GC can be something we don't, or can't, worry about (and it can be zero).

Moelf avatar Sep 21 '21 11:09 Moelf

Thanks for taking a close look. To explain my thinking a bit:

  • The GC times are below "2 allocations..." partly to give the 1st line breathing space, and partly so that you get to "0 allocations" before you get to the things which aren't there, when they aren't there. While all the times on the first line are comparable, different measures of the same thing, the GC times are not.

  • One reason to avoid 10,000 is that in many people's first language this is how decimals are written. The other reason is:

julia> 10,000
(10, 0)

Another possibility might be 10^4, since for large numbers it seems to have a strong preference for powers of 10.

  • The reason I moved the number of samples down is again partly breathing space for the 1st line, and partly to put the most important information first. If you want to copy the first few lines, this isn't the most important fact. Finally, the samples are literally what's plotted in the histogram, so it seems nice to say "10_000 samples, here they are:".

  • "stdev of the GC" is well-defined, but is it useful? Most runs usually have 0 GC time, but the point of the mean is to measure how much garbage you are generating per run (a property of your algorithm), regardless of how often the collector cleans up (which it decides, not you). I had it printing the 99th percentile GC time too, for a bit, but this is also usually zero.

julia> b = @benchmark sum(log.(x))  setup=(x=rand(1000));

julia> count(!iszero, b.gctimes), length(b.gctimes)
(7, 10000)

julia> mean(b.gctimes)  # 402 ns
402.1309428571428

julia> std(b.gctimes)  # 15 μs
15374.208785130586

julia> quantile(b.gctimes, 0.99)
0.0
  • The histogram is set to fit in 80 characters for now. I wondered about making it auto-size to match the first line, but then lines with times XXX.YYY will make it longer, so different ones won't line up.

Printing XXX.YYY is certainly too many digits. We could change the format to X.YYY, XX.YY, XXX.Y (whatever that's called), which is still more than is justified but more consistent? Then the times would be fixed-width. I presume the reason this wasn't done is that groups of 3 digits fit the steps between ms, μs, ns?

I also wonder if there should be fewer digits on the histogram ends, partly in the hope that they'd more often line up between subsequent runs. Presently it rounds to 2 significant figures. Rounding to 1 means a step from 1.0 to 2.0, which is graphically too big, but maybe 10 steps per decade can be done nicely; not sure.

  • I made bars with zero counts print in light gray, to distinguish some from none. These will probably disappear on some people's themes (like paths in stack traces / @warn) but that's not super-confusing.

  • I thought the final < should be >, as it's read "less than", but perhaps there is no right way around... perhaps + is less ambiguous? Or something else like ?

mcabbott avatar Sep 21 '21 11:09 mcabbott